JSTOR_clusterbywords: Cluster documents by similarities in word frequencies


Description

Generates plots visualizing the results of different clustering methods applied to the documents. For use with JSTOR's Data for Research datasets (http://dfr.jstor.org/). For best results, run the function several times, each time adding frequently occurring but uninformative words to the stopword list and excluding them using the JSTOR_removestopwords function.
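As an illustrative sketch of that iterative workflow (a hypothetical session; the documented custom_stopwords argument is used here to exclude the extra words, since the signature of JSTOR_removestopwords is not shown on this page):

# First pass: inspect the clusters and note common but uninformative words
cl <- JSTOR_clusterbywords(nouns, "pirates")
cl$p
# Second pass: exclude those words and re-run
cl <- JSTOR_clusterbywords(nouns, "pirates",
                           custom_stopwords = c("also", "new", "may"))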

Usage

JSTOR_clusterbywords(nouns, word, custom_stopwords = NULL, f = 0.01)

Arguments

nouns

The object returned by the function JSTOR_dtmofnouns: a corpus containing the documents, with stopwords removed.

word

The word, or character vector of words, to subset the documents by; i.e., only documents containing this word (or these words) are used in the cluster analysis.

custom_stopwords

A character vector of stop words to use in addition to the default set supplied by the tm package.

f

A scalar value to filter the total number of words used in the cluster analyses. For each document, the count of each word is divided by the total number of words in that document, expressing the word's frequency as a proportion of all words in that document. This parameter is a threshold on the sum of these proportions over all documents (i.e., the column sums of the document term matrix). If f = 0.01, only words that constitute at least 1.0 percent of all words in all documents are used in the cluster analyses.
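As a minimal sketch of the filtering arithmetic described above (illustrative only, not the package's internal code; it assumes a tm DocumentTermMatrix of raw counts):

library(tm)
docs <- VCorpus(VectorSource(c("pirates sail the seas",
                               "privateers and pirates carry letters")))
dtm <- as.matrix(DocumentTermMatrix(docs))
# Convert raw counts to within-document proportions (each row sums to 1)
props <- dtm / rowSums(dtm)
# Sum each word's proportions over all documents (the column sums)
# and keep only words whose summed proportion reaches the threshold f
f <- 0.01
keep <- colSums(props) >= f
dtm_filtered <- dtm[, keep, drop = FALSE]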

Value

Returns plots of clusters of documents, and data frames of the affinity propagation clustering, k-means, and PCA outputs. The plots can be accessed and displayed using the $ operator, for example cl1$p or plot(cl1$cl_plot).

Examples

## cl1 <- JSTOR_clusterbywords(nouns, "pirates")
## cl2 <- JSTOR_clusterbywords(nouns, c("pirates", "privateers"))
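A further hypothetical example, showing the optional arguments and accessing the returned plots (element names as described under Value):

## cl3 <- JSTOR_clusterbywords(nouns, "pirates",
##                             custom_stopwords = c("also", "may"),
##                             f = 0.005)
## cl3$p               # display one of the cluster plots
## plot(cl3$cl_plot)   # display another, via plot()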
