View source: R/compare_corpora.r
compare_corpus | R Documentation |
Compare tCorpus vocabulary to that of another (reference) tCorpus
compare_corpus(
tc,
tc_y,
feature,
smooth = 0.1,
min_ratio = NULL,
min_chi2 = NULL,
is_subset = F,
yates_cor = c("auto", "yes", "no"),
what = c("freq", "docfreq", "cooccurrence")
)
tc |
a tCorpus |
tc_y |
the reference tCorpus |
feature |
the column name of the feature that is to be compared |
smooth |
Laplace smoothing is used for the calculation of the probabilities. Here you can set the added (pseudocount) value. |
min_ratio |
threshold for the ratio value, which is the ratio of the relative frequency of a term in tc to its relative frequency in tc_y |
min_chi2 |
threshold for the chi^2 value |
is_subset |
Specify whether tc is a subset of tc_y. In this case, the term frequencies of tc will be subtracted from the term frequencies in tc_y |
yates_cor |
mode for using Yates' correction in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case Cochran's rule is used to determine whether Yates' correction is applied. |
what |
choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N) |
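The statistics described by the arguments above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the toy counts are invented, the helper `chi2_2x2` is hypothetical, the exact smoothing formula used by `compare_corpus` may differ, and "auto" here uses one common reading of Cochran's rule for a 2x2 table (apply Yates' correction when any expected cell count is below 5).

```r
# Toy counts, invented for illustration only.
freq_x   <- c(economy = 30, war = 5, health = 15)   # corpus of interest
freq_all <- c(economy = 50, war = 45, health = 30)  # full corpus that contains x

# is_subset-style adjustment: remove x's counts so the reference excludes x.
freq_y <- freq_all - freq_x

# Laplace smoothing: add a pseudocount to every term before normalising.
smooth <- 0.1
p_x <- (freq_x + smooth) / sum(freq_x + smooth)
p_y <- (freq_y + smooth) / sum(freq_y + smooth)
ratio <- p_x / p_y  # > 1 means the term is relatively more frequent in x

# chi^2 for a 2x2 table: (term, other terms) by (x, y).
# Hypothetical helper, not part of corpustools.
chi2_2x2 <- function(x_term, y_term, x_other, y_other,
                     yates = c("auto", "yes", "no")) {
  yates <- match.arg(yates)
  n <- x_term + y_term + x_other + y_other
  if (yates == "auto") {
    # Expected counts under independence: row total * column total / n.
    expected <- outer(c(x_term + y_term, x_other + y_other),
                      c(x_term + x_other, y_term + y_other)) / n
    yates <- if (any(expected < 5)) "yes" else "no"
  }
  delta <- abs(x_term * y_other - y_term * x_other)
  if (yates == "yes") delta <- max(delta - n / 2, 0)  # Yates' continuity correction
  n * delta^2 / ((x_term + y_term) * (x_other + y_other) *
                 (x_term + x_other) * (y_term + y_other))
}

n_x <- sum(freq_x)
n_y <- sum(freq_y)
chi2 <- mapply(function(a, b) chi2_2x2(a, b, n_x - a, n_y - b),
               freq_x, freq_y)

data.frame(feature = names(freq_x), ratio = round(ratio, 2), chi2 = round(chi2, 2))
```

In this sketch, thresholds such as `min_ratio` and `min_chi2` would correspond to keeping only the rows whose `ratio` or `chi2` exceed the given value.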
A vocabularyComparison object
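## create a tCorpus and preprocess the tokens into a 'feature' column
tc = create_tcorpus(sotu_texts, doc_column = 'id')
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)
## split off one subcorpus per president (copy=TRUE leaves tc unchanged)
obama = tc$subset_meta(president == 'Barack Obama', copy=TRUE)
bush = tc$subset_meta(president == 'George W. Bush', copy=TRUE)
## compare the vocabulary of the full corpus to the Bush subcorpus
comp = compare_corpus(tc, bush, 'feature')
## sort by chi^2, highest first
comp = comp[order(-comp$chi2),]
head(comp)
plot(comp)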
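A further usage sketch, showing the thresholds and `is_subset` from the signature in action. It assumes the corpustools package and its bundled `sotu_texts` data are available; the threshold values are arbitrary illustration values, and the object names `comp_strict` and `comp_sub` are made up for this example.

```r
library(corpustools)

tc = create_tcorpus(sotu_texts, doc_column = 'id')
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)
obama = tc$subset_meta(president == 'Barack Obama', copy = TRUE)
bush = tc$subset_meta(president == 'George W. Bush', copy = TRUE)

## compare the two subcorpora directly, applying thresholds on the
## ratio and chi^2 values (arbitrary values, for illustration)
comp_strict = compare_corpus(obama, bush, 'feature', min_ratio = 2, min_chi2 = 5)
head(comp_strict)

## compare a subcorpus to the full corpus it was taken from; is_subset = TRUE
## asks for the frequencies of obama to be subtracted from those of tc
comp_sub = compare_corpus(obama, tc, 'feature', is_subset = TRUE)
head(comp_sub)
```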