compare_subset: Compare vocabulary of a subset of a tCorpus to the rest of...


View source: R/compare_corpora.r

Description

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

Usage

compare_subset(tc, feature, subset_x = NULL, subset_meta_x = NULL,
  query_x = NULL, query_feature = "token", smooth = 0.1,
  min_ratio = NULL, min_chi2 = NULL, yates_cor = c("auto", "yes", "no"),
  what = c("freq", "docfreq", "cooccurrence"))

Arguments

tc

a tCorpus

feature

the column name of the feature that is to be compared

subset_x

an expression to subset the tCorpus. The vocabulary of the subset will be compared to the rest of the tCorpus

subset_meta_x

like subset_x, but using the meta data

query_x

like subset_x, but using a query search to select documents (see tCorpus$search_contexts)

query_feature

if query_x is used, the column name of the feature used in the query search.

smooth

the value added for Laplace smoothing, which is used when calculating the ratio of the relative term frequencies

min_ratio

threshold for the ratio value, which is the ratio of the relative frequency of a term in the subset (x) to its relative frequency in the rest of the tCorpus (y)

min_chi2

threshold for the chi^2 value

yates_cor

mode for using Yates' correction in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case Cochran's rule is used to determine whether Yates' correction is applied.

what

choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N)
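
The smooth, min_ratio, min_chi2 and yates_cor arguments all refer to two per-term statistics: a smoothed frequency ratio and a chi^2 score. The base R sketch below illustrates how such statistics can be computed from raw term counts. It is not the corpustools implementation: compare_counts is a hypothetical helper, the smoothing denominator (total count plus smooth times the number of terms) is an assumption, and "auto" is interpreted here as one common reading of Cochran's rule for a 2x2 table (apply Yates' correction when any expected cell count is below 5).

```r
# Hypothetical helper, NOT part of corpustools: compares term counts in a
# subset (x) with term counts in the rest of the corpus (y).
compare_counts <- function(freq_x, freq_y, smooth = 0.1,
                           yates_cor = c("auto", "yes", "no")) {
  yates_cor <- match.arg(yates_cor)
  freq_x <- as.numeric(freq_x)  # avoid integer overflow in the products below
  freq_y <- as.numeric(freq_y)
  total_x <- sum(freq_x)
  total_y <- sum(freq_y)
  n_terms <- length(freq_x)

  # Laplace (additive) smoothing: add `smooth` to every term count.
  # Assumption: the denominator grows by smooth * number of terms.
  p_x <- (freq_x + smooth) / (total_x + smooth * n_terms)
  p_y <- (freq_y + smooth) / (total_y + smooth * n_terms)
  ratio <- p_x / p_y

  # One 2x2 table per term: (term, other terms) by (x, y).
  term_x  <- freq_x
  term_y  <- freq_y
  other_x <- total_x - freq_x
  other_y <- total_y - freq_y
  n <- total_x + total_y
  row_term  <- term_x + term_y
  row_other <- other_x + other_y

  # Smallest expected cell count, used for the "auto" decision.
  min_expected <- pmin(row_term * total_x, row_term * total_y,
                       row_other * total_x, row_other * total_y) / n
  use_yates <- switch(yates_cor,
                      yes  = rep(TRUE, n_terms),
                      no   = rep(FALSE, n_terms),
                      auto = min_expected < 5)  # assumed reading of Cochran's rule

  # Chi^2 for a 2x2 table; Yates' correction subtracts n/2 from |ad - bc|.
  d <- abs(term_x * other_y - term_y * other_x)
  d <- ifelse(use_yates, pmax(d - n / 2, 0), d)
  chi2 <- n * d^2 / (row_term * row_other * total_x * total_y)

  data.frame(ratio = ratio, chi2 = chi2, yates = use_yates)
}

# Toy counts for three terms in the subset (x) and in the rest (y).
res <- compare_counts(freq_x = c(10, 5, 0), freq_y = c(20, 50, 30))
res
```

In this sketch, a threshold such as min_ratio or min_chi2 would correspond to keeping only the rows of the result whose ratio or chi2 value is at least the given threshold.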

Value

A vocabularyComparison object

Examples

tc = create_tcorpus(sotu_texts, doc_column = 'id')

tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)

comp = compare_subset(tc, 'feature', subset_meta_x = president == 'Barack Obama')
comp = comp[order(-comp$chi2),]
head(comp)
## Not run: 
plot(comp)

## End(Not run)

comp = compare_subset(tc, 'feature', query_x = 'terroris*')
comp = comp[order(-comp$chi2),]
head(comp, 10)

kasperwelbers/corpustools documentation built on Sept. 1, 2018, 1:03 p.m.