characteristic_docs: characteristic_docs

View source: R/corpus.R

Description

Print documents which are the most characteristic of each level of a variable, i.e. those with the lowest Chi-squared distance to the average vocabulary of documents belonging to that level.
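The ranking idea can be illustrated with a minimal base R sketch. This is not the R.temis implementation: the toy counts, the `group` vector, and the helper `chisq_dist_to_level()` are hypothetical. It also assumes that the "average vocabulary" of a level is the pooled term profile of its documents and that squared differences are weighted by the inverse of each term's overall share, as in correspondence analysis.

```r
# Toy document-term counts (hypothetical data): 4 documents, 3 terms
counts <- matrix(c(4, 1, 0,
                   6, 4, 0,
                   0, 1, 5,
                   1, 0, 4),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:4), c("oil", "price", "bank")))
# Hypothetical grouping variable: one level per document
group <- c("a", "a", "b", "b")

# Row profiles: each document's term counts divided by its length
profiles <- counts / rowSums(counts)
# Column masses: overall share of each term in the whole corpus
col_mass <- colSums(counts) / sum(counts)

# Hypothetical helper: chi-squared distance of each document of a level
# to the pooled term profile of that level, sorted from closest to farthest
chisq_dist_to_level <- function(level) {
  in_level <- group == level
  level_counts <- counts[in_level, , drop = FALSE]
  # Assumed definition of the "average vocabulary": pooled term profile
  avg <- colSums(level_counts) / sum(level_counts)
  # Difference between each document profile and the level profile
  diff <- sweep(profiles[in_level, , drop = FALSE], 2, avg)
  # Weight squared differences by the inverse column mass, then sum per document
  d2 <- rowSums(sweep(diff^2, 2, col_mass, "/"))
  sort(sqrt(d2))
}

chisq_dist_to_level("a")
```

Under these assumptions, the documents at the start of the returned vector are the ones that would count as most characteristic of the level.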

Usage

characteristic_docs(corpus, dtm, variable, ndocs = 10, nterms = 25, p = 0.1)

Arguments

corpus

A Corpus object.

dtm

A DocumentTermMatrix object corresponding to corpus.

variable

A vector of values giving the groups (levels) for which the most characteristic documents should be reported, with one value per document in corpus.

ndocs

The number of (most characteristic) documents to print.

nterms

The number of terms to highlight in documents.

p

The maximum p-value up to which specific terms should be highlighted.

Details

Occurrences of the nterms most specific terms for each level are highlighted. If stemming or other transformations have been applied to original words using combine_terms, all original words which have been transformed to the specified terms are highlighted.

Value

A list with one Corpus object for each level (invisibly).

Examples

file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
dtm <- build_dtm(corpus)
characteristic_docs(corpus, dtm, meta(corpus)$Date)

# Also works when terms have been combined
dict <- dictionary(dtm)
dtm2 <- combine_terms(dtm, dict)
characteristic_docs(corpus, dtm2, meta(corpus)$Date)
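
The optional arguments and the invisible return value can be exercised as in the following sketch. It reuses the sample file from the examples above and only runs when both packages are installed. The argument values (ndocs = 3, nterms = 10, p = 0.05) are arbitrary illustrations, not recommendations.

```r
res <- NULL
if (requireNamespace("R.temis", quietly = TRUE) &&
    requireNamespace("tm.plugin.factiva", quietly = TRUE)) {
  library(R.temis)
  file <- system.file("texts", "reut21578-factiva.xml", package = "tm.plugin.factiva")
  corpus <- import_corpus(file, "factiva", language = "en")
  dtm <- build_dtm(corpus)
  # Print fewer documents, highlight fewer terms, and use a stricter p-value.
  # The result is returned invisibly, so assign it to keep it.
  res <- characteristic_docs(corpus, dtm, meta(corpus)$Date,
                             ndocs = 3, nterms = 10, p = 0.05)
  # Per the Value section, this should be a list with one Corpus per level
  length(res)
}
```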

R.temis documentation built on May 13, 2021, 1:08 a.m.