vocabularyTable: Vocabulary summary table
In RcmdrPlugin.temis: Graphical Integrated Text Mining Solution

vocabularyTable

R Documentation

Vocabulary summary table

Description

Build a table summarizing vocabulary, optionally over a variable.

Usage

vocabularyTable(termsDtm, wordsDtm, variable = NULL, unit = c("document", "global"))

Arguments

`termsDtm`	A document-term matrix containing terms (i.e. extracted from a possibly stemmed corpus).
`wordsDtm`	A document-term matrix containing words (i.e. extracted from a plain corpus).
`variable`	A vector with one element per document indicating to which category it belongs. If `NULL`, per-document measures are returned.
`unit`	When `variable` is not `NULL`, defines the way measures are aggregated (see below).

Details

This function builds a table providing several vocabulary measures for each document or each category of documents in the corpus:

total number of terms
number and percent of unique terms (i.e. appearing at least once)
number and percent of hapax legomena (i.e. terms appearing once and only once)
total number of words
number and percent of long words (“long” being defined as “at least seven characters”)
number and percent of very long words (“very long” being defined as “at least ten characters”)
average word length
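
The term-based measures in this list can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy count matrix and all object names (`terms`, `n_total`, `n_unique`, `n_hapax`, `term_measures`) are hypothetical, and a plain matrix stands in for a real document-term matrix.

```r
# Hypothetical term counts: rows are documents, columns are (stemmed) terms.
terms <- matrix(c(2, 1, 0, 1,
                  0, 3, 1, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"),
                                c("analys", "text", "corpus", "word")))

n_total  <- rowSums(terms)       # total number of term occurrences
n_unique <- rowSums(terms > 0)   # terms appearing at least once
n_hapax  <- rowSums(terms == 1)  # terms appearing once and only once

# Percentages use the total number of terms as denominator.
term_measures <- cbind(total      = n_total,
                       unique     = n_unique,
                       pct_unique = 100 * n_unique / n_total,
                       hapax      = n_hapax,
                       pct_hapax  = 100 * n_hapax / n_total)
print(term_measures)
```

With these counts, `doc1` has 4 terms, 3 of them unique (75%) and 2 hapax legomena (50%).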

Words are defined as the forms of two or more characters present in the texts before stemming and stopword removal. By contrast, unique terms are extracted from the global document-term matrix, which means that they do not include words removed by treatments run at the import step, and that words which differ in the original text might become identical terms if stemming was performed. This can be considered the “correct” measure, since the purpose of corpus processing is exactly that: mark different forms of the same term as similar to allow for statistical analyses.

Please note that percentages for terms and words are computed with regard respectively to the total number of terms and of words, so the denominators are not the same for all measures. See vocabularyDlg.
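
As a rough illustration of the word-based measures and their denominator, the following base R sketch uses a hypothetical word count matrix. It assumes that long words are counted by occurrence and that average word length is weighted by occurrence; the package may compute these differently. All object names are made up for the example.

```r
# Hypothetical word counts: rows are documents, columns are unstemmed words.
words <- matrix(c(3, 1, 2,
                  1, 0, 4),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"),
                                c("text", "analysis", "statistics")))

len <- nchar(colnames(words))  # word lengths: 4, 8, 10

n_words     <- rowSums(words)                              # total number of words
n_long      <- rowSums(words[, len >= 7,  drop = FALSE])   # at least seven characters
n_very_long <- rowSums(words[, len >= 10, drop = FALSE])   # at least ten characters

# Percentages for word measures use the total number of words,
# not the total number of terms, as denominator.
pct_long      <- 100 * n_long / n_words
pct_very_long <- 100 * n_very_long / n_words

# Average word length, weighted by the number of occurrences (assumption).
avg_len <- drop(words %*% len) / n_words
print(cbind(n_words, n_long, pct_long, n_very_long, pct_very_long, avg_len))
```

Because "at least seven characters" includes words of ten characters or more, very long words are also counted among long words in this sketch.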

When variable is not NULL, unit defines two different ways of aggregating per-document statistics into per-category measures:

document: Values computed for each document are simply averaged for each category.
global: Values are computed for each category taken as a whole: word counts are summed for each category, and ratios and averages are calculated at this level only, from the summed counts.
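
The difference between the two aggregation modes can be sketched in base R for a single measure, the percent of long words. The counts, the category vector and all object names below are hypothetical; this only illustrates the two definitions above and is not the package's code.

```r
# Hypothetical per-document counts and a category for each document.
n_words  <- c(doc1 = 10, doc2 = 40, doc3 = 20)  # total words
n_long   <- c(doc1 = 5,  doc2 = 10, doc3 = 5)   # long words
category <- c("A", "A", "B")

# unit = "document": compute the ratio per document, then average by category.
pct_long_doc <- 100 * n_long / n_words
by_document  <- tapply(pct_long_doc, category, mean)

# unit = "global": sum the counts by category, then compute the ratio once.
by_global <- 100 * tapply(n_long, category, sum) /
                   tapply(n_words, category, sum)

print(rbind(document = by_document, global = by_global))
```

For category A the two modes disagree (37.5% against 30%) because the "document" mode gives each document the same weight, while the "global" mode weights documents by their length. For category B, which contains a single document, both give 25%.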

In both cases, the “Corpus” column follows the above definition.
