vocabularyDlg: Vocabulary Summary

Description

Build vocabulary summary table over documents or a meta-data variable of a corpus.

Details

This dialog creates tables that report several vocabulary measures for each document of a corpus, or for each category of a corpus variable.

Words are defined as the forms of two or more characters present in the texts before stemming and stopword removal. By contrast, unique terms are extracted from the global document-term matrix: they exclude words removed by the processing run at the import step, and words that differ in the original text may become identical terms if stemming was performed. The term-based count can be considered the “correct” measure, since this is exactly the purpose of corpus processing: to treat different forms of the same term as equivalent so that statistical analyses are possible.
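The difference between words and terms can be sketched in base R. This is only an illustration under stated assumptions: the sample text, the stopword list and the `toy_stem()` helper (which merely strips a final "s") are hypothetical, and do not reproduce the processing the package applies at the import step.

```r
# Toy text; the stopword list and the "stemmer" below are illustrative
# assumptions, not the processing actually applied by the package.
text <- "The cats chase the cat and a dog chases dogs"

# Words: lower-cased forms of two or more characters, before any processing
tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
words <- tokens[nchar(tokens) >= 2]

# Terms: what remains after stopword removal and (very crude) stemming
stopwords <- c("the", "and", "a")
toy_stem <- function(x) sub("s$", "", x)  # hypothetical stand-in for a stemmer
terms <- toy_stem(words[!words %in% stopwords])

length(words)          # 9 words ("a" is dropped as a one-character form)
length(unique(words))  # 8 distinct word forms
length(unique(terms))  # 3 distinct terms: "cat", "chase", "dog"
```

Here "cats" and "cat", which count as two different words, collapse into a single term once the toy stemming is applied, which is why term counts are lower than word counts.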

Two different units can be selected for the analysis. If “Document” is selected, the values reported for each level are the means of the values for each of its documents; a mean column for the whole corpus is also provided. If “Level” is selected, the values are computed with the level taken as a single document: the sum of the number of terms over the level's documents, the percentage of terms (ratio of the summed numbers of terms), and the average word length. Both versions of these measures are legitimate, but they call for different interpretations that should not be confused; by contrast, the interpretation of the summed or mean number of (long) terms is immediate.
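The two units can give different percentages for the same level, as the following base R sketch suggests. The counts and the column names (`terms`, `long`) are made up for illustration; this is not the package's own computation, only the arithmetic described above applied to the percentage of long terms.

```r
# Hypothetical per-document counts for two levels of a corpus variable
docs <- data.frame(
  level = c("A", "A", "B", "B"),
  terms = c(100, 300, 50, 50),  # total number of terms in each document
  long  = c(50, 90, 40, 20)     # number of long terms in each document
)
docs$pct <- 100 * docs$long / docs$terms

# "Document" unit: mean of the per-document percentages within each level
by_document <- tapply(docs$pct, docs$level, mean)

# "Level" unit: pool the counts first, then take the ratio of the sums
by_level <- 100 * tapply(docs$long, docs$level, sum) /
  tapply(docs$terms, docs$level, sum)

by_document  # A = 40, B = 60
by_level     # A = 35, B = 60
```

For level A the longer document weighs more once counts are pooled, so the “Level” value (35) differs from the “Document” value (40). For level B the two documents have the same length, and both units agree.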

This distinction does not apply when documents (rather than levels of a variable) are used as the unit of analysis: in this case, “level” in the above explanation corresponds to “document”, and two columns are provided for the whole corpus. “Corpus mean” is simply the average value of each measure over all documents; “Corpus total” reports the sum of the number of terms, the percentage of terms (ratio of the summed numbers of terms) and the average word length in the corpus when taken as a single document. See vocabularyTable for more details.

See Also

vocabularyTable, setCorpusVariables, meta, DocumentTermMatrix, table, barchart


RcmdrPlugin.temis documentation built on May 2, 2019, 5:21 p.m.