In kasperwelbers/corpus-tools: tools for corpus analysis

library(knitr)
opts_chunk$set(fig.path = "figures_dtm/")

The Document-Term Matrix (dtm)

For many methods of text analysis, specifically the so-called bag-of-word approaches, the common data structure for the corpus is a document-term matrix (dtm). This is a matrix in which the rows represent documents and the columns represent terms (e.g., words, lemmata). The values represent how often each word occured in each document.

For example, in this dtm we can see how often the words 'spam', 'bacon', and 'egg', occured in each of the 4 documents:

m = matrix(sample(0:2, replace=T, 12), ncol=3, dimnames=list(documents=c('doc1','doc2','doc3','doc4'), terms=c('spam','bacon','egg')))
m

Commonly, this matrix is very sparse. That is, a large majority of the values will be zero, since documents generally only use a small portion of the words used in the entire corpus. For efficient computing and data storage, it is therefore worthwhile to not simply use a matrix, but a type of sparse matrix. We suggest the use of the DocumentTermMatrix class that is available in the tm (text mining) package. This is a type of sparse matrix dedicated to text analysis.

library(tm)
dtm = as.DocumentTermMatrix(m, weight=weightTf)
dtm

Since the tm package is quite popular, this dtm class is compatible with various packages for text analysis. For many of the functions offered in the corpustools package we also use this dtm class as a starting point. In the next steps of this howto we provide some pointers for importing your data and transforming it into a dtm.

Creating the Document-Term Matrix from full-text

If your data is in common, unprocessed textual form (such as this document) then a good approach is to use the tm package. tm offers various ways to import texts to the tm corpus structure. Additionally, it offers various functions to pre-process the data. For instance, removing stop-words, making everything lowercase and reducing words to their stem. From the corpus format the texts can then easily be tokenized (i.e. cut up into terms) and made into a dtm. If this is your weapon of choice, we gladly refer to the tm package vignette.

Another option is using the create_matrix command from the RTextTools package, which is a convenient wrapper for tm with stemming and stop-word removal:

library(RTextTools)
create_matrix(c("One document", "And another"), removeStopwords=T, stemWords=T, language="english")

Creating the Document-Term Matrix from tokens

It could also be that you prefer to pre-process and tokenize your data outside of R; possibly using more advanced methods such as lemmatizing and part-of-speech tagging, (that to our knowledge are not supported well within R). For reference, an open-source software package to accommodate this process that we are affiliated with is AmCAT. A howto for creating a DTM using amcat can be found here

If data has been pre-processed and tokenized elsewhere, you can usually export a list of words in a csv like format (e.g. the standardized conll format). This can be imported into R is as a data.frame (e.g., using read.csv), in which each row represents how often a token occured within a document.

In the following example, we use the sotu data set provided with this package, which contains a list of tokens produced by using Stanford's CoreNLP preprocessing on the State of the Union addresses by presidents (George W.) Bush and Obama

library(corpustools)
data(sotu)
head(sotu.tokens)

To our awareness, tm has no functions to directly transform this data format into a DocumentTermMatrix, but it can quite easily be casted into a sparse matrix, and then transformed into a dtm. The dtm.create function, offered in the corpustools package, can do this for you. For our example, we use the lemma of the words. In the dtm.create function we simply give the vector for the lemma, together with a vector indicating in which documents the lemma occured (aid).

dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma)
dtm

Since we are normally not interested in punctuation, prepositions and the like, we can filter the data set on part-of-speech tag, for example keeping only the nouns:

tokens = sotu.tokens[sotu.tokens$pos1 == 'N', ]
dtm = dtm.create(documents=tokens$aid, terms=tokens$lemma)
dtm

As you can see, we now have 2210 unique terms rather than 5264. One of the cool things to do with a dtm is create a word cloud, which is made easy by the dtm.wordcloud function:

t = term.statistics(dtm)
dtm.wordcloud(dtm)

For more information on how to work with dtms using corpustools, please see the other howto's:

kasperwelbers/corpus-tools documentation built on May 20, 2019, 7:37 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

kasperwelbers/corpus-tools
tools for corpus analysis

In kasperwelbers/corpus-tools: tools for corpus analysis

The Document-Term Matrix (dtm)

Creating the Document-Term Matrix from full-text

Creating the Document-Term Matrix from tokens

R Package Documentation

Browse R Packages

We want your feedback!

kasperwelbers/corpus-tools tools for corpus analysis

In kasperwelbers/corpus-tools: tools for corpus analysis

The Document-Term Matrix (dtm)

Creating the Document-Term Matrix from full-text

Creating the Document-Term Matrix from tokens

R Package Documentation

Browse R Packages

We want your feedback!

kasperwelbers/corpus-tools
tools for corpus analysis