library(knitr) opts_chunk$set(fig.path = "figures_dtm/")
For many methods of text analysis, specifically the so-called bag-of-word approaches, the common data structure for the corpus is a document-term matrix (dtm). This is a matrix in which the rows represent documents and the columns represent terms (e.g., words, lemmata). The values represent how often each word occured in each document.
For example, in this dtm we can see how often the words 'spam', 'bacon', and 'egg', occured in each of the 4 documents:
m = matrix(sample(0:2, replace=T, 12), ncol=3, dimnames=list(documents=c('doc1','doc2','doc3','doc4'), terms=c('spam','bacon','egg'))) m
Commonly, this matrix is very sparse. That is, a large majority of the values will be zero, since documents generally only use a small portion of the words used in the entire corpus. For efficient computing and data storage, it is therefore worthwhile to not simply use a matrix, but a type of sparse matrix. We suggest the use of the DocumentTermMatrix
class that is available in the tm
(text mining) package. This is a type of sparse matrix dedicated to text analysis.
library(tm) dtm = as.DocumentTermMatrix(m, weight=weightTf) dtm
Since the tm
package is quite popular, this dtm class is compatible with various packages for text analysis. For many of the functions offered in the corpustools
package we also use this dtm class as a starting point. In the next steps of this howto we provide some pointers for importing your data and transforming it into a dtm.
If your data is in common, unprocessed textual form (such as this document) then a good approach is to use the tm
package. tm
offers various ways to import texts to the tm
corpus structure. Additionally, it offers various functions to pre-process the data. For instance, removing stop-words, making everything lowercase and reducing words to their stem. From the corpus
format the texts can then easily be tokenized (i.e. cut up into terms) and made into a dtm. If this is your weapon of choice, we gladly refer to the tm package vignette.
Another option is using the create_matrix
command from the RTextTools package, which is a convenient wrapper for tm
with stemming and stop-word removal:
library(RTextTools) create_matrix(c("One document", "And another"), removeStopwords=T, stemWords=T, language="english")
It could also be that you prefer to pre-process and tokenize your data outside of R; possibly using more advanced methods such as lemmatizing and part-of-speech tagging, (that to our knowledge are not supported well within R). For reference, an open-source software package to accommodate this process that we are affiliated with is AmCAT. A howto for creating a DTM using amcat can be found here
If data has been pre-processed and tokenized elsewhere, you can usually export a list of words in a csv like format (e.g. the standardized conll format). This can be imported into R is as a data.frame (e.g., using read.csv), in which each row represents how often a token occured within a document.
In the following example, we use the sotu
data set provided with this package, which contains a list of tokens produced by using Stanford's CoreNLP preprocessing on the State of the Union addresses by presidents (George W.) Bush and Obama
library(corpustools) data(sotu) head(sotu.tokens)
To our awareness, tm
has no functions to directly transform this data format into a DocumentTermMatrix
, but it can quite easily be casted into a sparse matrix, and then transformed into a dtm. The dtm.create
function, offered in the corpustools
package, can do this for you. For our example, we use the lemma of the words. In the dtm.create
function we simply give the vector for the lemma, together with a vector indicating in which documents the lemma occured (aid).
dtm = dtm.create(documents=sotu.tokens$aid, terms=sotu.tokens$lemma) dtm
Since we are normally not interested in punctuation, prepositions and the like, we can filter the data set on part-of-speech tag, for example keeping only the nouns:
tokens = sotu.tokens[sotu.tokens$pos1 == 'N', ] dtm = dtm.create(documents=tokens$aid, terms=tokens$lemma) dtm
As you can see, we now have 2210 unique terms rather than 5264.
One of the cool things to do with a dtm is create a word cloud, which is made easy by the dtm.wordcloud
function:
t = term.statistics(dtm) dtm.wordcloud(dtm)
For more information on how to work with dtm
s using corpustools, please see the other howto's:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.