textmatrix | R Documentation |
Creates a document-term matrix from all text files in a given directory.
textmatrix( mydir, stemming=FALSE, language="english", minWordLength=2,
            maxWordLength=FALSE, minDocFreq=1, maxDocFreq=FALSE,
            minGlobFreq=FALSE, maxGlobFreq=FALSE, stopwords=NULL,
            vocabulary=NULL, phrases=NULL, removeXML=FALSE,
            removeNumbers=FALSE )

textvector( file, stemming=FALSE, language="english", minWordLength=2,
            maxWordLength=FALSE, minDocFreq=1, maxDocFreq=FALSE,
            stopwords=NULL, vocabulary=NULL, phrases=NULL,
            removeXML=FALSE, removeNumbers=FALSE )
file | filename (may include path). |
mydir | the directory path. |
stemming | boolean indicating whether to reduce all terms to their word stem. |
language | specifies the language for stemming / stop-word removal. |
minWordLength | words with fewer than minWordLength characters will be ignored. |
maxWordLength | words with more than maxWordLength characters will be ignored; per default set to FALSE (no upper limit). |
minDocFreq | words appearing fewer than minDocFreq times within a document will be ignored. |
maxDocFreq | words appearing more than maxDocFreq times within a document will be ignored; per default set to FALSE (no upper limit). |
minGlobFreq | words which appear in fewer than minGlobFreq documents will be ignored. |
maxGlobFreq | words which appear in more than maxGlobFreq documents will be ignored. |
stopwords | a stopword list containing terms that will be ignored. |
vocabulary | a character vector containing the words: only words in this term list will be used for building the matrix (‘controlled vocabulary’). |
removeXML | if set to TRUE, XML tags will be removed. |
removeNumbers | if set to TRUE, numbers will be removed. |
phrases | not yet implemented. |
All documents in the specified directory are read and a matrix is composed. Each cell of the matrix contains the number of occurrences (i.e., the term frequency) of a particular term in a particular document. If specified, simple text preprocessing mechanisms are applied (stemming, stopword filtering, word-length cutoffs).
Stemming uses Porter's snowball stemmer (from the package SnowballC).
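A minimal sketch of these preprocessing options (assuming the lsa and SnowballC packages are installed; the file names and contents are made up for illustration):

library(lsa)

# write two tiny sample documents into a temporary corpus directory
sampledir = tempfile()
dir.create(sampledir)
write( c("dogs", "barking", "123"), file=paste(sampledir, "doc1", sep="/") )
write( c("dog", "cats"), file=paste(sampledir, "doc2", sep="/") )

# stemming collapses "dogs"/"dog" into one term; removeNumbers should drop "123"
textmatrix(sampledir, stemming=TRUE, minWordLength=3, removeNumbers=TRUE)

unlink(sampledir, recursive=TRUE)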
There are two stopword lists included (for English and for German), which are loaded on demand into the variables stopwords_de and stopwords_en. They can be activated by calling data(stopwords_de) or data(stopwords_en). Attention: the stopword lists have to be loaded already when textmatrix() is called.
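A minimal sketch of activating the English list (the file contents are made up for illustration):

library(lsa)

td = tempfile()
dir.create(td)
write( c("the", "cat", "and", "the", "dog"), file=paste(td, "doc1", sep="/") )

# the stopword list must be loaded before textmatrix() is called
data(stopwords_en)
textmatrix(td, stopwords=stopwords_en)   # "the" and "and" are filtered out

unlink(td, recursive=TRUE)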
textvector() is a support function that creates a list of term-in-document occurrences.
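For example (a minimal sketch; the single file is created just for illustration):

library(lsa)

f = tempfile()
write( c("dog", "dog", "cat"), file=f )

# term-in-document occurrences for this single file
textvector(f)

unlink(f)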
For every generated matrix, its own environment is added as an attribute; it holds the triples that are stored with setTriple() and can be retrieved with getTriple().
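A minimal sketch, assuming a setTriple(matrix, subject, predicate, object) / getTriple(matrix, subject, predicate) calling convention (the metadata values are invented for illustration; see the setTriple/getTriple documentation for the authoritative signatures):

library(lsa)

td = tempfile()
dir.create(td)
write( c("dog", "cat"), file=paste(td, "D1", sep="/") )
dtm = textmatrix(td)

# attach a metadata triple to document "D1" and read it back
# (argument order as assumed above; see ?setTriple)
setTriple(dtm, "D1", "has_category", "animals")
getTriple(dtm, "D1", "has_category")

unlink(td, recursive=TRUE)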
If the language is set to "arabic", special characters for the Buckwalter transliteration will be kept.
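For example (a minimal sketch; the tokens stand in for Buckwalter-transliterated Arabic words and are purely illustrative):

library(lsa)

td = tempfile()
dir.create(td)
write( c("$ms", "klAb"), file=paste(td, "doc1", sep="/") )

# with language="arabic", Buckwalter special characters such as "$" are kept
textmatrix(td, language="arabic")

unlink(td, recursive=TRUE)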
textmatrix | the document-term matrix (incl. row and column names). |
Fridolin Wild f.wild@open.ac.uk
wordStem, stopwords_de, stopwords_en, setTriple, getTriple
# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# read them, create a document-term matrix
textmatrix(td)

# read them, drop german stopwords
data(stopwords_de)
textmatrix(td, stopwords=stopwords_de)

# read them based on a controlled vocabulary
voc = c("dog", "mouse")
textmatrix(td, vocabulary=voc, minWordLength=1)

# clean up
unlink(td, recursive=TRUE)
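As a further sketch (assuming the package is attached as above; the cutoff values are chosen purely for illustration), the global and per-document frequency filters can be combined:

# keep only terms occurring in at least 2 documents,
# and ignore terms repeated more than 5 times within a document
td = tempfile()
dir.create(td)
write( c("dog", "cat"), file=paste(td, "D1", sep="/") )
write( c("dog", "mouse"), file=paste(td, "D2", sep="/") )
textmatrix(td, minGlobFreq=2, maxDocFreq=5)   # only "dog" meets minGlobFreq=2

# clean up
unlink(td, recursive=TRUE)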