textmatrix: Textmatrix (Matrices)
In lsa: Latent Semantic Analysis


Description

Creates a document-term matrix from all text files in a given directory.

Usage

textmatrix( mydir, stemming=FALSE, language="english",
   minWordLength=2, maxWordLength=FALSE, minDocFreq=1, 
   maxDocFreq=FALSE, minGlobFreq=FALSE, maxGlobFreq=FALSE, 
   stopwords=NULL, vocabulary=NULL, phrases=NULL, 
   removeXML=FALSE, removeNumbers=FALSE)
textvector( file, stemming=FALSE, language="english", 
   minWordLength=2, maxWordLength=FALSE, minDocFreq=1, 
   maxDocFreq=FALSE, stopwords=NULL, vocabulary=NULL, 
   phrases=NULL, removeXML=FALSE, removeNumbers=FALSE )

Arguments

`file`	filename (may include path).
`mydir`	the directory path (e.g., `"corpus/texts/"`); may be single files/directories or a vector of files/directories.
`stemming`	boolean indicating whether to reduce all terms to their word stem.
`language`	the language used for stemming and stopword removal.
`minWordLength`	words with fewer than minWordLength characters will be ignored.
`maxWordLength`	words with more than maxWordLength characters will be ignored; defaults to `FALSE`, meaning no upper bound.
`minDocFreq`	words appearing fewer than minDocFreq times within a document will be ignored for that document.
`maxDocFreq`	words appearing more than maxDocFreq times within a document will be ignored for that document; defaults to `FALSE`, meaning no upper bound on document frequencies.
`minGlobFreq`	words that appear in fewer than minGlobFreq documents will be ignored.
`maxGlobFreq`	words that appear in more than maxGlobFreq documents will be ignored.
`stopwords`	a stopword list containing terms that will be ignored.
`vocabulary`	a character vector of words: only words in this term list will be used for building the matrix (‘controlled vocabulary’).
`removeXML`	if set to `TRUE`, XML tags (elements, attributes, some special characters) will be removed.
`removeNumbers`	if set to `TRUE`, terms that consist only of numbers will be removed.
`phrases`	not yet implemented.

Details

All documents in the specified directory are read and combined into a single matrix. Each cell holds the number of times a given word occurs in a given document (i.e., the term frequency). If requested, simple text preprocessing steps are applied (stemming, stopword filtering, word-length cutoffs).
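To make the notion of a term-frequency cell concrete, here is a minimal base R sketch of the counting idea. It is an illustration only and not the code used inside lsa; the token vectors, the variable names, and the orientation of the result (terms as rows) are assumptions made for this example.

```r
# Illustration only: base R, not the lsa implementation.
# Two toy "documents" given as character vectors of tokens (made-up data).
docs <- list(
  D1 = c("dog", "cat", "mouse"),
  D3 = c("dog", "monster", "monster")
)

# Shared vocabulary across all documents, sorted alphabetically.
vocab <- sort(unique(unlist(docs)))

# Count each vocabulary term per document; factor() with fixed levels
# makes absent terms show up as 0 instead of being dropped.
tf <- sapply(docs, function(tokens) {
  as.vector(table(factor(tokens, levels = vocab)))
})
rownames(tf) <- vocab

print(tf)
```

In this sketch "monster" gets a count of 2 for D3 and 0 for D1, which is the kind of value each cell of the matrix carries.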

Stemming uses Porter's Snowball stemmer (from the SnowballC package).

Two stopword lists are included (one for English, one for German). They are loaded on demand into the variables stopwords_en and stopwords_de by calling data(stopwords_en) or data(stopwords_de). Note: a stopword list must already be loaded when textmatrix() is called.

textvector() is a support function that creates a list of term-in-document occurrences.
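A short sketch of calling textvector() on a single file, assuming the lsa package is installed. The file contents are made up, and the result is inspected with str() rather than assuming a particular layout.

```r
library(lsa)

# Write a small throwaway file (contents are made up for this sketch).
f <- tempfile()
write(c("dog", "monster", "monster"), file = f)

# Term occurrences for this single file.
tv <- textvector(f)
str(tv)   # inspect the returned structure rather than assuming its layout

# clean up
unlink(f)
```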

Every generated matrix gets its own environment, added as an attribute. This environment holds the triples that are stored with setTriple() and can be retrieved with getTriple().
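The following sketch shows one way triples might be attached to a generated matrix, assuming the lsa package is installed and that setTriple() takes the matrix, a subject, a predicate, and an object, with getTriple() taking the matrix, a subject, and a predicate. The predicate name "has_category" and the value 15 are hypothetical.

```r
library(lsa)

# Build a tiny corpus in a temporary directory (made-up contents).
td <- tempfile()
dir.create(td)
write(c("dog", "cat", "mouse"), file = file.path(td, "D1"))
write(c("hamster", "mouse", "sushi"), file = file.path(td, "D2"))

tm <- textmatrix(td)

# Attach a triple: subject "D1", predicate "has_category", object 15.
# The predicate name and the value are hypothetical.
setTriple(tm, "D1", "has_category", 15)

# Retrieve the object(s) stored for that subject and predicate.
res <- getTriple(tm, "D1", "has_category")
print(res)

# clean up
unlink(td, recursive = TRUE)
```

Because the triples live in an environment attached to the matrix, setTriple() is called for its side effect and the matrix does not need to be reassigned.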

If the language is set to "arabic", special characters for the Buckwalter transliteration will be kept.

Value

textmatrix

the document-term matrix (incl. row and column names).

Author(s)

Fridolin Wild f.wild@open.ac.uk

Examples


library(lsa)

# create some files
td = tempfile()
dir.create(td)
write( c("dog", "cat", "mouse"), file=paste(td, "D1", sep="/") )
write( c("hamster", "mouse", "sushi"), file=paste(td, "D2", sep="/") )
write( c("dog", "monster", "monster"), file=paste(td, "D3", sep="/") )

# read them, create a document-term matrix
textmatrix(td)

# read them, drop German stopwords
data(stopwords_de)
textmatrix(td, stopwords=stopwords_de)

# read them based on a controlled vocabulary
voc = c("dog", "mouse")
textmatrix(td, vocabulary=voc, minWordLength=1)

# clean up
unlink(td, recursive=TRUE)

lsa documentation built on May 9, 2022, 9:10 a.m.