create_matrix: creates a document-term matrix to be passed into...


View source: R/create_matrix.R

Description

Creates an object of class DocumentTermMatrix from tm that can be used in the create_container function.

Usage

create_matrix(textColumns, language="english", minDocFreq=1, maxDocFreq=Inf,
    minWordLength=3, maxWordLength=Inf, ngramLength=1, originalMatrix=NULL,
    removeNumbers=FALSE, removePunctuation=TRUE, removeSparseTerms=0,
    removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE,
    weighting=weightTf)

Arguments

textColumns

Either a character vector (e.g. data$Title) or a cbind() of columns to use for training the algorithms (e.g. cbind(data$Title,data$Subject)).

language

The language to be used for stemming the text data and for removing stopwords.

minDocFreq

The minimum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.

maxDocFreq

The maximum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details.

minWordLength

The minimum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.

maxWordLength

The maximum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details.

ngramLength

The number of words to include per n-gram for the document-term matrix.

originalMatrix

The original DocumentTermMatrix used to train the models. If supplied, will adjust the new matrix to work with saved models.

removeNumbers

A logical parameter to specify whether to remove numbers.

removePunctuation

A logical parameter to specify whether to remove punctuation.

removeSparseTerms

The maximal allowed sparsity of a term, as used by removeSparseTerms in package tm. See package tm for more details.

removeStopwords

A logical parameter to specify whether to remove stopwords using the language specified in language.

stemWords

A logical parameter to specify whether to stem words using the language specified in language.

stripWhitespace

A logical parameter to specify whether to strip whitespace.

toLower

A logical parameter to specify whether to make all text lowercase.

weighting

Either tm::weightTf or tm::weightTfIdf. See package tm for more details.
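
To make the ngramLength and weighting arguments more concrete, the following base R sketch builds bigrams and a small term-frequency matrix by hand, then applies one common form of tf-idf weighting (count times log2 of the number of documents over the number of documents containing the term). It is an illustration only and is not the code used by create_matrix or tm; the toy documents and the helper make_ngrams are assumptions made up for this example.

```r
# Illustrative sketch in base R; not the implementation used by create_matrix or tm.
docs <- c("the cat sat", "the cat ran", "a dog ran")

# Split each document into lowercase word tokens.
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Hypothetical helper: join every run of n consecutive tokens into one n-gram.
# With n = 1 the result is simply the individual words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

bigrams <- lapply(tokens, make_ngrams, n = 2)

# Term-frequency matrix: one row per document, one column per distinct term.
terms <- sort(unique(unlist(tokens)))
tf <- t(vapply(tokens,
               function(w) as.numeric(table(factor(w, levels = terms))),
               numeric(length(terms))))
dimnames(tf) <- list(paste0("doc", seq_along(docs)), terms)

# Tf-idf: scale each count by log2(documents / documents containing the term),
# so terms that occur in many documents receive less weight.
doc_freq <- colSums(tf > 0)
tfidf <- sweep(tf, 2, log2(nrow(tf) / doc_freq), `*`)
```

Under plain term-frequency weighting every occurrence counts equally, whereas under tf-idf a term found in most documents contributes less than a term found in only one.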

Author(s)

Timothy P. Jurka <tpjurka@ucdavis.edu>, Loren Collingwood <lorenc2@uw.edu>

Examples

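library(RTextTools)
data(NYTimes)
# Draw a random sample of 100 articles from the NYTimes data set.
data <- NYTimes[sample(1:3100,size=100,replace=FALSE),]
# Build a tf-idf weighted document-term matrix from the Title and Subject columns.
matrix <- create_matrix(cbind(data["Title"],data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)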
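
As the Description notes, the resulting matrix is meant to be passed to create_container. The sketch below shows one way that hand-off might look; the 75/25 split between training and test rows and the use of the Topic.Code column as labels are assumptions chosen for illustration, and the variable name doc_matrix is arbitrary.

```r
library(RTextTools)
data(NYTimes)

# Sample 100 articles and build the document-term matrix as in the example above.
data <- NYTimes[sample(1:3100, size=100, replace=FALSE),]
doc_matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)

# Each sampled article becomes one row of the matrix.
dim(doc_matrix)

# Assumed split: first 75 rows for training, remaining 25 for testing.
container <- create_container(doc_matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
```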

Example output

Loading required package: SparseM

Attaching package: 'SparseM'

The following object is masked from 'package:base':

    backsolve

RTextTools documentation built on April 26, 2020, 9:05 a.m.