textToXY: Tools for Text Classification
In matloff/regtools: Regression and Classification Tools

textToXY,textToXYpred

R Documentation

Description

"R-style," classification-oriented wrappers for the text2vec package.

Usage

    textToXY(docs,labels,kTop=50,stopWords='a') 
    textToXYpred(ttXYout,predDocs)

Arguments

`docs`	Character vector, one element per document.
`predDocs`	Character vector, one element per document.
`labels`	Class labels, as numeric, character or factor. NULL is used at the prediction stage.
`kTop`	The number of most-frequent words to retain; 0 means retain all.
`stopWords`	Character vector of common words to delete, e.g. prepositions. `tm::stopwords('english')` is recommended.
`ttXYout`	Output object from `textToXY`.

Details

A typical classification/machine learning package will have as arguments a feature matrix X and a labels vector/factor Y. For a "bag of words" analysis in the text case, each row of X would be a document and each column a word.
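To make the shape of X concrete, here is a minimal base-R sketch of a bag-of-words matrix for three toy documents. It is only an illustration and is not how `textToXY` builds X internally; the toy documents, the labels and the simple tokenizing rule are assumptions made for the example.

```r
# Toy corpus and labels (illustrative data only)
docs <- c('the cat sat', 'the dog sat', 'the cat saw the dog')
Y <- factor(c('cat', 'dog', 'both'))

# Lower-case each document and split it on runs of non-letters
tokens <- strsplit(tolower(docs), '[^a-z]+')

# The vocabulary is the sorted set of distinct words
vocab <- sort(unique(unlist(tokens)))

# Count each vocabulary word in each document:
# one row per document, one column per word
X <- t(vapply(tokens,
              function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
              integer(length(vocab))))
colnames(X) <- vocab

print(X)
```

Each row of `X` is a document and each column a word, so `X[3, 'the']` holds the number of times "the" occurs in the third document.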

The functions here are basically wrappers for generating X. Wrappers are convenient in that:

The text2vec package is rather arcane, so an "R-style" wrapper is useful.
The text2vec functions are not directly set up to do classification, so the functions here provide the "glue" to do that.

The typical usage pattern is thus:

Run the documents vector and labels vector/factor through textToXY, generating X and Y.
Apply your favorite classification/machine learning package p to X and Y, returning o.
When predicting a new document d, run the output of textToXY and d through textToXYpred, producing x.
Pass o and x to p's predict function.
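The four steps above might look as follows. This is a sketch under stated assumptions rather than a definitive recipe: it assumes regtools and text2vec are installed, that the list returned by `textToXY` has components named `x` and `y`, and that `textToXYpred` returns a matrix whose columns match the training X. The nearest-centroid functions `fitCentroids` and `predictCentroid` are hypothetical stand-ins for "your favorite package p" and are not part of regtools.

```r
library(regtools)  # provides textToXY() and textToXYpred(); needs text2vec

docs <- c('the cat sat on the mat',
          'the dog chased the cat',
          'stocks fell as markets closed',
          'markets rallied and stocks rose')
labels <- c('pets', 'pets', 'finance', 'finance')

# Step 1: generate X and Y; kTop = 0 retains all words in this tiny corpus
ttXYout <- textToXY(docs, labels, kTop = 0,
                    stopWords = c('the', 'on', 'as', 'and'))
X <- as.matrix(ttXYout$x)  # component name x is an assumption
Y <- ttXYout$y             # component name y is an assumption

# Step 2: fit a classifier (hypothetical stand-in for package p):
# one centroid, i.e. mean word-count vector, per class
fitCentroids <- function(X, Y) {
   sums <- rowsum(X, group = Y)                 # per-class column sums
   sums / as.vector(table(Y)[rownames(sums)])   # divide by class sizes
}

# Assign each row of x to the class with the closest centroid
predictCentroid <- function(o, x) {
   apply(x, 1, function(row) {
      d <- rowSums(sweep(o, 2, row)^2)  # squared distance to each centroid
      rownames(o)[which.min(d)]
   })
}

o <- fitCentroids(X, Y)

# Step 3: convert a new document to a row x laid out like the training X
x <- textToXYpred(ttXYout, 'the cat chased the dog')

# Step 4: predict the class of the new document
pred <- predictCentroid(o, x)
print(pred)
```

Note that the prediction step passes `ttXYout`, not the fitted classifier, to `textToXYpred`; the fitted object `o` is used only in the final call.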