textToXY: Tools for Text Classification

textToXY, textToXYpred    R Documentation

Description

"R-style," classification-oriented wrappers for the text2vec package.

Usage

    textToXY(docs,labels,kTop=50,stopWords='a') 
    textToXYpred(ttXYout,predDocs) 

Arguments

docs

Character vector, one element per document.

predDocs

Character vector, one element per document.

labels

Class labels, as numeric, character or factor. NULL is used at the prediction stage.

kTop

The number of most-frequent words to retain; 0 means retain all.

stopWords

Character vector of common words to delete, e.g. prepositions. The recommended value is tm::stopwords('english').

ttXYout

Output object from textToXY.

Details

A typical classification/machine learning package will have as arguments a feature matrix X and a labels vector/factor Y. For a "bag of words" analysis in the text case, each row of X would be a document and each column a word.
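
To make the layout of X concrete, here is a minimal base-R sketch of a bag-of-words count matrix. It is only an illustration on made-up documents and is not how textToXY or text2vec build the matrix internally; the names docs, tokens, vocab and X are chosen for this example.

```r
# Toy corpus: one element per document (made-up data for illustration)
docs <- c("the cat sat", "the dog sat down", "cat and dog")

# Lowercase each document and split it into words on whitespace
tokens <- strsplit(tolower(docs), "[[:space:]]+")

# Vocabulary: the sorted set of distinct words across all documents
vocab <- sort(unique(unlist(tokens)))

# Count, for each document, how often each vocabulary word occurs
counts <- lapply(tokens, function(tk) {
  tabulate(match(tk, vocab), nbins = length(vocab))
})

# X: one row per document, one column per word
X <- do.call(rbind, counts)
colnames(X) <- vocab
print(X)
```

Each row of X sums to the number of words in the corresponding document, and the column names are the vocabulary.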

The functions here are basically wrappers for generating X. Wrappers are convenient in that:

  • The text2vec package is rather arcane, so an "R-style" wrapper would be useful.

  • The text2vec functions are not directly set up to do classification, so the functions here provide the "glue" to do that.

The typical usage pattern is thus:

  • Run the documents vector and labels vector/factor through textToXY, generating X and Y.

  • Apply your favorite classification/machine learning package p to X and Y, returning o.

  • When predicting a new document d, run the textToXY output and d through textToXYpred, producing x.

  • Pass o and x to p's predict function.
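
The pattern above can be sketched in base R. Everything in this block is a hypothetical stand-in written for illustration: toyXY and toyXYpred mimic the roles of textToXY and textToXYpred (they are not the regtools implementations and omit kTop and stopWords), and fitCentroids/predictCentroids play the part of "package p" as a simple nearest-centroid classifier. The point it tries to show is that the prediction step must reuse the training vocabulary, so that x has the same columns as X.

```r
# Helper: count vocabulary words per document; words not in vocab are dropped
countWords <- function(tokens, vocab) {
  rows <- lapply(tokens, function(tk) {
    idx <- match(tk, vocab)
    tabulate(idx[!is.na(idx)], nbins = length(vocab))
  })
  x <- do.call(rbind, rows)
  colnames(x) <- vocab
  x
}

# Hypothetical stand-in for textToXY: builds X and Y, remembers the vocabulary
toyXY <- function(docs, labels) {
  tokens <- strsplit(tolower(docs), "[[:space:]]+")
  vocab <- sort(unique(unlist(tokens)))
  list(x = countWords(tokens, vocab), y = labels, vocab = vocab)
}

# Hypothetical stand-in for textToXYpred: reuses the training vocabulary
toyXYpred <- function(toyOut, predDocs) {
  tokens <- strsplit(tolower(predDocs), "[[:space:]]+")
  countWords(tokens, toyOut$vocab)
}

# Toy "package p": one centroid (vector of column means) per class
fitCentroids <- function(x, y) {
  classes <- sort(unique(as.character(y)))
  cent <- do.call(rbind, lapply(classes, function(cl) {
    colMeans(x[y == cl, , drop = FALSE])
  }))
  rownames(cent) <- classes
  cent
}

# Predict the class whose centroid is closest in squared Euclidean distance
predictCentroids <- function(cent, newx) {
  apply(newx, 1, function(r) {
    d <- rowSums(sweep(cent, 2, r)^2)
    rownames(cent)[which.min(d)]
  })
}

# Step 1: documents and labels -> X and Y (made-up data)
docs <- c("great fun movie", "fun and great acting",
          "dull boring movie", "boring and dull plot")
labels <- c("pos", "pos", "neg", "neg")
out <- toyXY(docs, labels)

# Step 2: fit the classifier, returning o
o <- fitCentroids(out$x, out$y)

# Step 3: new document d -> x, using the training vocabulary
d <- "great fun plot twist"   # "twist" was never seen in training
x <- toyXYpred(out, d)

# Step 4: predict; the nearest centroid for this document is "pos"
pred <- predictCentroids(o, x)
print(pred)
```

With the real functions, the corresponding calls would follow the signatures in Usage: textToXY(docs,labels,...) for training and textToXYpred(ttXYout,predDocs) for new documents.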

Value

The function textToXY returns an R list with components x and y for X and Y, and a copy of the input stopWords.

The function textToXYpred returns X for the documents in predDocs.

Author(s)

Norm Matloff


matloff/regtools documentation built on July 17, 2022, 10:10 a.m.