prep.data: Prepares data for iSA algorithm


prep.data    R Documentation

Prepares data for iSA algorithm

Description

Prepares data for the iSA algorithm. This pre-processing step performs stemming and other text-cleaning operations.

Usage

prep.data(corpus, th=0.99, lang="english", train=NULL,
          use.all=TRUE, shannon=FALSE, verbose=FALSE,
          stripWhite=TRUE, removeNum=TRUE, removePunct=TRUE,
          removeStop=TRUE, toPlain=TRUE, doGC=FALSE)

Arguments

corpus

a corpus from the tm package

th

threshold used to drop stems

lang

language of the texts; mainly used for stemming

train

a vector of tags for the training set

use.all

use all data or just the training set?

shannon

use Shannon entropy?

stripWhite

strip extra whitespace? Always TRUE unless the language is Japanese (JP) or Chinese (CN)

removeNum

remove numbers? Always TRUE unless the language is Japanese (JP) or Chinese (CN)

removePunct

remove punctuation? Always TRUE unless the language is Japanese (JP) or Chinese (CN)

removeStop

remove stop words? Always TRUE unless the language is Japanese (JP) or Chinese (CN)

toPlain

convert to plain text internally? Always TRUE unless the language is Japanese (JP) or Chinese (CN)

doGC

perform garbage collection? Setting this to TRUE is recommended for large corpora

verbose

should all steps be shown?

Details

This function requires the tm package to perform stemming.
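
To illustrate what the cleaning flags refer to, the sketch below applies the corresponding tm transformations by hand to a small made-up corpus. It is not the internal code of prep.data: the order of the steps, the lower-casing step, and the use of removeSparseTerms() as the thresholding rule for th are assumptions made for illustration only.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Toy texts, invented for this example
texts <- c("I loved the 2 new features!",
           "Waiting in   line was awful...",
           "Still loving the latest update")
corpus <- VCorpus(VectorSource(texts))

# Hand-written analogues of the cleaning flags (assumed mapping)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)                      # cf. removeNum
corpus <- tm_map(corpus, removePunctuation)                  # cf. removePunct
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # cf. removeStop
corpus <- tm_map(corpus, stemDocument, language = "english") # stemming, cf. lang
corpus <- tm_map(corpus, stripWhitespace)                    # cf. stripWhite

# Build a document-term matrix and drop very sparse stems.
# Using removeSparseTerms() for the threshold is an assumption.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, sparse = 0.99)
inspect(dtm)
```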

Value

A list with components:

S

the vector of stem strings

dtm

the document-term matrix

train

the vector of tags for the training data

th

the selected threshold
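
As a minimal sketch of how the returned components might be used, the following builds a tm corpus from a few made-up texts and passes it to prep.data. It assumes that the function is provided by the iSAX package and that a VCorpus is an acceptable input; the texts and the object name out are hypothetical, and whether such a tiny corpus is processed without warnings has not been verified.

```r
library(tm)
library(iSAX)  # assumption: the package that provides prep.data()

# Toy texts, invented for this example
texts <- c("I love this product",
           "I hate waiting in line",
           "Loving the new update",
           "Hated the customer service")
corpus <- VCorpus(VectorSource(texts))

# Call with the default threshold and language from the Usage section
out <- prep.data(corpus, th = 0.99, lang = "english", verbose = TRUE)

S   <- out$S    # vector of stem strings, one entry per document
dtm <- out$dtm  # document-term matrix

length(S)
dim(dtm)
```

The S and dtm components are typically what a subsequent call to iSA would consume; see the iSA help page for the arguments it expects.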

Author(s)

Stefano M. Iacus

References

Iacus, S.M., Curini, L., Ceron, A. (2015) iSA (U.S. provisional patent application No. 62/215264).

Ceron, A., Curini, L., Iacus, S.M. (2016) iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content, Information Sciences, V. 367-368, p. 105-124.

See Also

iSA


blogsvoices/iSAX documentation built on Oct. 11, 2022, 2:38 p.m.