prep.data | R Documentation |
Prepares data for the iSA algorithm. This pre-processing step performs stemming and other text-cleaning operations.
prep.data(corpus, th=0.99, lang="english", train=NULL, use.all=TRUE, shannon=FALSE, verbose=FALSE, stripWhite=TRUE, removeNum=TRUE, removePunct=TRUE, removeStop=TRUE, toPlain=TRUE, doGC=FALSE)
corpus |
a corpus from the tm package |
th |
threshold used to drop stems |
lang |
language of the texts; mainly used for stemming |
train |
a vector of tags for the training set |
use.all |
use all data or just the training set? |
shannon |
use Shannon entropy? |
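stripWhite |
strip extra whitespace? Always TRUE unless the texts are Japanese (JP) or Chinese (CN) |
removeNum |
remove numbers? Always TRUE unless the texts are Japanese (JP) or Chinese (CN) |
removePunct |
remove punctuation? Always TRUE unless the texts are Japanese (JP) or Chinese (CN) |
removeStop |
remove stop words? Always TRUE unless the texts are Japanese (JP) or Chinese (CN) |
toPlain |
convert to plain text internally? Always TRUE unless the texts are Japanese (JP) or Chinese (CN) |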
doGC |
perform garbage collection? TRUE is recommended for a large corpus |
verbose |
should all steps be shown? |
This function requires the tm package to perform stemming.
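For orientation, the cleaning flags above correspond to familiar tm transformations. The sketch below is not the internal code of prep.data; it only shows, on invented toy texts, what such a pipeline could look like using tm's own functions. Treating th as the sparsity bound of removeSparseTerms() is an assumption made for illustration, and stemDocument() additionally needs the SnowballC package.

```r
library(tm)  # stemDocument() also requires the SnowballC package

# Toy texts, invented for illustration only
texts <- c("I loved the 2 movies!!",
           "The movie was boring...",
           "Loving   this film")

corpus <- VCorpus(VectorSource(texts))

# Lower-casing is not one of the documented flags; it is added here so that
# stop-word removal also catches capitalised words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)                       # cf. removeNum
corpus <- tm_map(corpus, removePunctuation)                   # cf. removePunct
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # cf. removeStop
corpus <- tm_map(corpus, stripWhitespace)                     # cf. stripWhite
corpus <- tm_map(corpus, stemDocument, language = "english")  # cf. lang

# Build the document-term matrix, then drop very sparse stems.
# ASSUMPTION: th is used here as the maximal allowed sparsity.
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)

dim(dtm)    # documents x retained stems
Terms(dtm)  # the retained stems
```

With only three documents no stem reaches a sparsity above 0.99, so nothing is dropped here; on a real corpus the same threshold would discard stems that occur in fewer than about 1% of the documents.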
A list with components:
S |
the vector of stem strings |
dtm |
the document-term matrix |
train |
the vector of tags for the training data |
th |
the selected threshold |
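A quick way to inspect the returned list is sketched below. The helper summarise_prep is hypothetical (it is not part of the package) and relies only on the component names S, dtm and train listed above; the call to prep.data is left as a comment because it needs the package and a real corpus.

```r
# Hypothetical helper, not part of the package: summarises the list
# returned by prep.data() using only the documented component names.
summarise_prep <- function(out) {
  stopifnot(is.list(out), all(c("S", "dtm", "train") %in% names(out)))
  list(
    n_stems = length(out$S),      # number of stem strings
    dtm_dim = dim(out$dtm),       # documents x stems
    n_tags  = length(out$train)   # length of the tag vector
  )
}

# Intended use, assuming the package providing prep.data() is attached
# and `corpus` is a tm corpus:
# out <- prep.data(corpus, th = 0.99, lang = "english")
# summarise_prep(out)

# Stand-in object with the same component names, for demonstration only
mock <- list(S     = c("love", "movi"),
             dtm   = matrix(0, nrow = 3, ncol = 2),
             train = c("pos", NA, "neg"))
res <- summarise_prep(mock)
```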
Stefano M. Iacus
Iacus, S.M., Curini, L., Ceron, A. (2015) iSA (U.S. provisional patent application No. 62/215264).
Ceron, A., Curini, L., Iacus, S.M. (2016) iSA: A fast, scalable and accurate algorithm for sentiment analysis of social media content, Information Sciences, V. 367-368, p. 105-124.
iSA