corpus_gen: Corpus Generator


View source: R/corpus_gen.R

Description

Prepares a data vector for building a corpus. It converts the text to lower case, strips extra whitespace, removes punctuation, performs stemming and filters stopwords. You could also use the tm package to build the corpus from scratch, but this function makes repeated generation of VCorpus objects easy.
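In the definition listed under Examples, stemming runs before stopword removal, so a stopword whose stem differs from its original spelling may no longer match the stopword list. The following is a minimal sketch for checking which English stopwords are affected. It assumes the tm and SnowballC packages are installed; calling SnowballC::wordStem directly (instead of going through stemDocument) is a simplification made here for illustration.

```r
library(tm)  # provides stopwords()

# Built-in English stopword list, as used when lang = "english"
sw <- stopwords("english")

# Stem every stopword with the Snowball English stemmer.
# Assumption: calling SnowballC directly stands in for tm's stemDocument().
stemmed <- SnowballC::wordStem(sw, language = "english")

# Stopwords whose stemmed form differs from the original spelling;
# these are the ones the later removeWords step could miss
changed <- sw[stemmed != sw]
head(data.frame(stopword = changed, stem = stemmed[stemmed != sw]))
```

If the resulting list matters for your data, passing the stemmed forms via furtherStops is one possible workaround.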

Usage

corpus_gen(data.vector, lang, furtherStops = NULL)

Arguments

data.vector

A vector containing one string per document: c("DocA", "DocB", ..., "DocN")

lang

Language of the documents as a string, e.g. "english". The signature shown under Usage defines no default, so this argument must be supplied. It also determines which stopwords are filtered and which stemmer is applied in the generation step.

furtherStops

A vector of words that should also be filtered from the corpus, in addition to the standard stopwords for lang. Default is NULL (no additional words).
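To make the effect of lang and furtherStops more concrete, here is a small sketch that applies the same tm helpers to a plain string instead of a corpus. The sample text and the extra stopword are hypothetical, and the two removal steps that corpus_gen performs separately are combined into one call for brevity.

```r
library(tm)  # provides stopwords() and removeWords()

# `lang` selects the built-in stopword list that gets filtered
head(stopwords("english"))
head(stopwords("german"))

# Hypothetical input and extra stopword, for illustration only
text  <- "the quick brown fox"
extra <- c("fox")

# Remove the standard English stopwords plus the extra word
cleaned <- removeWords(text, c(stopwords("english"), extra))
cleaned
```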

Value

A VCorpus object (tm package) containing the preprocessed documents.

Author(s)

MFinst

Examples


## The function is currently defined as
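library(tm)  # provides VCorpus, tm_map, stopwords; stemDocument also needs SnowballC
corpus_gen <- function (data.vector, lang, furtherStops = NULL)
{
    # One document per element of data.vector
    corpus = VCorpus(VectorSource(as.vector(data.vector)), readerControl = list(language = lang))
    # Normalise: lower case, collapse whitespace, drop punctuation
    corpus = tm_map(corpus, content_transformer(tolower))
    corpus = tm_map(corpus, stripWhitespace)
    corpus = tm_map(corpus, removePunctuation)
    # Stem first, then remove the stopwords for lang
    corpus = tm_map(corpus, stemDocument, lang)
    corpus = tm_map(corpus, removeWords, stopwords(lang))
    # Optionally remove user-supplied words
    if (!is.null(furtherStops)) {
        corpus = tm_map(corpus, removeWords, furtherStops)
    }
    return(corpus)
}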
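
A possible usage example, given as a sketch rather than an official one: it repeats the definition from above so that it runs on its own, and assumes the tm and SnowballC packages are installed. The two sample documents and the extra stopword "mailman" are made up for illustration.

```r
library(tm)  # stemDocument additionally requires SnowballC

# Definition repeated from this help page so the example is self-contained
corpus_gen <- function(data.vector, lang, furtherStops = NULL) {
    corpus <- VCorpus(VectorSource(as.vector(data.vector)),
                      readerControl = list(language = lang))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, stripWhitespace)
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stemDocument, lang)
    corpus <- tm_map(corpus, removeWords, stopwords(lang))
    if (!is.null(furtherStops)) {
        corpus <- tm_map(corpus, removeWords, furtherStops)
    }
    corpus
}

# Hypothetical input: one string per document
docs <- c("The cats are running quickly.",
          "Dogs were barking at the mailman!")

corp <- corpus_gen(docs, "english", furtherStops = c("mailman"))

# Inspect the result: class, number of documents, processed text
class(corp)
length(corp)
processed <- vapply(corp, function(d) paste(content(d), collapse = " "),
                    character(1))
processed
```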

mfinst/TM-CoCit-Support-FM documentation built on March 4, 2020, 8:38 p.m.