tm_map: Transformations on Corpora


View source: R/transform.R

Description

Interface to apply transformation functions (also denoted as mappings) to corpora.

Usage

## S3 method for class 'PCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'SimpleCorpus'
tm_map(x, FUN, ...)
## S3 method for class 'VCorpus'
tm_map(x, FUN, ..., lazy = FALSE)

Arguments

x

A corpus.

FUN

A transformation function that takes a text document as input and returns a text document. When x is a SimpleCorpus, FUN instead takes a character vector and must return a character vector of the same length. The function content_transformer can be used to wrap a function that operates on character data so that it gets and sets the content of text documents.

...

Further arguments passed to FUN.

lazy

A logical. Lazy mappings are mappings which are delayed until the content is accessed. This is useful for large corpora when only a few documents will be accessed, because it avoids the computationally expensive application of the mapping to all elements in the corpus.
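
The following sketch illustrates how FUN and the extra arguments behave for a VCorpus and for a SimpleCorpus. The two sample sentences and the helper name swap are made up for illustration and are not part of the package.

```r
library(tm)

# Two tiny in-memory documents (illustrative data, not from the package).
docs <- c("Crude oil prices rose.", "OPEC cut oil output.")

# VCorpus: FUN receives one text document at a time, so a function that
# works on character data is wrapped with content_transformer(). The
# arguments after FUN are passed on to the wrapped function via "...".
swap <- content_transformer(function(x, pattern, replacement)
    gsub(pattern, replacement, x, fixed = TRUE))
vc <- tm_map(VCorpus(VectorSource(docs)), swap, "oil", "petroleum")
vc_first <- content(vc[[1]])

# SimpleCorpus: FUN receives the whole character vector at once and must
# return a character vector of the same length.
sc <- tm_map(SimpleCorpus(VectorSource(docs)), toupper)
sc_text <- unname(content(sc))
```

Here vc_first holds the rewritten text of the first document and sc_text the upper-cased contents of both documents.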

Value

A corpus with FUN applied to each document in x. For lazy mappings, only internal flags are set; accessing an individual document triggers the execution of the corresponding transformation function for that document.
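
A minimal sketch of a lazy mapping on a small VCorpus follows. The three sample strings are invented for illustration; the point is only that the transformation is registered first and applied when a document is accessed.

```r
library(tm)

# Small illustrative corpus (sample strings are assumptions).
corp <- VCorpus(VectorSource(c("Alpha BETA", "Gamma DELTA", "Epsilon")))

# With lazy = TRUE the mapping is only recorded; no document is
# transformed at this point.
lazy_corp <- tm_map(corp, content_transformer(tolower), lazy = TRUE)

# Accessing a document with [[ triggers the pending transformation
# for that document.
first <- lazy_corp[[1]]
first_text <- content(first)
```

The original corpus corp is left unchanged; only the document retrieved from lazy_corp carries the lower-cased content.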

Note

Lazy transformations change R's standard evaluation semantics.

See Also

getTransformations for the list of available transformations.
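
As a rough illustration, predefined transformations can be chained by calling tm_map repeatedly. The sample sentence is an assumption chosen for illustration, and removePunctuation and stripWhitespace are used here with their default settings.

```r
library(tm)

# One-document corpus with punctuation and irregular spacing
# (illustrative text, not from the package).
corp <- VCorpus(VectorSource("Crude  oil,   prices: up!"))

# Names of the predefined transformations shipped with tm.
available <- getTransformations()

# Apply two predefined transformations one after the other.
clean <- tm_map(corp, removePunctuation)
clean <- tm_map(clean, stripWhitespace)
cleaned_text <- content(clean[[1]])
```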

Examples

library("tm")
data("crude")
## Document access triggers the stemming function
## (i.e., all other documents are not stemmed yet)
tm_map(crude, stemDocument, lazy = TRUE)[[1]]
## Use wrapper to apply character processing function
tm_map(crude, content_transformer(tolower))
## Generate a custom transformation function which takes the heading as new content
headings <- function(x)
    PlainTextDocument(meta(x, "heading"),
                      id = meta(x, "id"),
                      language = meta(x, "language"))
inspect(tm_map(crude, headings))

Example output

Loading required package: NLP
<<PlainTextDocument>>
Metadata:  15
Content:  chars: 484
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 20

$`reut-00001.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 40

$`reut-00002.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 47

$`reut-00004.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 41

$`reut-00005.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 41

$`reut-00006.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 41

$`reut-00007.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 45

$`reut-00008.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

$`reut-00009.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 37

$`reut-00010.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 39

$`reut-00011.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 47

$`reut-00012.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 46

$`reut-00013.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

$`reut-00014.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 49

$`reut-00015.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 48

$`reut-00016.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 40

$`reut-00018.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 45

$`reut-00019.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 45

$`reut-00021.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 46

$`reut-00022.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 44

$`reut-00023.xml`
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 45

tm documentation built on May 30, 2017, 6:57 a.m.
