new_stemmer: Stemmer Construction


View source: R/stem.R

Description

Make a stemmer from a set of (term, stem) pairs.

Usage

new_stemmer(term, stem, default = NULL, duplicates = "first",
            vectorize = TRUE)

Arguments

term

character vector of terms to stem.

stem

character vector the same length as term with entries giving the corresponding stems.

default

if non-NULL, a default value to use for terms that do not have a stem; NULL specifies that such terms should be left unchanged.

duplicates

action to take for duplicates in the term list. See ‘Details’.

vectorize

whether to produce a vectorized stemmer that accepts and returns vector arguments.

Details

Given a list of terms and a corresponding list of stems, new_stemmer produces a function that maps each term to its corresponding stem. If default = NULL, then values absent from the term argument are left as-is; otherwise, they are replaced by the default value.
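The lookup behaviour described in this paragraph can be sketched in base R. This is only an illustration of the semantics, not the corpus implementation: the helper name lookup_stemmer is hypothetical, and the sketch always uses the first matching entry when a term is repeated.

```r
# Hypothetical helper (not part of corpus): builds a lookup-table stemmer.
lookup_stemmer <- function(term, stem, default = NULL) {
    stopifnot(length(term) == length(stem))
    function(x) {
        i <- match(x, term)           # position of first match; NA if absent
        out <- stem[i]
        absent <- is.na(i)
        if (is.null(default)) {
            out[absent] <- x[absent]  # leave unknown terms unchanged
        } else {
            out[absent] <- default    # replace unknown terms by the default
        }
        out
    }
}

# default = NULL: terms without a stem pass through unchanged
keep <- lookup_stemmer(LETTERS, letters)
keep_out <- keep(c("A", "B", "1"))

# default = NA: terms without a stem are replaced by NA
fill <- lookup_stemmer(LETTERS, letters, default = NA)
fill_out <- fill(c("A", "B", "1"))
```

Here keep_out retains "1" as its last entry, while fill_out has NA in that position.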

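The duplicates argument indicates the action to take if there are duplicate entries in the term argument:

duplicates = "first": take the first matching entry in the stem list;

duplicates = "last": take the last matching entry in the stem list;

duplicates = "omit": use the default value for duplicated terms;

duplicates = "fail": raise an error if there are duplicated terms.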

Value

By default, with vectorize = TRUE, the resulting stemmer accepts a character vector as input and returns a character vector of the same length with entries giving the stems of the corresponding input entries.

Setting vectorize = FALSE gives a function that accepts a single input and returns a single output. This can be more efficient when used as part of a text_filter.
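As a sketch of the scalar case, a non-vectorized stemmer can be supplied to text_tokens as the stemmer filter property. The (term, stem) table below is made up for illustration, and the example assumes the default text_filter settings, which map tokens to lower case before stemming.

```r
library(corpus)

# made-up (term, stem) pairs for illustration
term <- c("mice", "geese")
stem <- c("mouse", "goose")

# scalar stemmer: takes one term, returns one stem
scalar_stemmer <- new_stemmer(term, stem, vectorize = FALSE)
scalar_stemmer("mice")

# use the scalar stemmer as the stemmer property of the token filter
tokens <- text_tokens("Mice and geese", stemmer = scalar_stemmer)
tokens
```

Terms that are not in the table, such as "and", have no stem and are left unchanged because default = NULL.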

See Also

stem_snowball, text_filter, text_tokens.

Examples

# map uppercase to lowercase, leave others unchanged
stemmer <- new_stemmer(LETTERS, letters)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))

# map uppercase to lowercase, drop others
stemmer <- new_stemmer(LETTERS, letters, default = NA)
stemmer(c("A", "E", "I", "O", "U", "1", "2", "3"))

Example output

[1] "a" "e" "i" "o" "u" "1" "2" "3"
[1] "a" "e" "i" "o" "u" NA  NA  NA 

corpus documentation built on May 2, 2021, 9:06 a.m.