lemmatize: Lemmatize a Vector of Strings


View source: R/lemmatize.R

Description

Lemmatize a vector of strings: each word found in the dictionary is replaced with its corresponding lemma.

Usage

lemmatize(x, dictionary = lemmar::hash_lemma_en, ...)

Arguments

x

A vector of strings.

dictionary

A two-column dictionary of word forms and lemmas to use for replacement. The first column should contain the full word form in lower case; the second column contains the corresponding replacement lemma.

...

Other arguments passed to split_token.
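
As an illustration of the dictionary layout described above, here is a minimal sketch of a custom dictionary. The name my_dictionary, its column names, and its three entries are hypothetical. It also assumes that lemmatize accepts a plain data.frame, which the source does not confirm, so the call is only attempted when lemmar is installed.

```r
# Hypothetical two-column dictionary: lower-case word form, then its lemma.
# The column names are illustrative; the documentation only fixes the order.
my_dictionary <- data.frame(
    token = c("dogs", "ran", "pies"),
    lemma = c("dog", "run", "pie"),
    stringsAsFactors = FALSE
)

# The first column should already be lower case
stopifnot(identical(my_dictionary$token, tolower(my_dictionary$token)))

# Assumption: a plain data.frame is an acceptable dictionary.
# Only attempt the call when lemmar is installed.
if (requireNamespace("lemmar", quietly = TRUE)) {
    lemmar::lemmatize("the dogs ran", dictionary = my_dictionary)
}
```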

Value

Returns a vector of lemmatized strings.

Note

For speed, the lemmatizer splits each string into tokens, replaces the tokens with their lemmas, and then pastes the tokens back together. As a result, the output strings are not guaranteed to retain the exact spacing of the original.
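
The split, replace, and paste cycle can be sketched in base R. This is not lemmar's implementation: lemmatize_sketch and the three-entry dictionary are hypothetical, and the whitespace-only tokenizer is a simplification. It only shows why the original spacing can be lost.

```r
# Hypothetical helper, not part of lemmar: a base R illustration only.
lemmatize_sketch <- function(x, dictionary) {
    # Named lookup vector: names are word forms, values are lemmas
    lookup <- setNames(
        as.character(dictionary[[2]]),
        as.character(dictionary[[1]])
    )
    vapply(x, function(s) {
        if (is.na(s)) return(NA_character_)
        # Split on runs of whitespace; the original spacing is discarded here
        tokens <- strsplit(trimws(s), "\\s+")[[1]]
        hits <- lookup[tolower(tokens)]
        # Replace only the tokens that were found in the dictionary
        tokens[!is.na(hits)] <- hits[!is.na(hits)]
        # Paste back together with single spaces
        paste(tokens, collapse = " ")
    }, character(1), USE.NAMES = FALSE)
}

dict <- data.frame(
    token = c("dogs", "ran", "pies"),
    lemma = c("dog", "run", "pie"),
    stringsAsFactors = FALSE
)

# The doubled and tripled spaces collapse to single spaces; NA stays NA
lemmatize_sketch(c("the  dogs   ran", NA), dict)
```

Because the tokens are rejoined with a single space, any run of whitespace in the input comes back as one space in this sketch.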

Examples

x <- c(
    'the dirtier dog has eaten the pies',
    'that shameful pooch is tricky and sneaky',
    "He opened and then reopened the food bag",
    'There are skies of blue and red roses too!',
    NA,
    "The doggies, well they aren't joyfully running.",
    "The daddies are coming over...",
    "This is 34.546 above"
)

lemmatize(x)

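## Bigger data set
library(dplyr)

# Preview the raw dialogue
presidential_debates_2012$dialogue %>%
    head()

# Time lemmatization of the full dialogue column
gc(); tic <- Sys.time()
presidential_debates_2012$dialogue %>%
    lemmatize() %>%
    head()

# Report elapsed time explicitly in seconds (difftime may otherwise pick minutes)
cat(sprintf(
    '%s seconds for %s rows of text\n',
    round(as.numeric(difftime(Sys.time(), tic, units = 'secs')), 2),
    nrow(presidential_debates_2012)
))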

trinker/lemmar documentation built on May 7, 2019, 3:57 a.m.