inst/words.R
In SnowballC: Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library

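# The following code can be used to read vocabulary lists from
# https://github.com/snowballstem/snowball-data
# It expects a checkout of that repository in ./snowball-data and an
# existing ./words directory for the output files.
# Manual fixes are needed to replace empty values by ""
library(SnowballC)
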
for(lang in getStemLanguages()) {
    cat(lang, "\n")
    vocf <- file.path("snowball-data", lang, "voc.txt")
    if(!file.exists(vocf)) vocf <- file.path("snowball-data", lang, "voc.txt.gz")
    outputf <- file.path("snowball-data", lang, "output.txt")
    if(!file.exists(outputf)) outputf <- file.path("snowball-data", lang, "output.txt.gz")
    voc <- readLines(vocf, encoding="UTF-8")
    output <- readLines(outputf, encoding="UTF-8")
    stopifnot(all(wordStem(voc, lang) == output))

    dat <- data.frame(word=voc, stem=output, stringsAsFactors=FALSE)
    # Only keep a subsample of words to reduce space needed for CRAN releases
    dat <- dat[seq(1, nrow(dat), length.out=1000),]
    save(dat, file=file.path("words", paste0(lang, ".RData")), compress="xz")
}

SnowballC documentation built on April 26, 2023, 1:17 a.m.
