knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

reticulate::use_virtualenv("../env")

First we preprocess the example data, a tiny corpus of 9 documents. This reproduces the gensim tutorial on corpora and vector spaces.

library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")
print(corpus)

# preprocess corpus
docs <- prepare_documents(corpus)

This produces the same output as the built-in prepared documents.

common_texts()

Preprocess

The following are methods that work on lists, character vectors and data.frames.

preprocessed <- preprocess(corpus)
preprocessed[[1]]

By default, preprocess applies the following filters:

preprocessed <- preprocess(
  corpus,
  filters = c(
    "strip_tags",
    "strip_punctuation",
    "strip_multiple_spaces",
    "strip_numeric",
    "remove_stopwords"
  )
)
preprocessed[[1]]
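
To make the idea of a filter chain concrete, here is a minimal base-R sketch of applying filters one after another. It is not how gensimr implements preprocess; the helpers `toy_strip_tags`, `toy_strip_punctuation`, `toy_strip_multiple_spaces` and `apply_filters` are hypothetical names made up for this illustration, and the regular expressions are only rough approximations of the real filters.

```r
# Hypothetical stand-ins for three filters, written with base R regexes.
toy_strip_tags <- function(x) gsub("<[^>]+>", "", x)
toy_strip_punctuation <- function(x) gsub("[[:punct:]]", " ", x)
toy_strip_multiple_spaces <- function(x) gsub("[[:space:]]+", " ", x)

# Apply each filter in order, feeding the output of one into the next.
apply_filters <- function(text, filters) {
  Reduce(function(acc, f) f(acc), filters, text)
}

toy_filters <- list(
  toy_strip_tags,
  toy_strip_punctuation,
  toy_strip_multiple_spaces
)

cleaned <- apply_filters("<b>Human</b> machine,   interface!", toy_filters)
print(cleaned)
```

In this sketch the order matters: the tag filter has to run before the punctuation filter, because replacing `<` and `>` with spaces first would leave the tag names behind as ordinary words.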

Remove Stopwords

Remove stopwords.

remove_stopwords(corpus[[1]])

Strip Short

Remove short words.

strip_short(corpus[[2]], min_len = 3)
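
The idea of dropping short words can be sketched in base R as a token filter. This is an illustration only, not the package's implementation: `toy_strip_short` is a hypothetical helper, and it assumes that words with at least `min_len` characters are kept.

```r
# Hypothetical helper: keep only whitespace-separated words whose length
# is at least min_len (assumed to be an inclusive threshold).
toy_strip_short <- function(x, min_len = 3) {
  tokens <- strsplit(x, "[[:space:]]+")[[1]]
  paste(tokens[nchar(tokens) >= min_len], collapse = " ")
}

short_result <- toy_strip_short("a survey of user opinion", min_len = 3)
print(short_result)
```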

Split Alphanumerics

split_alphanum("24.0hours7 days365 a1b2c3")
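
As a rough picture of what splitting alphanumerics means, the base-R sketch below inserts a space wherever a letter touches a digit. `toy_split_alphanum` is a hypothetical helper written for this illustration; it approximates the behaviour for ASCII digits and is not the package's implementation.

```r
# Hypothetical helper: insert a space at every letter/digit boundary.
toy_split_alphanum <- function(x) {
  x <- gsub("([[:alpha:]])([0-9])", "\\1 \\2", x) # letter followed by digit
  gsub("([0-9])([[:alpha:]])", "\\1 \\2", x)      # digit followed by letter
}

split_result <- toy_split_alphanum("24.0hours7 days365 a1b2c3")
print(split_result)
```

Two passes are used because a single substitution cannot match overlapping boundaries such as the `1b` and `b2` in `a1b2c3`.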

Strip Punctuation

Replaces punctuation characters with spaces.

strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")

Strip Tags

Removes tags.

strip_tags("<i>Hello</i> <b>World</b>!")

Strip Numerics

Removes digits.

strip_numeric("0text24gensim365test")

Strip Non-alphanumerics

Removes non-alphanumeric characters.

strip_non_alphanum("if-you#can%read$this&then@this#method^works")

Strip Multiple Spaces

Removes repeated whitespace characters (spaces, tabs, line breaks) and turns tabs and line breaks into spaces.

strip_multiple_spaces(paste0("salut", '\r', " les", '\n', "         loulous!"))

Stem

Transform to lowercase and stem.

stem_text("It is useful to be able to search a large collection of documents almost instantly.")

Porter Stemmer

stemmer <- porter_stemmer()
stemmer$stem_sentence("Cats and ponies have meeting")
stemmer$stem_documents(c("Cats and ponies", "have meeting"))


news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.