knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

reticulate::use_virtualenv("../env")
First, we preprocess the corpus using example data: a tiny corpus of nine documents. This reproduces the gensim tutorial on corpora and vector spaces.
library(gensimr)

set.seed(42) # reproducibility

# sample data
data(corpus, package = "gensimr")
print(corpus)

# preprocess corpus
docs <- prepare_documents(corpus)
This produces the same output as the built-in prepared documents.
common_texts()
The following are methods that work on lists, character vectors and data.frames.
preprocessed <- preprocess(corpus)
preprocessed[[1]]
By default, the function preprocess applies the following filters:
strip_tags
strip_punctuation
strip_multiple_spaces
strip_numeric
remove_stopwords
strip_short
stem_text
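To make the idea of a filter chain concrete, here is a minimal base R sketch of applying several text filters one after another. The `sketch_*` functions are hypothetical stand-ins written for this illustration. They are not the gensimr functions and may differ from them in detail.

```r
# Hypothetical stand-ins for three of the filters, written in base R.
# They are NOT the gensimr implementations.
sketch_strip_tags <- function(x) gsub("<[^>]*>", "", x)
sketch_strip_numeric <- function(x) gsub("[0-9]+", "", x)
sketch_strip_multiple_spaces <- function(x) gsub("[[:space:]]+", " ", x)

# Apply each filter in turn, feeding the result of one into the next.
sketch_pipeline <- function(x, filters) {
  Reduce(function(text, f) f(text), filters, x)
}

filters <- list(
  sketch_strip_tags,
  sketch_strip_numeric,
  sketch_strip_multiple_spaces
)

result <- sketch_pipeline("<b>Hello</b>   world 42", filters)
print(result)
```

In this sketch the order of the list matters: collapsing whitespace last cleans up the gaps left behind by the earlier filters.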
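To apply only some of these filters, pass their names to the filters argument. The call below leaves out strip_short and stem_text:
preprocessed <- preprocess(
  corpus,
  filters = c(
    "strip_tags",
    "strip_punctuation",
    "strip_multiple_spaces",
    "strip_numeric",
    "remove_stopwords"
  )
)
preprocessed[[1]]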
Remove stopwords.
remove_stopwords(corpus[[1]])
Remove short words.
strip_short(corpus[[2]], min_len = 3)
Add spaces between digits and letters.
split_alphanum("24.0hours7 days365 a1b2c3")
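For readers who want to see what splitting digits from letters amounts to, here is a rough base R approximation using two regular-expression passes. `sketch_split_alphanum` is a hypothetical helper for illustration only, assuming ASCII letters and digits. It is not how gensimr implements `split_alphanum`.

```r
# Hypothetical base R approximation, NOT the gensimr implementation.
sketch_split_alphanum <- function(x) {
  # Insert a space wherever a letter is directly followed by a digit.
  x <- gsub("([A-Za-z])([0-9])", "\\1 \\2", x)
  # Insert a space wherever a digit is directly followed by a letter.
  gsub("([0-9])([A-Za-z])", "\\1 \\2", x)
}

result_split <- sketch_split_alphanum("24.0hours7 days365 a1b2c3")
print(result_split)
```

Two passes are used because a single pattern with non-overlapping matches would miss boundaries in alternating runs such as "a1b2c3".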
Replaces punctuation with space.
strip_punctuation("A semicolon is a stronger break than a comma, but not as much as a full stop!")
Removes tags.
strip_tags("<i>Hello</i> <b>World</b>!")
Removes digits.
strip_numeric("0text24gensim365test")
Removes non-alphanumeric characters.
strip_non_alphanum("if-you#can%read$this&then@this#method^works")
Removes repeated whitespace characters (spaces, tabs, line breaks) from the string and turns tabs and line breaks into spaces.
strip_multiple_spaces(paste0("salut", '\r', " les", '\n', " loulous!"))
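As a point of comparison, the same kind of whitespace normalisation can be sketched in base R with a single regular expression. `sketch_collapse_whitespace` is a hypothetical name used only here, and it is an approximation rather than the gensimr implementation.

```r
# Hypothetical base R approximation, NOT the gensimr implementation.
# Any run of whitespace (spaces, tabs, carriage returns, newlines)
# is replaced by a single space.
sketch_collapse_whitespace <- function(x) gsub("[[:space:]]+", " ", x)

input <- paste0("salut", "\r", " les", "\n", " loulous!")
result_ws <- sketch_collapse_whitespace(input)
print(result_ws)
```

Note that this sketch does not trim leading or trailing whitespace. A run at either end is reduced to one space.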
Transform to lowercase and stem.
stem_text("It is useful to be able to search a large collection of documents almost instantly.")
stemmer <- porter_stemmer()
stemmer$stem_sentence("Cats and ponies have meeting")
stemmer$stem_documents(c("Cats and ponies", "have meeting"))