prepare_documents: Prepare Documents

View source: R/preprocess.R

Description

Simple text preprocessor, intended mainly for example purposes.

Usage

prepare_documents(data, ...)

## S3 method for class 'data.frame'
prepare_documents(data, text, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

## S3 method for class 'character'
prepare_documents(data, doc_id = NULL,
  min_freq = 1, lexicon = c("SMART", "snowball", "onix"), ...,
  return_doc_id = FALSE)

## S3 method for class 'factor'
prepare_documents(data, doc_id = NULL, min_freq = 1,
  lexicon = c("SMART", "snowball", "onix"), ..., return_doc_id = FALSE)

Arguments

data

A data.frame containing text and ids, where each row represents a document, or a character vector (or factor) in which each element is a document.

...

Any other parameters.

text

A bare column name or a vector of documents.

doc_id

Ids of the documents. If omitted, they are created dynamically, assuming each element of text is a document.

min_freq

Minimum term frequency. Only terms that appear more than min_freq times across documents are kept.

lexicon

Name of a lexicon of stopwords, borrowed from stop_words.

return_doc_id

Whether to return the document ids (as a named list).

Details

Tokenises each document, removes punctuation, stop words, and digits, then keeps only terms that appear more than min_freq times across all documents.
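
The steps described here can be approximated in base R. The block below is only a rough sketch, not the package's implementation: sketch_prepare is a hypothetical helper, the short stop word vector stands in for a real lexicon such as "SMART", and lower-casing and whitespace tokenisation are assumptions.

```r
# Hypothetical helper, NOT part of gensimr: a base-R approximation of the
# preprocessing steps described in Details.
sketch_prepare <- function(text, doc_id = NULL, min_freq = 1,
                           stopwords = c("the", "a", "is", "of", "and")) {
  # Create ids on the fly when none are supplied, one per element of `text`.
  if (is.null(doc_id)) doc_id <- seq_along(text)

  # Lower-case (an assumption), then replace punctuation and digits with spaces.
  clean <- gsub("[[:punct:][:digit:]]+", " ", tolower(text))

  # Split on whitespace, then drop empty strings and stop words.
  tokens <- lapply(strsplit(clean, "[[:space:]]+"), function(words) {
    words[nzchar(words) & !words %in% stopwords]
  })

  # Count each term across all documents and keep those above min_freq.
  freq <- table(unlist(tokens))
  keep <- names(freq)[freq > min_freq]
  tokens <- lapply(tokens, function(words) words[words %in% keep])

  # Name the list by document id.
  stats::setNames(tokens, doc_id)
}

docs <- sketch_prepare(
  c("The cat sat on the mat.",
    "A cat and a dog, 2 of them!",
    "The dog is on the mat"),
  doc_id = c("d1", "d2", "d3")
)
```

With the default min_freq = 1, terms that occur only once in the whole collection ("sat" and "them" in this toy input) are dropped, which mirrors the "more than min_freq" wording.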

Value

A named list of documents, where the names are the document ids.


news-r/gensimr documentation built on Jan. 9, 2021, 5:55 a.m.