getvocab: Extract words and phrases from a corpus

getvocab    R Documentation

Extract words and phrases from a corpus

Description

Extract words and phrases from a corpus of documents.

Usage

getvocab(
  corpus,
  mincount = 5,
  minphrasecount = NULL,
  ngram = 1,
  lang = "en",
  stopwords = lang,
  ...
)

Arguments

corpus

The corpus of documents (a character vector).

mincount

Minimum count for a word to be considered frequent.

minphrasecount

Minimum count for a collocation (a phrase of several words) to be considered frequent.

ngram

Maximum size of n-grams.

lang

The language of the documents, used for stemming (NULL to disable stemming).

stopwords

The stop words to remove, or the language of the documents. NULL if stop words should not be removed.

...

Other parameters.

Value

The vocabulary used in the corpus of documents.

See Also

plotzipf, stopwords, create_vocabulary

Examples

## Not run: 
text <- loadtext("http://mattmahoney.net/dc/text8.zip")
vocab1 <- getvocab(text) # With stemming
nrow(vocab1)
vocab2 <- getvocab(text, lang = NULL) # Without stemming
nrow(vocab2)

## End(Not run)
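
The following base R sketch illustrates what the mincount and stopwords arguments are meant to control: counting words across a corpus, dropping stop words, and keeping only the words that occur often enough. It is not the fdm2id implementation; the helper count_words, its tokenization rule, and the toy corpus are assumptions made for illustration only, and it performs no stemming and no n-gram extraction.

```r
# Toy corpus: a character vector with one document per element.
corpus <- c("the cat sat on the mat",
            "the dog sat on the log",
            "a cat and a dog")

# Hypothetical helper (not part of fdm2id): count words and keep the frequent ones.
count_words <- function(corpus, mincount = 2, stopwords = NULL) {
  # Lowercase, then split each document on runs of non-letter characters.
  tokens <- unlist(strsplit(tolower(corpus), "[^a-z]+"))
  tokens <- tokens[nzchar(tokens)]
  # Drop stop words if any were supplied (NULL means keep everything).
  if (!is.null(stopwords)) tokens <- tokens[!tokens %in% stopwords]
  counts <- table(tokens)
  # Keep only the words seen at least `mincount` times.
  counts <- counts[counts >= mincount]
  data.frame(term = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

vocab <- count_words(corpus, mincount = 2,
                     stopwords = c("the", "a", "on", "and"))
print(vocab)
```

With this toy corpus, only "cat", "dog" and "sat" remain: the other non-stop words occur once each and fall below mincount.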

fdm2id documentation built on July 9, 2023, 6:05 p.m.
