get_dfm: Get Document Feature Matrix

get_dfm    R Documentation

Get Document Feature Matrix

Description

Builds a document feature matrix using the quanteda package.

Usage

get_dfm(
  docs,
  doc_name = "text",
  index_name = "id",
  stem = T,
  ngrams = 1,
  trimPct = 1e-04,
  min_doc_freq = 2,
  idfWeight = F,
  removeStopWords = T,
  minChar = 4
)

Arguments

docs

[matrix] Matrix of labeled and unlabeled documents.

doc_name

[character] Character string indicating the variable in 'docs' that denotes the text of the documents to be classified.

index_name

[character] Character string indicating the variable in 'docs' that denotes the index value of the document to be classified.

stem

[logical] Switch indicating whether or not to stem terms.

ngrams

[integer] Integer value indicating the size of the ngram to use to build the dfm.
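To make the ngrams argument concrete, the following base R sketch shows what "size of the ngram" means for a single tokenized document. The helper make_ngrams is hypothetical and is not part of activeR or quanteda. Joining tokens with an underscore is an assumption that mirrors quanteda's default concatenator.

```r
# Hypothetical helper (not part of activeR or quanteda): build all
# contiguous n-grams from a character vector of tokens.
make_ngrams <- function(tokens, n = 1) {
  # A document shorter than n tokens has no n-grams.
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

tokens <- c("raise", "the", "minimum", "wage")

unigrams <- make_ngrams(tokens, n = 1)  # same as the original tokens
bigrams  <- make_ngrams(tokens, n = 2)  # adjacent pairs of tokens
```

With n = 1 each feature is a single token. With n = 2 each feature is a pair of adjacent tokens, so a four-token document yields three bigrams.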

trimPct

[numeric] Numeric value indicating the minimum proportion of documents in which a term must appear; terms below this threshold are removed from the document term matrix. E.g., if trimPct = .5, then all words that appear in fewer than 50 percent of the documents will be removed.

min_doc_freq

[integer] Minimum number of documents a term must be in to stay in the document term matrix.
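The two trimming arguments can be read as two filters on document frequency: trimPct is a proportion of documents and min_doc_freq is an absolute count. The base R sketch below illustrates that reading on a toy count matrix. The helper trim_terms is hypothetical, and treating the two thresholds as jointly required is an assumption; the package may apply them differently internally.

```r
# Toy document term count matrix: 4 documents (rows) x 3 terms (columns).
counts <- matrix(
  c(1, 0, 2,
    1, 0, 1,
    3, 1, 0,
    1, 0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("tax", "zebra", "vote"))
)

# Hypothetical helper (not part of activeR): keep a term only if it
# clears both the proportional and the absolute threshold.
trim_terms <- function(counts, trimPct = 1e-04, min_doc_freq = 2) {
  doc_freq  <- colSums(counts > 0)       # documents containing each term
  doc_share <- doc_freq / nrow(counts)   # proportion of documents
  keep <- doc_share >= trimPct & doc_freq >= min_doc_freq
  counts[, keep, drop = FALSE]
}

half    <- trim_terms(counts, trimPct = 0.5,  min_doc_freq = 2)
strict  <- trim_terms(counts, trimPct = 0.75, min_doc_freq = 2)
```

Here "tax" appears in all four documents, "vote" in two, and "zebra" in one. With trimPct = 0.5 the terms "tax" and "vote" survive; raising trimPct to 0.75 leaves only "tax".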

idfWeight

[logical] Switch indicating whether to weight the document term matrix by the frequency of word counts. Only works if dfmType = "quanteda".
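The description above does not spell out the weighting formula. Assuming idfWeight refers to the common inverse document frequency scheme, in which each count is multiplied by log10(N / document frequency), the effect can be sketched in base R as follows. This is an illustration under that assumption, not the package's code.

```r
# Toy document term count matrix: 3 documents (rows) x 2 terms (columns).
counts <- matrix(
  c(2, 0,
    1, 1,
    1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("tax", "zebra"))
)

# Assumed scheme: idf = log10(number of documents / document frequency).
doc_freq <- colSums(counts > 0)
idf      <- log10(nrow(counts) / doc_freq)

# Multiply every column (term) by its idf weight.
weighted <- sweep(counts, 2, idf, `*`)
```

Under this scheme a term that occurs in every document, such as "tax" here, receives a weight of zero, while rarer terms are weighted up relative to common ones.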

Value

[matrix] Document term matrix.


activetext/activeR documentation built on May 31, 2024, 10:21 a.m.