text_to_DTM: Convert a vector of text documents into a Document Term...

View source: R/NLP.R

text_to_DTM R Documentation

Convert a vector of text documents into a Document Term Matrix

Description

A Document Term Matrix (DTM) is a structure describing the association of terms with documents. Here it is a binary matrix, with a one if a term is present in a document and a zero otherwise.
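
As an illustration of the binary encoding, here is a minimal base R sketch. It is not the package's implementation: the toy corpus and the whitespace tokenization are assumptions standing in for the cleaning, lemmatization and stop-word removal that text_to_DTM() performs.

```r
# Toy corpus (hypothetical); splitting on spaces stands in for real tokenization.
docs <- c("bayesian model selection", "model averaging", "bayesian averaging")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term found in the corpus.
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per term: 1 if the term is present, 0 otherwise.
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(vocab %in% tk),
  integer(length(vocab))
))
colnames(dtm) <- vocab
dtm
```

Each cell records presence only, so a term repeated within a document still yields a single one.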

Usage

text_to_DTM(
  corpus,
  min.freq = 20,
  ids = 1:length(corpus),
  freq.subset.ids = ids,
  included.pos = c("Noun", "Verb", "Adjective"),
  tokenize.fun = tokenize_text,
  add.ngrams = TRUE,
  aggr.synonyms = TRUE,
  n.gram.thresh = 0.5,
  syn.thresh = 0.9,
  label = "TERM__",
  na.as.missing = TRUE
)

Arguments

corpus

A vector of text documents.

min.freq

Minimum number of documents in which a term needs to be present to be considered.

ids

Identifiers (IDs) of the documents.

freq.subset.ids

IDs to consider when computing term frequency.

included.pos

Part of speech (POS) to consider when building the DTM. See lexicon::hash_grady_pos() for a list of recognized POS.

tokenize.fun

Function to use to clean up text.

add.ngrams

Whether to search for and add non-consecutive n-grams. See DTM.add_ngrams().

aggr.synonyms

Whether to aggregate terms which almost always appear together. See DTM.aggr_synonyms().

n.gram.thresh

The threshold to use to identify the network of non-consecutive n-grams if add.ngrams is TRUE.

syn.thresh

The threshold to use to identify the network of terms to aggregate if aggr.synonyms is TRUE.

label

A label to prepend to term columns in the DTM.

na.as.missing

Whether to set the DTM cells of empty documents to NA. If FALSE, those cells will be set to zero.

Details

Before computing the DTM, document terms are cleaned, tokenized and lemmatized, and stop-words are removed.

To reduce noise, only terms that appear in at least min.freq documents are considered. The function also uses cosine similarity between terms to identify relevant subclusters of related terms and redundant terms.
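
The two ideas above, document-frequency filtering and cosine similarity between terms, can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy matrix, the min_freq variable and the pairwise computation are hypothetical, and the way the package turns similarities into term networks (via n.gram.thresh and syn.thresh) is not reproduced here.

```r
# Toy binary DTM: 5 documents x 4 terms (hypothetical data).
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 1,
    1, 1, 1, 0,
    0, 0, 1, 0,
    1, 1, 0, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("systematic", "review", "trial", "cohort"))
)

# Keep only the terms present in at least `min_freq` documents.
min_freq <- 2
doc_freq <- colSums(dtm)
kept <- dtm[, doc_freq >= min_freq, drop = FALSE]

# Cosine similarity between every pair of term columns.
norms <- sqrt(colSums(kept^2))
cos_sim <- crossprod(kept) / outer(norms, norms)
round(cos_sim, 2)
```

In this toy data "systematic" and "review" always co-occur, so their similarity is 1; with a threshold such as 0.9 they would be candidates for aggregation, whereas "cohort" is dropped because it appears in a single document.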

Value

A Document Term Matrix with a row for each document and a column for each term, plus a column with the document IDs.
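
To make the shape of the result concrete, the sketch below assembles a small table by hand. It does not call text_to_DTM(); the data, the is_empty flag, the ID column name and the column order are assumptions chosen for illustration, guided only by the label and na.as.missing arguments described above.

```r
ids <- c("a1", "a2", "a3")

# Binary presence matrix for three documents; the second document is empty.
bin <- matrix(
  c(1L, 1L,
    0L, 0L,
    1L, 0L),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c("model", "selection"))
)
is_empty <- c(FALSE, TRUE, FALSE)

# Prepend the label to the term columns, as the `label` argument describes.
label <- "TERM__"
colnames(bin) <- paste0(label, colnames(bin))

# Mimic na.as.missing = TRUE: cells of empty documents become NA instead of 0.
bin[is_empty, ] <- NA_integer_

# One row per document: the ID column plus one column per term.
out <- data.frame(ID = ids, bin)
out
```

Marking empty documents as NA rather than zero keeps "no text available" distinct from "term absent" in later analyses.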

Examples

## Not run: 

Records <- import_data(get_session_files("Session1")$Records)

Title_DTM <- with(
  Records,
  text_to_DTM(Title,
    min.freq = 20, label = "TITLE__", ids = ID,
    freq.subset.ids = ID[Target %in% c("y", "n")]
  )
)

## End(Not run)

bakaburg1/BaySREn documentation built on March 30, 2022, 12:16 a.m.