| text_to_DTM | R Documentation |
A Document Term Matrix (DTM) is a structure describing the association of terms with documents. Here, a binary matrix is used, with a one if a term is present in a document and a zero otherwise.
text_to_DTM(
  corpus,
  min.freq = 20,
  ids = 1:length(corpus),
  freq.subset.ids = ids,
  included.pos = c("Noun", "Verb", "Adjective"),
  tokenize.fun = tokenize_text,
  add.ngrams = TRUE,
  aggr.synonyms = TRUE,
  n.gram.thresh = 0.5,
  syn.thresh = 0.9,
  label = "TERM__",
  na.as.missing = TRUE
)
corpus |
A vector of text documents. |
min.freq |
Minimum number of documents in which a term needs to be present to be considered. |
ids |
Identifiers (IDs) of the documents. |
freq.subset.ids |
IDs to consider when computing term frequency. |
included.pos |
Parts of speech (POS) to consider when building the DTM.
See |
tokenize.fun |
Function to use to clean up text. |
add.ngrams |
Whether to search and add non-consecutive n-grams. See
|
aggr.synonyms |
Whether to aggregate terms which almost always appear
together. See |
n.gram.thresh |
The threshold to use to identify the network of
non-consecutive n-grams if |
syn.thresh |
The threshold to use to identify the network of terms to
aggregate if |
label |
A label to prepend to term columns in the DTM. |
na.as.missing |
Whether to set as |
Before computing the DTM, document terms are cleaned, tokenized and lemmatized, and stop-words are removed.
To reduce noise, only terms that appear in enough documents, as set by
min.freq, are considered. The function also uses cosine similarity
to identify relevant subclusters of related terms or redundant ones.
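The two ideas above, filtering by document frequency and comparing terms by cosine similarity, can be sketched in base R on a toy binary matrix. This is only an illustration under stated assumptions: the variable names, the `min_docs` cutoff and the use of `>=` are hypothetical, and the sketch does not reproduce how text_to_DTM applies n.gram.thresh or syn.thresh internally.

```r
# Toy binary DTM: 4 documents (rows) x 4 terms (columns).
dtm <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 0, 1, 0,
    0, 0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("heart", "failure", "trial", "rare"))
)

# Document frequency: number of documents containing each term.
doc_freq <- colSums(dtm)

# Keep terms found in at least `min_docs` documents
# (hypothetical cutoff; the package may compare differently).
min_docs <- 2
dtm_kept <- dtm[, doc_freq >= min_docs, drop = FALSE]

# Cosine similarity between term columns:
# cross-product divided by the product of the column norms.
norms <- sqrt(colSums(dtm_kept^2))
cos_sim <- crossprod(dtm_kept) / outer(norms, norms)
```

In this toy case "rare" is dropped because it occurs in a single document, and terms that tend to occur in the same documents (such as "heart" and "failure") get a higher similarity than terms that never co-occur.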
A Document Term Matrix with a row for each document and a column for the terms plus a column with the document IDs.
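As a rough picture of the returned structure, the following base R sketch builds a small binary DTM by hand. It is an assumption-laden illustration, not the package's implementation: whitespace splitting stands in for the real tokenizer, no lemmatization or stop-word removal is done, and the `ID` column name and its position are hypothetical.

```r
docs <- c("heart failure trial", "heart surgery", "trial design")
ids <- c("a1", "a2", "a3")

# Whitespace splitting stands in for the package's tokenizer (assumption).
tokens <- strsplit(tolower(docs), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# One row per document: 1 if the term occurs in it, 0 otherwise.
presence <- t(vapply(
  tokens,
  function(tk) as.integer(vocab %in% tk),
  integer(length(vocab))
))

# Prefix term columns with a label, mirroring the `label` argument.
colnames(presence) <- paste0("TERM__", vocab)

# Add the document IDs as an extra column (column name is hypothetical).
dtm <- data.frame(ID = ids, presence, check.names = FALSE)
```

Each row corresponds to one document, and each prefixed column records whether that term is present in it.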
## Not run:
Records <- import_data(get_session_files("Session1")$Records)
Title_DTM <- with(
  Records,
  text_to_DTM(Title,
    min.freq = 20, label = "TITLE__", ids = ID,
    freq.subset.ids = ID[Target %in% c("y", "n")]
  )
)
## End(Not run)