text_to_DTM | R Documentation |
A Document Term Matrix (DTM) is a structure describing the association of terms with documents. This function builds a binary matrix, with a one if a term is present in a document and a zero otherwise.
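As a rough illustration of the binary structure described above, the following base R sketch builds a tiny DTM by hand. It is not the package's implementation: the toy corpus is made up, and plain whitespace splitting stands in for the cleaning, part-of-speech filtering and lemmatization that the function performs.

```r
# Toy corpus (hypothetical); names act as document IDs.
docs <- c(
  d1 = "bayesian model selection",
  d2 = "model averaging",
  d3 = "bayesian averaging model"
)

# Whitespace tokenization is an assumption made for brevity; it stands in
# for the cleaning and lemmatization steps of the real function.
tokens <- strsplit(docs, " ", fixed = TRUE)
terms <- sort(unique(unlist(tokens)))

# One row per document, one column per term: 1 if present, 0 otherwise.
dtm <- t(vapply(
  tokens,
  function(tk) as.integer(terms %in% tk),
  integer(length(terms))
))
dimnames(dtm) <- list(names(docs), terms)

dtm
```

Each cell only records presence, so a term repeated several times in the same document still yields a single 1.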
text_to_DTM(
  corpus,
  min.freq = 20,
  ids = 1:length(corpus),
  freq.subset.ids = ids,
  included.pos = c("Noun", "Verb", "Adjective"),
  tokenize.fun = tokenize_text,
  add.ngrams = TRUE,
  aggr.synonyms = TRUE,
  n.gram.thresh = 0.5,
  syn.thresh = 0.9,
  label = "TERM__",
  na.as.missing = TRUE
)
corpus |
A vector of text documents. |
min.freq |
Minimum number of documents in which a term needs to be present to be considered. |
ids |
Identifiers of the documents. |
freq.subset.ids |
IDs to consider when computing term frequency. |
included.pos |
Part of speech (POS) to consider when building the DTM.
See |
tokenize.fun |
Function to use to clean up text. |
add.ngrams |
Whether to search and add non-consecutive n-grams. See
|
aggr.synonyms |
Whether to aggregate terms which almost always appear
together. See |
n.gram.thresh |
The threshold to use to identify the network of
non-consecutive n-grams if |
syn.thresh |
The threshold to use to identify the network of terms to
aggregate if |
label |
A label to prepend to term columns in the DTM. |
na.as.missing |
Whether to set as |
Before computing the DTM, document terms are cleaned, tokenized and lemmatized, and stop-words are removed.
To reduce noise, only terms that appear in at least min.freq documents
are considered. The function also uses cosine similarity
to identify relevant subclusters of related terms or redundant ones.
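The two ideas in this paragraph, the document-frequency filter and cosine similarity between terms, can be sketched in base R as follows. This is only a minimal illustration under stated assumptions: the matrix, the term names and the `min_freq` value are invented, and the way the function itself turns similarities into networks of n-grams or synonyms (via n.gram.thresh and syn.thresh) is not reproduced here.

```r
# Hypothetical binary DTM: 5 documents (rows), 4 terms (columns).
dtm <- rbind(
  c(1, 1, 0, 1),
  c(1, 1, 0, 0),
  c(1, 1, 1, 0),
  c(0, 0, 1, 0),
  c(1, 1, 0, 0)
)
colnames(dtm) <- c("machine", "learning", "trial", "rare")

# Keep only terms present in at least `min_freq` documents.
# `min_freq` is an illustrative stand-in for the min.freq argument.
min_freq <- 2
doc_freq <- colSums(dtm)
kept <- dtm[, doc_freq >= min_freq, drop = FALSE]

# Cosine similarity between term columns:
# dot products divided by the product of the column norms.
norms <- sqrt(colSums(kept^2))
cos_sim <- crossprod(kept) / outer(norms, norms)

round(cos_sim, 3)
```

In this toy matrix, "machine" and "learning" occur in exactly the same documents, so their similarity is 1, which is the kind of pair a high threshold would flag as redundant; "rare" is dropped by the frequency filter before any similarity is computed.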
A Document Term Matrix with a row for each document and a column for the terms plus a column with the document IDs.
## Not run:
Records <- import_data(get_session_files("Session1")$Records)

Title_DTM <- with(
  Records,
  text_to_DTM(Title,
    min.freq = 20, label = "TITLE__", ids = ID,
    freq.subset.ids = ID[Target %in% c("y", "n")]
  )
)
## End(Not run)