lma_dtm | R Documentation |
Creates a document-term matrix (dtm) from a set of texts.
lma_dtm(text, exclude = NULL, context = NULL, replace.special = FALSE,
  numbers = FALSE, punct = FALSE, urls = TRUE, emojis = FALSE,
  to.lower = TRUE, word.break = " +", dc.min = 0, dc.max = Inf,
  sparse = TRUE, tokens.only = FALSE)
text |
Texts to be processed. This can be a vector (such as a column in a data frame)
or a list. When a list, these can be in the form returned with tokens.only = TRUE,
or a list of named vectors with terms as names and counts as values.
exclude |
A character vector of words to be excluded. If …
context |
A character vector used to reformat text based on look-ahead/behind. For example,
you might attempt to disambiguate "like" by reformatting certain instances of "like"
(e.g., …).
replace.special |
Logical: if …
numbers |
Logical: if …
punct |
Logical: if …
urls |
Logical: if …
emojis |
Logical: if …
to.lower |
Logical: if …
word.break |
A regular expression string determining the way words are split. Default is
" +", which splits at runs of one or more spaces.
dc.min |
Numeric: excludes terms appearing in the set number of documents or fewer. Default is 0 (no limit).
dc.max |
Numeric: excludes terms appearing in the set number of documents or more. Default is Inf (no limit).
sparse |
Logical: if FALSE, a regular matrix is returned instead of a sparse matrix.
tokens.only |
Logical: if TRUE, a list is returned instead of a matrix.
A sparse matrix (or regular matrix if sparse = FALSE), with a row per text and a column per term, or a list if tokens.only = TRUE. Includes an attribute with options (opts), and attributes with word count (WC) and column sums (colsums) if tokens.only = FALSE.
This is a relatively simple way to make a dtm. To calculate the (more or less) standard forms of LSM and LSS, a somewhat raw dtm should be fine, because both processes essentially use dictionaries (obviating stemming) and weighting or categorization (largely obviating 'stop word' removal). The exact effect of additional processing will depend on the dictionary/semantic space and weighting scheme used (particularly for LSA).
This function also does some processing which may matter if you plan on categorizing with categories that have terms with look-ahead/behind assertions (like LIWC dictionaries). Otherwise, other methods may be faster, more memory efficient, and/or more featureful.
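As a rough illustration of what a "somewhat raw" dtm looks like, the base-R sketch below lower-cases the texts, strips punctuation, splits at a word.break pattern, and counts terms per text. The helper name simple_dtm is hypothetical, and this is not how lma_dtm() is implemented; it only mirrors the row-per-text, column-per-term layout of the returned matrix.

```r
# Minimal base-R sketch of a raw document-term matrix.
# simple_dtm is a hypothetical helper, not part of any package, and it does
# far less processing than lma_dtm().
simple_dtm <- function(text, word.break = " +") {
  # lower-case the texts and drop punctuation
  cleaned <- gsub("[[:punct:]]+", "", tolower(text))
  # split each text into tokens at the word.break pattern
  tokens <- strsplit(trimws(cleaned), word.break)
  terms <- sort(unique(unlist(tokens)))
  # one row per text, one column per term, filled with zeros to start
  dtm <- matrix(0, length(tokens), length(terms), dimnames = list(NULL, terms))
  for (i in seq_along(tokens)) {
    counts <- table(tokens[[i]])
    dtm[i, names(counts)] <- as.vector(counts)
  }
  dtm
}

text <- c(
  "Why, hello there! How are you this evening?",
  "I am well, thank you for your inquiry!"
)
dtm <- simple_dtm(text)
rowSums(dtm) # word count per text
colSums(dtm) # total count per term
```

Here rowSums() and colSums() stand in for the kind of word-count and column-sum summaries that the returned dtm carries as attributes.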
# lma_dtm() is provided by the lingmatch package
library(lingmatch)
text <- c(
  "Why, hello there! How are you this evening?",
  "I am well, thank you for your inquiry!",
  "You are a most good at social interactions person!",
  "Why, thank you! You're not all bad yourself!"
)
# make a dtm with the default settings
lma_dtm(text)
# return tokens only
(tokens <- lma_dtm(text, tokens.only = TRUE))
## convert those to a regular DTM
lma_dtm(tokens)
# convert a list-representation to a sparse matrix
lma_dtm(list(
  doc1 = c(why = 1, hello = 1, there = 1),
  doc2 = c(i = 1, am = 1, well = 1)
))
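The dc.min and dc.max arguments filter terms by the number of documents they appear in. The following base-R sketch applies that documented rule to a plain count matrix; filter_by_doc_count is a hypothetical helper written for illustration, and it is an assumption that lma_dtm() counts a document once per term regardless of how often the term occurs in it.

```r
# Hypothetical helper illustrating the documented dc.min / dc.max rule on a
# plain count matrix; this is not the code used inside lma_dtm().
filter_by_doc_count <- function(dtm, dc.min = 0, dc.max = Inf) {
  # number of documents in which each term appears at least once
  doc_counts <- colSums(dtm > 0)
  # keep terms found in more than dc.min and fewer than dc.max documents
  keep <- doc_counts > dc.min & doc_counts < dc.max
  dtm[, keep, drop = FALSE]
}

dtm <- rbind(
  doc1 = c(why = 1, hello = 1, you = 1, thank = 0),
  doc2 = c(why = 0, hello = 0, you = 1, thank = 1),
  doc3 = c(why = 1, hello = 0, you = 2, thank = 1)
)
# drops "hello", which appears in only one document
colnames(filter_by_doc_count(dtm, dc.min = 1))
# drops "you", which appears in all three documents
colnames(filter_by_doc_count(dtm, dc.max = 3))
```

With the defaults (dc.min = 0, dc.max = Inf), every term that occurs in at least one document is kept.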