Returns a sparse document-term matrix calculated from a given TIF[1] compliant token data frame or an object of class kRp.text. You can also calculate the term frequency-inverse document frequency (tf-idf) value for each term.
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE, ...)

## S4 method for signature 'data.frame'
docTermMatrix(obj, terms = "token", case.sens = FALSE,
  tfidf = FALSE)

## S4 method for signature 'kRp.text'
docTermMatrix(obj, terms = "token", case.sens = FALSE, tfidf = FALSE)
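obj | Either an object of class kRp.text or a TIF compliant token data frame.
terms | A character string defining the column of the token data to be used as terms.
case.sens | Logical, whether terms should be counted case sensitively.
tfidf | Logical, if TRUE, tf-idf values are calculated instead of absolute frequencies.
... | Additional arguments depending on the particular method.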
This is usually more interesting when done with more than a single text. If you're interested in full corpus analysis, the tm.plugin.koRpus[2] package is worth checking out. Alternatively, a data frame with multiple doc_id entries can be used.
See the examples to learn how to limit the analysis to desired word classes.
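As a rough illustration of the data frame route, the following sketch passes a hand-made token data frame with two doc_id values to the data.frame method. The data frame, its contents, and the variable names are made up for this example. It assumes that the two TIF token columns (doc_id and token) are sufficient input, which has not been checked against a particular package version.

```r
# Sketch only: tif_tokens is a hypothetical, hand-made token data frame.
if(require("koRpus", quietly = TRUE)){
  tif_tokens <- data.frame(
    doc_id = c("speech", "speech", "speech", "letter", "letter"),
    token = c("We", "the", "people", "Dear", "people"),
    stringsAsFactors = FALSE
  )
  # document-term frequencies for both documents in one sparse matrix
  multiDocMatrix <- docTermMatrix(tif_tokens)
  # the same call, asking for tf-idf values instead
  multiDocTfIdf <- docTermMatrix(tif_tokens, tfidf = TRUE)
} else {}
```

With only one doc_id the tf-idf values carry little information, which is why more than one document is used here.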
A sparse matrix of class dgCMatrix.
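To give an idea of what such a return value looks like, here is a minimal sketch that builds a small dgCMatrix by hand with the Matrix package, without using docTermMatrix() at all. The token data frame is made up, and the tf-idf step assumes the common weighting of raw counts multiplied by log(N / df); the weighting actually used by the package may differ.

```r
library(Matrix)

# hypothetical TIF-style token data frame with two documents
tokens <- data.frame(
  doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  token = c("the", "cat", "the", "the", "dog"),
  stringsAsFactors = FALSE
)

docs <- unique(tokens$doc_id)
terms <- sort(unique(tokens$token))

# one entry of 1 per token; sparseMatrix() adds up the values of
# duplicated (row, column) pairs, which yields the counts
dtm <- sparseMatrix(
  i = match(tokens$doc_id, docs),
  j = match(tokens$token, terms),
  x = rep(1, nrow(tokens)),
  dims = c(length(docs), length(terms)),
  dimnames = list(docs, terms)
)

# assumed weighting: idf = log(number of documents / document frequency)
idf <- log(length(docs) / colSums(dtm > 0))
# converted to a dense matrix here for readability only
tfidf <- sweep(as.matrix(dtm), 2, idf, "*")
```

A term that occurs in every document ("the" in this toy data) gets an idf of log(1) = 0 under this weighting, so it drops out of the tf-idf values regardless of how often it is used.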
[1] Text Interchange Formats (https://github.com/ropensci/tif)
[2] tm.plugin.koRpus: https://CRAN.R-project.org/package=tm.plugin.koRpus
# code is only run when the English language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  sample_file <- file.path(
    path.package("koRpus"), "examples", "corpus", "Reality_Winner.txt"
  )
  # of course this makes more sense with a corpus of
  # multiple texts, see the tm.plugin.koRpus[2] package
  # for that
  tokenized.obj <- tokenize(
    txt=sample_file,
    lang="en"
  )
  # get the document-term frequencies in a sparse matrix
  myDTMatrix <- docTermMatrix(tokenized.obj)

  # combine with filterByClass() to, e.g., exclude all punctuation
  myDTMatrix <- docTermMatrix(filterByClass(tokenized.obj))

  # instead of absolute frequencies, get the tf-idf values
  myDTMatrix <- docTermMatrix(
    filterByClass(tokenized.obj),
    tfidf=TRUE
  )
} else {}