| as.TermDocumentMatrix | R Documentation |
Methods to generate the classes TermDocumentMatrix or
DocumentTermMatrix as defined in the tm package. There are
many text mining applications for document-term matrices. A
DocumentTermMatrix is required as input by the topicmodels
package, for instance.
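As a rough illustration of the topicmodels use case, the sketch below builds a DocumentTermMatrix from the REUTERS sample corpus and passes it to `topicmodels::LDA()`. It assumes the polmineR, RcppCWB and topicmodels packages are installed; the number of topics (`k = 5`) and the seed are arbitrary choices for illustration, not recommendations.

```r
library(polmineR)

# Assumption: the REUTERS sample corpus shipped with RcppCWB is available
use(pkg = "RcppCWB", corpus = "REUTERS")

# One row per document (s-attribute "id"), one column per token (p-attribute "word")
dtm <- as.DocumentTermMatrix(
  "REUTERS",
  p_attribute = "word",
  s_attribute = "id",
  verbose = FALSE
)

# k and seed are arbitrary values chosen for this sketch
lda <- topicmodels::LDA(dtm, k = 5, control = list(seed = 42))

# Show the ten most probable terms per topic
topicmodels::terms(lda, 10)
```

In practice, stopwords and rare terms would usually be removed from the matrix before fitting a topic model.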
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
## S4 method for signature 'character'
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
## S4 method for signature 'corpus'
as.DocumentTermMatrix(
  x,
  p_attribute,
  s_attribute,
  stoplist = NULL,
  binarize = FALSE,
  verbose = TRUE,
  ...
)
## S4 method for signature 'character'
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
## S4 method for signature 'bundle'
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'bundle'
as.DocumentTermMatrix(x, col = NULL, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'partition_bundle'
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
## S4 method for signature 'partition_bundle'
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
## S4 method for signature 'subcorpus_bundle'
as.TermDocumentMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'subcorpus_bundle'
as.DocumentTermMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'context'
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)
## S4 method for signature 'context'
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)
x |
A |
... |
Definitions of s-attributes used for subsetting the corpus; compare the partition method. |
p_attribute |
The p-attribute that counting is based on. |
s_attribute |
An s-attribute that defines content of columns, or rows. |
verbose |
A |
stoplist |
A |
binarize |
A |
col |
The column of |
If x refers to a corpus (i.e. is a length 1 character vector), a
TermDocumentMatrix or DocumentTermMatrix will be generated for
subsets of the corpus based on the s_attribute provided. Counts are
performed for the p_attribute. Further parameters (passed in as ...)
are interpreted as s-attributes that define a subset of the corpus,
which is then split according to s_attribute. If struc values for
s_attribute are not unique, the necessary aggregation is performed,
which slows things down somewhat.
If x is a bundle or a class inheriting from it, the counts or
whatever measure is present in the stat slots (in the column
indicated by col) will be turned into the values of the sparse
matrix that is generated. A special case is the generation of the sparse
matrix based on a partition_bundle that does not yet include counts.
In this case, a p_attribute needs to be provided. Then counting will
be performed, too.
If x is a partition_bundle, and argument col is
not NULL, a TermDocumentMatrix is generated based on the
column indicated by col of the data.table with counts in the
stat slots of the objects in the bundle. If col is
NULL, the p-attribute indicated by p_attribute is decoded,
and a count is performed to obtain the values of the resulting
TermDocumentMatrix. The same procedure applies to get a
DocumentTermMatrix.
If x is a subcorpus_bundle, the p-attribute provided
by argument p_attribute is decoded, and a count is performed to
obtain the resulting TermDocumentMatrix or
DocumentTermMatrix.
A TermDocumentMatrix, or a DocumentTermMatrix object.
These classes are defined in the tm package, and inherit from the
simple_triplet_matrix-class defined in the slam-package.
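To make the class relationship concrete, the following sketch builds a tiny term-document matrix by hand, using only the slam and tm packages (no corpus required). The terms, document names and counts are invented for illustration. It shows that the object inherits from simple_triplet_matrix, that transposing it yields a DocumentTermMatrix, and that slam helpers work on it directly.

```r
library(slam)
library(tm)

# Toy data (invented): 3 terms (rows) x 2 documents (columns),
# stored as triplets (row index i, column index j, value v)
stm <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),
  j = c(1L, 1L, 2L, 2L),
  v = c(2, 1, 3, 1),
  nrow = 3L,
  ncol = 2L,
  dimnames = list(Terms = c("oil", "price", "barrel"), Docs = c("doc1", "doc2"))
)

# Coerce to a TermDocumentMatrix with plain term-frequency weighting
tdm <- as.TermDocumentMatrix(stm, weighting = weightTf)

# The tm class sits on top of the slam class
is_triplet <- inherits(tdm, "simple_triplet_matrix")

# Transposing a TermDocumentMatrix gives a DocumentTermMatrix
dtm <- t(tdm)

# slam functions operate on the sparse representation: total count per term
term_totals <- row_sums(tdm)
```

The same inspection steps should apply to matrices returned by the methods documented here, since they are objects of the same classes.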
Andreas Blaette
# examples not run by default to save time on CRAN test machines
library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")
# enriching partition_bundle explicitly
tdm <- corpus("REUTERS") %>%
  partition_bundle(s_attribute = "id") %>%
  enrich(p_attribute = "word") %>%
  as.TermDocumentMatrix(col = "count")
# leave the counting to the as.TermDocumentMatrix-method
tdm <- partition_bundle("REUTERS", s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word", verbose = FALSE)
# obtain TermDocumentMatrix directly (fastest option)
tdm <- as.TermDocumentMatrix(
  "REUTERS",
  p_attribute = "word",
  s_attribute = "id",
  verbose = FALSE
)
# workflow using split()
tdm <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word")