dfm_trim | R Documentation |
Returns a document-feature matrix reduced in size based on document and term frequency, usually by a minimum frequency but optionally by a maximum frequency. Setting both a minimum and a maximum selects features whose frequencies fall within that range.
Feature selection is implemented by considering features across
all documents, by summing them for term frequency, or counting the
documents in which they occur for document frequency. Rank and quantile
versions of these are also implemented, for taking the first n
features in terms of descending order of overall global counts or document
frequencies, or as a quantile of all frequencies.
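As a minimal sketch of the two quantities described above, the following toy example computes term frequency and document frequency by hand and compares the result with a count-based trim. The corpus and the object names (`dfmat_toy`, `manual`, `trimmed`) are made up for illustration, and the manual filter only approximates what `dfm_trim()` does for the default `"count"` type.

```r
library(quanteda)

# Toy corpus (hypothetical data, for illustration only)
toks <- tokens(c(d1 = "a a a b c", d2 = "a b b d", d3 = "a e"))
dfmat_toy <- dfm(toks)

# Term frequency: count of each feature summed across all documents
tf <- featfreq(dfmat_toy)
# Document frequency: number of documents in which each feature occurs
df <- docfreq(dfmat_toy)

# Manual filter: keep features with term frequency >= 2 and document frequency >= 2
manual <- dfmat_toy[, tf >= 2 & df >= 2]

# The same thresholds passed to dfm_trim() with the default "count" types
trimmed <- dfm_trim(dfmat_toy, min_termfreq = 2, min_docfreq = 2)

featnames(manual)
featnames(trimmed)
```

Both objects should retain the same features here, and the number of documents is unchanged by the trim.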
dfm_trim(
x,
min_termfreq = NULL,
max_termfreq = NULL,
termfreq_type = c("count", "prop", "rank", "quantile"),
min_docfreq = NULL,
max_docfreq = NULL,
docfreq_type = c("count", "prop", "rank", "quantile"),
sparsity = NULL,
verbose = quanteda_options("verbose")
)
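x | a dfm object |
min_termfreq, max_termfreq | minimum/maximum values of feature frequencies across all documents, below/above which features will be removed |
termfreq_type | how min_termfreq and max_termfreq are interpreted: as a count, a proportion, a rank, or a quantile of term frequencies |
min_docfreq, max_docfreq | minimum/maximum values of a feature's document frequency, below/above which features will be removed |
docfreq_type | specify how min_docfreq and max_docfreq are interpreted: as a count, a proportion, a rank, or a quantile of document frequencies |
sparsity | equivalent to 1 - min_docfreq, included for comparison with tm |
verbose | print messages |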
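The sketch below illustrates the proportion type for document frequency and its relationship to `sparsity`, under the assumption that `sparsity` is treated as `1 - min_docfreq` on the proportion scale (as the comparison with tm in the examples suggests). The four-document corpus is hypothetical.

```r
library(quanteda)

# Toy corpus with four documents (hypothetical data)
toks <- tokens(c(d1 = "x y z", d2 = "x y", d3 = "x w", d4 = "x"))
dfmat_toy <- dfm(toks)

# Keep features found in at least 40% of documents
# (0.4 * 4 documents = 1.6, so a feature must occur in 2 or more documents)
by_prop <- dfm_trim(dfmat_toy, min_docfreq = 0.4, docfreq_type = "prop")

# The same threshold expressed as a sparsity level of 1 - 0.4 = 0.6
by_sparsity <- dfm_trim(dfmat_toy, sparsity = 0.6)

featnames(by_prop)
featnames(by_sparsity)
```

The thresholds were chosen so that no feature sits exactly on the cutoff, which avoids any floating-point ambiguity in the comparison.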
A dfm reduced in features (with the same number of documents)
Trimming a dfm object is an operation based on the values in the document-feature matrix. To select subsets of a dfm based on the features themselves (meaning the feature labels from featnames()), such as those matching a regular expression, or to remove features matching a stopword list, use dfm_select().
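To make the distinction concrete, here is a small sketch contrasting value-based trimming with label-based selection. The toy corpus and object names are invented for illustration; the stopword list is the English list returned by `stopwords("en")`.

```r
library(quanteda)

# Toy corpus (hypothetical data)
toks <- tokens(c(d1 = "the cat sat", d2 = "the dog sat", d3 = "the bird"))
dfmat_toy <- dfm(toks)

# Value-based: drop features that occur in fewer than 2 documents
by_value <- dfm_trim(dfmat_toy, min_docfreq = 2)

# Label-based: drop features whose names appear in a stopword list
no_stop <- dfm_select(dfmat_toy, pattern = stopwords("en"), selection = "remove")

# Label-based: keep features whose names match a regular expression
starts_s <- dfm_select(dfmat_toy, pattern = "^s", valuetype = "regex")

featnames(by_value)
featnames(no_stop)
featnames(starts_s)
```

Note that the trim keeps "the" because it is frequent, whereas the stopword selection removes it because of its label.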
dfm_select(), dfm_sample()
dfmat <- dfm(tokens(data_corpus_inaugural))
# keep only words occurring >= 10 times and in >= 2 documents
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 2)
# keep only words occurring >= 10 times and in at least 0.4 of the documents
dfm_trim(dfmat, min_termfreq = 10, min_docfreq = 0.4, docfreq_type = "prop")
# keep only words occurring <= 10 times and in <= 2 documents
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 2)
# keep only words occurring <= 10 times and in at most 3/4 of the documents
dfm_trim(dfmat, max_termfreq = 10, max_docfreq = 0.75, docfreq_type = "prop")
# keep only words making up at least 5 in 1000 of all tokens, and occurring in at least 2 of every 5 documents
dfm_trim(dfmat, min_docfreq = 0.4, docfreq_type = "prop", min_termfreq = 0.005, termfreq_type = "prop")
## Not run:
# compare to removeSparseTerms from the tm package
(dfmattm <- convert(dfmat, "tm"))
tm::removeSparseTerms(dfmattm, 0.7)
dfm_trim(dfmat, min_docfreq = 0.3, docfreq_type = "prop")
dfm_trim(dfmat, sparsity = 0.7)
## End(Not run)