reduce_dtm
reduces the number of columns (terms) of a document-term matrix.
dtm | a document-term matrix in term frequency format.
method | the method for selecting the columns.
q | a list with
classes | factor. The labeling variable. Only use for
lambda | a string with the selection rule of the optimal fit. Only use for
SEED | integer, the random seed for selecting train and test set. Only use for
c_normalize | a Boolean value indicating whether the
parallel | logical. If
export | logical. If
This function is a wrapper for applying three different methods for dimensionality reduction of a document-term matrix:
It calls the reduce_dtm_tfidf function to select suitable columns of an unlabeled document-term matrix by deleting terms whose tf-idf score falls outside a user-defined range.
It calls the reduce_dtm_lognet function to apply lognet, a logistic classification method from package glmnet, to a labeled document-term matrix.
It calls the reduce_dtm_lognet_cv function to apply the same lognet method via (parallel) cross-validation.
A list as in reduce_dtm_tfidf.
A list as in reduce_dtm_lognet.
A list as in reduce_dtm_lognet_cv.
From Wikipedia: tfidf, short for term frequency inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tfidf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to control for the fact that some words are generally more common than others.
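The weighting described above can be illustrated with a minimal base R sketch. It is not the implementation used by reduce_dtm_tfidf: the toy matrix, the per-term score (mean tf-idf over the documents containing the term) and the range bounds 0.1 and 1 are all assumptions made for illustration.

```r
# Toy document-term matrix in term-frequency format
# (rows = documents, columns = terms). Hypothetical data.
dtm_toy <- matrix(
  c(2, 1, 0,
    1, 0, 3,
    1, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("the", "cat", "glmnet"))
)

# Term frequency: counts normalized by document length.
tf <- dtm_toy / rowSums(dtm_toy)

# Inverse document frequency: log(number of documents / documents containing the term).
doc_freq <- colSums(dtm_toy > 0)
idf <- log(nrow(dtm_toy) / doc_freq)

# tf-idf per cell, then an assumed per-term score:
# the mean tf-idf over the documents in which the term occurs.
tfidf <- sweep(tf, 2, idf, `*`)
term_score <- colSums(tfidf) / doc_freq

# Keep only terms whose score lies inside a hypothetical user-defined range.
keep <- term_score >= 0.1 & term_score <= 1
dtm_reduced <- dtm_toy[, keep, drop = FALSE]
```

In this toy example the term "the" occurs in every document, so its idf is zero and it falls outside the range, while the two rarer terms are kept.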
In the optimal fit of the lognet method the tuning parameters alpha and lambda are respectively set to 1 (default) and one out of lambda.min or lambda.1se.
The former follows from the "minimum training error rule" and the latter from the more conservative approach of the "one standard error rule".
Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61.
Dimensionality reduction is performed by selecting only the columns (terms) corresponding to nonzero beta coefficients in the optimal fit.
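The selection step can be sketched directly with glmnet. This is an illustration under stated assumptions, not the code inside reduce_dtm_lognet or reduce_dtm_lognet_cv: the simulated matrix x, the labels y and all variable names are hypothetical, and the glmnet package is assumed to be installed.

```r
library(glmnet)  # assumed to be installed

set.seed(123)
# Hypothetical labeled data: 100 documents, 20 terms;
# only the first two terms carry signal about the class.
x <- matrix(rpois(100 * 20, lambda = 2), nrow = 100,
            dimnames = list(NULL, paste0("term", 1:20)))
y <- factor(ifelse(x[, 1] - x[, 2] + rnorm(100) > 0, "a", "b"))

# Cross-validated, lasso-penalized logistic regression (alpha = 1).
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the chosen lambda; the first row is the intercept, so drop it.
beta <- as.matrix(coef(cv_fit, s = "lambda.1se"))[-1, 1]

# Keep only the columns (terms) with nonzero coefficients.
selected <- names(beta)[beta != 0]
x_reduced <- x[, selected, drop = FALSE]
```

Replacing s = "lambda.1se" with s = "lambda.min" typically retains more terms, because the one-standard-error rule picks a larger penalty.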
discardedTerms.txt and vocabulary.txt respectively contain the rejected terms and the vocabulary (i.e. columns) of the reduced dtm.
## Not run:
### tfidf method
library(Supreme)
data("dtm")
dtm.tfidf <- reduce_dtm(dtm, method = "tfidf")
### lognet method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, SEED = 123)
### lognet_cv method
library(Supreme)
data("dtm")
data("classes")
dtm.lognet.cv <- reduce_dtm(dtm, method = "lognet_cv", classes = classes, lambda = "lambda.1se", SEED = 123)
## End(Not run)