reduce_dtm_lognet_cv: Internal Supreme function

Description

reduce_dtm_lognet_cv reduces the number of terms (columns) of a labeled document-term matrix. It is called by the reduce_dtm function.

Usage

reduce_dtm_lognet_cv(dtm, classes, lambda = c("lambda.min", "lambda.1se"),
  SEED, c_normalize = TRUE, parallel = TRUE, export = FALSE)

Arguments

dtm

a document-term matrix in term frequency format.

classes

factor, the labeling variable.

lambda

a string giving the rule used to select the optimal fit, either "lambda.min" or "lambda.1se".

SEED

integer, the random seed for selecting train and test sets.

c_normalize

logical. If TRUE dtm entries are (cosine) normalized. Default is TRUE.

parallel

logical. If TRUE parallel cross-validation is performed. Default is TRUE.

export

logical. If TRUE export the discarded terms, the vocabulary and the returned object to the built-in directory data/dtm. Default is FALSE.
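
To make the effect of c_normalize concrete, the following is a minimal sketch of cosine (unit-length) row normalization in base R. It is not the package's own code: the helper cosine_normalize and the toy matrix dtm_toy are hypothetical, and a dense base matrix is assumed in place of the sparse dtm object the function actually receives.

```r
# Toy document-term matrix: 3 documents (rows) x 4 terms (columns).
dtm_toy <- matrix(
  c(3, 0, 4, 0,
    0, 2, 0, 0,
    1, 1, 1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), paste0("term", 1:4))
)

# Hypothetical helper: divide each row by its Euclidean (L2) norm.
# Rows with zero norm (empty documents) are left unchanged.
cosine_normalize <- function(m) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1
  m / norms  # the vector recycles down columns, so row i is divided by norms[i]
}

dtm_norm <- cosine_normalize(dtm_toy)
rowSums(dtm_norm^2)  # squared length of each normalized row
```

After normalization, documents of different lengths contribute on the same scale, which is the usual motivation for applying it before a penalized fit.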

Details

This function fits a logistic classification model via penalized maximum likelihood by calling the lognet function from package glmnet. The regularization path is computed only for the lasso penalty (alpha = 1), over a grid of values of the regularization parameter lambda. If c_normalize = TRUE (default), the dtm is passed to the wTfIdf function for cosine normalization. The number of terms is reduced by keeping only the columns that correspond to nonzero beta coefficients in the optimal fit.
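
The selection step described above can be sketched with the public glmnet interface. This is an illustration under stated assumptions, not the function's actual implementation: it uses simulated counts instead of a real dtm, a two-class (binomial) response for simplicity, and cv.glmnet rather than the internal lognet call. All object names are hypothetical.

```r
library(glmnet)

set.seed(123)

# Simulated stand-in for a labeled dtm: 200 documents, 30 terms,
# where only the first three terms carry information about the class.
n <- 200
p <- 30
x <- matrix(rpois(n * p, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("term", seq_len(p))))
score <- 1.5 * (x[, 1] - x[, 2]) + x[, 3] - 2
y <- factor(ifelse(score + rnorm(n) > 0, "a", "b"))

# Cross-validated lasso-penalized logistic regression (alpha = 1).
cvfit <- cv.glmnet(x, y, family = "binomial", alpha = 1, nfolds = 5)

# Coefficients at the chosen lambda; the first row is the intercept.
beta <- as.matrix(coef(cvfit, s = "lambda.1se"))[-1, 1]

# Keep only the columns whose coefficient is nonzero.
selected <- names(beta)[beta != 0]
x_reduced <- x[, selected, drop = FALSE]
dim(x_reduced)
```

Using drop = FALSE keeps x_reduced a matrix even when a single term survives the penalty.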

Value

a list with the reduced dtm (in term frequency format), the IDs of the documents belonging to the training set, the glmnet fit object, the position of the best lambda, the selected terms by class, and the train and test misclassification errors err1.train and err1.test. The confusion matrix is also returned.

Note

Tuning parameters alpha and lambda are set in the optimal fit to 1 (default) and to one of lambda.min or lambda.1se, respectively. The former follows from the "minimum cross-validation error rule" and the latter from the more conservative "one standard error rule". Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61.
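
The two selection rules can be illustrated in base R on a small, made-up cross-validation summary. The vectors lambda, cv_err and cv_se below are hypothetical numbers chosen for the example, not output from the package.

```r
# Hypothetical cross-validation summary: one entry per lambda value,
# ordered from the strongest to the weakest penalty.
lambda <- c(1.00, 0.50, 0.25, 0.10, 0.05)
cv_err <- c(0.40, 0.30, 0.22, 0.20, 0.21)  # mean cross-validation error
cv_se  <- c(0.03, 0.03, 0.03, 0.03, 0.03)  # standard error of that mean

# Minimum-error rule: the lambda with the lowest mean CV error.
i_min <- which.min(cv_err)
lambda_min <- lambda[i_min]

# One-standard-error rule: the largest lambda whose mean CV error is
# within one standard error of the minimum.
threshold <- cv_err[i_min] + cv_se[i_min]
lambda_1se <- max(lambda[cv_err <= threshold])

c(lambda.min = lambda_min, lambda.1se = lambda_1se)
```

Here the one-standard-error rule picks 0.25 rather than 0.10: a stronger penalty, and therefore typically fewer retained terms, at a cross-validation error that is statistically hard to tell apart from the minimum.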

Examples

## Not run: 
library(Supreme)
data("dtm")
data("classes")
dtm.lognet.cv <- reduce_dtm_lognet_cv(dtm, classes, lambda = "lambda.1se", SEED = 123)

## End(Not run)

paolofantini/Supreme documentation built on May 24, 2019, 6:14 p.m.