compClass: Internal Supreme function

Description Usage Arguments Details Value Note Examples

Description

compClass fits a logistic classification model (from packages glmnet and caret) to the posterior topic compositions of each document, as estimated by Latent Dirichlet Allocation. The classes variable is used as the classification target. compClass is called by the mcLDA function.

Usage

compClass(predictors, classes, inTraining, train.glmnet = FALSE,
  cv.parallel = FALSE, train.parallel = FALSE)

Arguments

predictors

the matrix of predictors, i.e., the posterior topic compositions of each document.

classes

factor, the classification variable.

inTraining

the numeric ids of documents belonging to the training set.

train.glmnet

logical. If TRUE, the train.glmnet function from package caret is run. Default is FALSE.

cv.parallel

logical. If TRUE, parallel computation is used in Method1 with the maximum number of available cores. Default is FALSE.

train.parallel

logical. If TRUE, parallel computation is used in Method2 with the maximum number of available cores. Default is FALSE.

Details

This function recognizes the compositional nature of the predictors and applies the principle of working on coordinates when dealing with compositional data. Isometric log-ratio transformed versions of the predictors (obtained via the ilr function from package compositions) are provided as input to the classification model.
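The ilr step described above can be sketched as follows. This is a minimal illustration, not the package's internal code; it assumes package compositions is installed and that topic.posteriors is a documents-by-topics matrix of posterior compositions (rows strictly positive and summing to 1), as produced in the Examples section.

```r
library(compositions)

# Treat each row as a composition on the simplex; with k topics,
# ilr() maps each row to k - 1 unconstrained real coordinates.
X.ilr <- ilr(acomp(topic.posteriors))

# The ilr coordinates can then be used as an ordinary numeric
# predictor matrix by the classification model.
X <- as.matrix(X.ilr)
```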

We considered three different methods.

Method0 and Method1 are built on the functions glmnet and cv.glmnet, respectively, from package glmnet. Method2 refers to the function train.glmnet from package caret. Method0 tends to overfit the training set; Method1 and Method2 mitigate overfitting through cross-validation. Method2 uses repeated cross-validation and is more stable than Method1, but much more time-consuming (parallel computation is allowed).
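The three methods correspond, roughly, to the following calls. This is a hedged sketch of the underlying package functions, not compClass's exact internals; it assumes X is the ilr-transformed predictor matrix, classes is a two-level factor, and packages glmnet and caret are installed.

```r
library(glmnet)
library(caret)

# Method0: a plain glmnet fit (lasso path, alpha = 1 by default);
# prone to overfitting the training set.
fit0 <- glmnet(X, classes, family = "binomial")

# Method1: cross-validated glmnet; lambda is chosen by CV.
fit1 <- cv.glmnet(X, classes, family = "binomial")

# Method2: caret's train() with method = "glmnet" and repeated CV;
# tunes both alpha and lambda, slower but more stable.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit2 <- train(x = X, y = classes, method = "glmnet", trControl = ctrl)
```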

Value

err: a list of misclassification errors (error = 1 - Accuracy) and confusion matrices (from package caret):

e0.train

train error from method "glmnet"

e1.train

train error from method "cv.glmnet"

e2.train

train error from method "train.glmnet"

e0.test

test error from method "predict.glmnet"

e1.test

test error from method "predict.cv.glmnet"

e2.test

test error from method "predict.train.glmnet"

cm0

confusion matrix for method "glmnet"

cm1

confusion matrix for method "cv.glmnet"

cm2

confusion matrix for method "train.glmnet"

Note

Tuning parameters are alpha and lambda. Method0 and Method1 do not tune alpha, which stays at its default value alpha = 1. Method2 selects values for both alpha and lambda over the tuning parameter grid defined by expand.grid(alpha = seq(0.1, 1, 0.1), lambda = glmnetFit0$lambda).

In Method1 the best model is selected using the "one standard error rule": the default value of the penalty parameter lambda is s = "lambda.1se", stored in the cv.glmnet object. This rule takes a conservative approach; alternatively, s = "lambda.min" can be used. Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61. Insights on compositions and their use in R can be found in "Analyzing Compositional Data with R" (K. Gerald van den Boogaart, Raimon Tolosana-Delgado), Springer-Verlag, 2013.
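The choice between the two lambda values can be illustrated as follows. This is a sketch, assuming fit1 is a fitted cv.glmnet object and X.test is a matrix of ilr-transformed test predictors (both hypothetical names).

```r
library(glmnet)

# "One standard error rule" (the default used here): the largest lambda
# whose CV error is within one SE of the minimum -- more regularized,
# more conservative.
pred.1se <- predict(fit1, newx = X.test, s = "lambda.1se", type = "class")

# Alternative: the lambda that minimizes the cross-validated error.
pred.min <- predict(fit1, newx = X.test, s = "lambda.min", type = "class")
```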

Examples

## Not run: 
library(Supreme)
library(topicmodels)

# Input data.
data("dtm")
data("classes")

# Reduced dtm.lognet
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, export = TRUE)

# Run a 35-topic model over the reduced dtm.lognet and compute the topic posteriors.
ldaVEM.mod <- LDA(dtm.lognet$reduced, k = 35, method = "VEM", control = list(seed = 2014))
topic.posteriors <- posterior(ldaVEM.mod)$topics

# Misclassification errors.
set.seed(2010)  # for inTraining reproducibility
inTraining <- caret::createDataPartition(as.factor(classes), p = 0.75, list = FALSE)  # for balancing the size of target classes in training set
mis.error <- compClass(topic.posteriors, classes, inTraining)

## End(Not run)

paolofantini/Supreme documentation built on May 24, 2019, 6:14 p.m.