Description Usage Arguments Details Value Note Examples
Description

compClass fits a logistic classification model (from packages glmnet and caret)
to the posterior topic compositions of each document, as estimated by Latent Dirichlet Allocation.
The classes variable is used as the classification variable.
compClass is called by the mcLDA function.
Usage

compClass(predictors, classes, inTraining, train.glmnet,
          cv.parallel, train.parallel)
Arguments

predictors: the matrix of predictors, i.e. the posterior topic compositions of each document.
classes: factor, the classification variable.
inTraining: the numeric ids of the documents belonging to the training set.
train.glmnet: logical. If TRUE, Method2 (the caret fit) is also computed.
cv.parallel: logical. If TRUE, the cross-validation of Method1 is run in parallel.
train.parallel: logical. If TRUE, the repeated cross-validation of Method2 is run in parallel.
Details

This function recognizes the compositional nature of the predictors
and applies the principle of working in coordinates when dealing with compositional data:
isometric log-ratio transformed versions of the predictors
(via the ilr function from package compositions) are provided as input to the classification model.
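As a minimal base-R sketch of that transformation (illustrative only; compClass relies on compositions::ilr, whose basis convention may differ in sign and ordering), a D-part composition maps to D - 1 unconstrained coordinates:

```r
# Illustrative ilr transform in base R: coordinate j compares the
# geometric mean of the first j parts with part j + 1.
ilr_sketch <- function(x) {
  D <- length(x)
  sapply(seq_len(D - 1), function(j)
    sqrt(j / (j + 1)) * (mean(log(x[1:j])) - log(x[j + 1])))
}

comp <- c(0.2, 0.5, 0.3)  # one document's topic composition (3 topics)
z <- ilr_sketch(comp)     # 2 real-valued coordinates, usable by glmnet
length(z)                 # 2
```

Because only log-ratios of the parts enter the formula, the coordinates are scale-invariant: rescaling a composition leaves them unchanged.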
Three different methods are considered.
Method0 and Method1 are built on the functions glmnet and cv.glmnet, respectively, from package glmnet.
Method2 refers to the train function from package caret (with method = "glmnet").
Method0 tends to overfit the training set;
Method1 and Method2 try to avoid overfitting by using cross-validation.
Method2 uses repeated cross-validation and is more stable than Method1, but much more time-consuming (parallel computation is allowed).
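How the three fits could look in code (a sketch under assumptions: synthetic stand-in data, a three-class outcome, and illustrative cross-validation settings; the actual defaults used by compClass may differ):

```r
library(glmnet)
library(caret)

# Synthetic stand-ins for the ilr-transformed topic compositions and classes.
set.seed(2014)
x <- matrix(rnorm(90 * 5), 90, 5)
y <- factor(rep(c("a", "b", "c"), each = 30))

# Method0: plain glmnet fit over its automatic lambda path (no tuning).
fit0 <- glmnet(x, y, family = "multinomial")

# Method1: lambda chosen by cross-validation.
fit1 <- cv.glmnet(x, y, family = "multinomial")

# Method2: repeated cross-validation over an (alpha, lambda) grid via caret;
# more stable but much slower (allowParallel = TRUE permits parallel runs).
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 2,
                     allowParallel = TRUE)
fit2 <- train(x, y, method = "glmnet", trControl = ctrl,
              tuneGrid = expand.grid(alpha = seq(0.1, 1, 0.1),
                                     lambda = fit0$lambda))
```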
Value

err: a list of misclassification errors (error = 1 - Accuracy) and confusion matrices (from package caret):

e0.train: training error from method "glmnet"
e1.train: training error from method "cv.glmnet"
e2.train: training error from method "train.glmnet"
e0.test: test error from method "predict.glmnet"
e1.test: test error from method "predict.cv.glmnet"
e2.test: test error from method "predict.train.glmnet"
cm0: confusion matrix for method "glmnet"
cm1: confusion matrix for method "cv.glmnet"
cm2: confusion matrix for method "train.glmnet"
Note

The tuning parameters are alpha and lambda.
Method0 and Method1 do not tune alpha, which stays at its default value alpha = 1 (the lasso penalty).
Method2 selects values for alpha and lambda over the tuning parameter grid defined by
expand.grid(alpha = seq(0.1, 1, 0.1), lambda = glmnetFit0$lambda).
More details can be found in the caret documentation on model tuning.
In Method1 the best model is selected using the "one standard error rule":
the default value of the penalty parameter lambda is s = "lambda.1se", stored in the cv.glmnet object.
This rule takes a conservative approach; alternatively, s = "lambda.min" can be used.
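The difference between the two choices of s can be sketched on synthetic data (illustrative only, not compClass internals):

```r
library(glmnet)

# Synthetic binary-classification data where only the first column matters.
set.seed(2014)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- factor(rbinom(100, 1, plogis(x[, 1])))

cvfit <- cv.glmnet(x, y, family = "binomial")

# lambda.1se is the largest lambda whose CV error is within one standard
# error of the minimum, so it always penalizes at least as strongly.
cvfit$lambda.1se >= cvfit$lambda.min  # TRUE

p.1se <- predict(cvfit, newx = x, s = "lambda.1se", type = "class")  # conservative default
p.min <- predict(cvfit, newx = x, s = "lambda.min", type = "class")
```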
Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61.
Insights on compositions and their use in R can be found in "Analyzing compositional data with R"
(K. Gerald van den Boogaart, Raimon Tolosana-Delgado) Springer-Verlag 2013.
Examples

## Not run:
library(Supreme)
library(topicmodels)
# Input data.
data("dtm")
data("classes")
# Reduced dtm.lognet
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, export = TRUE)
# Run a 35-topic model over the reduced dtm.lognet and compute the topic posteriors.
ldaVEM.mod <- LDA(dtm.lognet$reduced, k = 35, method = "VEM", control = list(seed = 2014))
topic.posteriors <- posterior(ldaVEM.mod)$topics
# Misclassification errors.
set.seed(2010) # for inTraining reproducibility
# createDataPartition balances the size of the target classes in the training set.
inTraining <- caret::createDataPartition(as.factor(classes), p = 0.75, list = FALSE)
mis.error <- compClass(topic.posteriors, classes, inTraining)
## End(Not run)