Description
compClass fits a logistic classification model (from packages glmnet and caret)
to the posterior topic compositions of each document,
as estimated by Latent Dirichlet Allocation.
A classes variable is used as the classification variable.
compClass is called by the mcLDA function.
Usage

compClass(predictors, classes, inTraining, train.glmnet, cv.parallel, train.parallel)

Arguments
predictors
    the matrix of predictors, i.e. the posterior topic compositions of each document.

classes
    factor, the classification variable.

inTraining
    the numeric ids of documents belonging to the training set.

train.glmnet
    logical. If …

cv.parallel
    logical. If …

train.parallel
    logical. If …
Details

This function recognizes the compositional nature of the predictors
and applies the principle of working on coordinates when dealing with compositional data.
Isometric log-ratio transformed versions of the predictors
(by the ilr
function from package compositions) are provided as input to the classification model.
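As an illustration of this pre-processing step, the sketch below shows the ilr transformation (assuming package compositions is installed; topic.posteriors stands for a document-by-topic posterior matrix, as in the Examples section):

```r
# Sketch of the ilr pre-processing step. Assumption: 'topic.posteriors' is a
# document-by-topic matrix whose rows are compositions summing to 1.
library(compositions)

# A K-part composition carries K - 1 degrees of freedom; ilr() maps each row
# to K - 1 unconstrained real coordinates suitable for standard models.
ilr.predictors <- ilr(acomp(topic.posteriors))

dim(ilr.predictors)  # n documents x (K - 1) coordinates
```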
We considered three different methods.
Method0 and Method1 are respectively built on the functions glmnet
and cv.glmnet
from package glmnet.
Method2 refers to the function train.glmnet
from package caret.
Method0 tends to overfit the training set.
Method1 and Method2 try to avoid overfitting problems using cross-validation.
Method2 uses repeated cross-validation and is more stable than Method1 but much more time-consuming (parallel computation is allowed).
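A rough sketch of the three fits might look as follows (x.train and y.train are hypothetical stand-ins for the ilr-transformed training predictors and the training classes; they are not taken from compClass internals):

```r
library(glmnet)
library(caret)

# Hypothetical inputs: x.train = ilr-transformed predictors (training rows),
# y.train = factor of classes for the same rows.
# (family = "binomial" assumes two classes; use "multinomial" otherwise.)

# Method0: plain glmnet fit over the whole lambda path (tends to overfit).
fit0 <- glmnet(x.train, y.train, family = "binomial")

# Method1: lambda selected by cross-validation.
fit1 <- cv.glmnet(x.train, y.train, family = "binomial")

# Method2: repeated cross-validation via caret::train (slower but more
# stable; parallel back ends, e.g. doParallel, are supported).
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit2 <- train(x.train, y.train, method = "glmnet", trControl = ctrl)
```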
Value

err

list of misclassification errors (error = 1 - Accuracy) and confusion matrices (from package caret):

e0.train
    train error from method "glmnet"
e1.train
    train error from method "cv.glmnet"
e2.train
    train error from method "train.glmnet"
e0.test
    test error from method "predict.glmnet"
e1.test
    test error from method "predict.cv.glmnet"
e2.test
    test error from method "predict.train.glmnet"
cm0
    confusion matrix for method "glmnet"
cm1
    confusion matrix for method "cv.glmnet"
cm2
    confusion matrix for method "train.glmnet"
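The error/Accuracy relation can be sketched as follows (pred and y.test are hypothetical predicted and true test labels):

```r
library(caret)

# Hypothetical factors: 'pred' = predicted classes, 'y.test' = true classes.
cm <- confusionMatrix(pred, y.test)

# Misclassification error as reported in 'err': 1 - Accuracy.
test.error <- 1 - as.numeric(cm$overall["Accuracy"])
```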
Note

Tuning parameters are alpha and lambda.
Method0 and Method1 do not tune alpha, which remains at its default value alpha = 1.
Method2 selects values for alpha and lambda using the tuning parameter grid defined by
expand.grid(alpha = seq(0.1, 1, 0.1), lambda = glmnetFit0$lambda).
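A sketch of how Method2 could use such a grid with caret (glmnetFit0 stands for an initial glmnet fit supplying the lambda path, as in the text; x.train and y.train are hypothetical training data):

```r
library(caret)

# Tuning grid as given above: alpha over 0.1..1, and the lambda path taken
# from an initial glmnet fit (here called glmnetFit0, as in the text).
tgrid <- expand.grid(alpha = seq(0.1, 1, 0.1), lambda = glmnetFit0$lambda)

fit2 <- train(x.train, y.train, method = "glmnet",
              tuneGrid = tgrid,
              trControl = trainControl(method = "repeatedcv",
                                       number = 10, repeats = 5))
```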
In Method1 the best model is selected using the "one standard error rule":
the default value of the penalty parameter lambda is s = "lambda.1se",
stored on the cv.glmnet object.
This rule takes a conservative approach; alternatively, s = "lambda.min" can be used.
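For a cv.glmnet fit, the two choices of s can be compared directly (fit1 and x.test are hypothetical objects, not compClass internals):

```r
# 'fit1' is a hypothetical cv.glmnet object; 'x.test' holds the
# ilr-transformed test predictors.

# Conservative one-standard-error rule (the default described above).
pred.1se <- predict(fit1, newx = x.test, s = "lambda.1se", type = "class")

# Lambda minimizing the cross-validated error (less regularized).
pred.min <- predict(fit1, newx = x.test, s = "lambda.min", type = "class")
```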
Full details are given in "The Elements of Statistical Learning" (T. Hastie, R. Tibshirani, J. Friedman), 2nd edition, p. 61.
Insights on compositions and their use in R can be found in "Analyzing compositional data with R"
(K. Gerald van den Boogaart, Raimon Tolosana-Delgado) Springer-Verlag 2013.
Examples

## Not run:
library(Supreme)
library(topicmodels)
# Input data.
data("dtm")
data("classes")
# Reduced dtm.lognet
dtm.lognet <- reduce_dtm(dtm, method = "lognet", classes = classes, export = TRUE)
# Run a 35-topic model over the reduced dtm.lognet and compute the topic posteriors.
ldaVEM.mod <- LDA(dtm.lognet$reduced, k = 35, method = "VEM", control = list(seed = 2014))
topic.posteriors <- posterior(ldaVEM.mod)$topics
# Misclassification errors.
set.seed(2010) # for inTraining reproducibility
inTraining <- caret::createDataPartition(as.factor(classes), p = 0.75, list = FALSE) # for balancing the size of target classes in training set
mis.error <- compClass(topic.posteriors, classes, inTraining)
## End(Not run)