Description
This function extends the glmnet
and cv.glmnet
functions from the glmnet
package. It uses cross validation to identify optimal elastic-net parameters and a
threshold parameter for binary classification, where optimality is defined
by minimizing an arbitrary, user-specified discrete loss function.
Usage

glmnetLRC(truthLabels, predictors, lossMat = "0-1",
          lossWeight = rep(1, NROW(predictors)),
          alphaVec = seq(0, 1, by = 0.2),
          tauVec = seq(0.1, 0.9, by = 0.05),
          cvFolds = 5, cvReps = 100, stratify = FALSE,
          masterSeed = 1, nJobs = 1, estimateLoss = FALSE,
          verbose = FALSE, ...)

## S3 method for class 'glmnetLRC'
print(x, verbose = TRUE, ...)

## S3 method for class 'glmnetLRC'
plot(x, ...)

## S3 method for class 'glmnetLRC'
coef(object, tol = 1e-10, ...)

## S3 method for class 'glmnetLRC'
predict(object, newdata, truthCol = NULL, keepCols = NULL, ...)

## S3 method for class 'glmnetLRC'
missingpreds(object, newdata, ...)

## S3 method for class 'glmnetLRC'
extract(object, ...)
Arguments

truthLabels
  A factor with two levels containing the true labels for each
  observation. If it is more desirable to correctly predict one of the
  two classes over the other, the second level of this factor should be
  the class you are most interested in predicting correctly.
predictors
  A matrix whose columns are the explanatory regression variables. Note:
  factors are not currently supported. To include a factor variable with
  n levels, it must be represented as n - 1 dummy variables in the
  matrix.
lossMat
  Either the character string "0-1", indicating that 0-1 loss should be
  used, or a loss matrix produced by lossMatrix() that specifies the
  penalty for each type of classification error.
lossWeight
  A vector of non-negative weights used to calculate the expected loss.
  The default value is 1 for each observation.
alphaVec
  A sequence in [0, 1] designating possible values for the elastic-net
  mixing parameter, α. A value of α = 1 is the lasso penalty; α = 0 is
  the ridge penalty. Refer to glmnet for further details.
tauVec
  A sequence of τ threshold values in (0, 1) for the logistic regression
  classifier. For a new observation, if the predicted probability that
  the observation belongs to the second level of truthLabels exceeds τ,
  it is classified as belonging to that second level.
cvFolds
  The number of cross validation folds.
cvReps
  The number of cross validation replicates, i.e., the number of times
  to repeat the cross validation by randomly repartitioning the data
  into folds and estimating the tuning parameters. For leave-one-out
  cross validation, this argument is set to 1, since there is only one
  possible partition of the data.
stratify
  A logical indicating whether stratified sampling should be used to
  ensure that observations from both levels of truthLabels are present
  in each of the cross validation folds.
masterSeed
  The random seed used to generate unique (and repeatable) seeds for
  each cross validation replicate.
nJobs
  The number of cores on the local host to use in parallelizing the
  training. Parallelization takes place at the level of the cross
  validation replicates, i.e., over cvReps.
estimateLoss
  A logical; set to TRUE to estimate the expected loss of the final
  classifier via an additional round of cross validation. Defaults to
  FALSE.
verbose
  For glmnetLRC(), a logical indicating whether progress messages
  should be printed. For the print method, setting verbose = FALSE
  suppresses the display of the output.
x
  For the print and plot methods: an object of class glmnetLRC,
  returned by glmnetLRC().
object
  For the coef, predict, missingpreds, and extract methods: an object
  of class glmnetLRC, returned by glmnetLRC().
tol
  A small positive number, such that coefficients with an absolute
  value smaller than tol are considered zero and are not returned by
  the coef method.
newdata
  A dataframe or matrix containing the new set of observations to be
  predicted, as well as an optional column of true labels.
truthCol
  The column number or column name in newdata that contains the true
  labels. Optional.
keepCols
  A numeric vector of column numbers (or a character vector of column
  names) in newdata that will be retained in the output of the predict
  method. Optional.
...
  For glmnetLRC(), additional arguments passed to glmnet(). For the
  methods, additional arguments passed to their generic counterparts.
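The interplay of cvFolds and stratify can be illustrated with a small base-R sketch. This is illustrative only (the makeFolds helper is hypothetical, not the package's internal code): it assigns each observation to a fold, optionally stratifying so that both levels of the response appear in every fold.

```r
# Illustrative sketch (hypothetical helper, not the package's code):
# assign each observation to one of cvFolds folds, optionally
# stratifying so both levels of the response appear in every fold.
makeFolds <- function(truthLabels, cvFolds = 5, stratify = FALSE, seed = 1) {
  set.seed(seed)
  n <- length(truthLabels)
  if (!stratify) {
    # Simple random partition into roughly equal folds
    return(sample(rep_len(1:cvFolds, n)))
  }
  # Stratified: partition each class separately, then recombine
  folds <- integer(n)
  for (lev in levels(truthLabels)) {
    idx <- which(truthLabels == lev)
    folds[idx] <- sample(rep_len(1:cvFolds, length(idx)))
  }
  folds
}

y <- factor(rep(c("good", "poor"), times = c(40, 10)))
f <- makeFolds(y, cvFolds = 5, stratify = TRUE)

# With stratification, every fold contains both classes
table(y, f)
```

Without stratification, a rare class (here, 10 "poor" out of 50) can by chance be absent from some folds, which destabilizes the loss estimates; stratified sampling guards against that.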
Details

For a given partition of the training data, cross validation is
performed to estimate the optimal values of α (the mixing parameter of
the ridge and lasso penalties) and λ (the regularization parameter), as
well as the optimal threshold, τ, which is used to dichotomize the
probability predictions of the elastic-net logistic regression model
into binary outcomes. (Specifically, if the probability that an
observation belongs to the second level of truthLabels exceeds τ, it is
classified as belonging to that second level.) In this case, optimality
is defined as the set of parameters that minimizes the risk, or
expected loss, where the loss function is created using lossMatrix().
The expected loss is calculated such that each observation in the data
receives equal weight by default.

glmnetLRC() searches for the optimal values of α and τ by fitting the
elastic-net model at the points of the two-dimensional grid defined by
alphaVec and tauVec. For each value of α, the vector of λ values is
selected automatically by glmnet according to its default arguments.
The expected loss is calculated for each (α, λ, τ) triple, and the
triple giving rise to the lowest risk designates the optimal model for
a given cross validation partition, or cross validation replicate, of
the data.

This process is repeated cvReps times, where each time a different
random partition of the data is created using its own seed, resulting
in another 'optimal' estimate of (α, λ, τ). The final estimate of
(α, λ, τ) is given by the respective medians of those estimates. The
final elastic-net logistic regression classifier is obtained by fitting
the regression coefficients to all the training data using the optimal
(α, λ, τ).

The methodology is discussed in detail in the online package
documentation.
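The threshold search described above can be sketched in base R. The pickTau helper below is hypothetical (not the package's implementation): for a fixed set of predicted probabilities, it computes the expected loss at each τ in a grid and returns the minimizer, using the same asymmetric penalties as the example loss matrix later in this page.

```r
# Illustrative sketch (hypothetical helper, not the package's code):
# given predicted probabilities of the second class and the true labels,
# compute the expected loss for each tau and return the minimizer.
pickTau <- function(prob, truth, lossFalseGood = 5, lossFalsePoor = 1,
                    tauVec = seq(0.1, 0.9, by = 0.05)) {
  # truth is a factor whose second level is the class of interest
  secondLevel <- levels(truth)[2]
  expectedLoss <- sapply(tauVec, function(tau) {
    predicted <- ifelse(prob > tau, levels(truth)[2], levels(truth)[1])
    # Misclassifying the second level is penalized more heavily (loss 5 vs 1)
    loss <- ifelse(truth == secondLevel & predicted != secondLevel,
                   lossFalseGood,
                   ifelse(truth != secondLevel & predicted == secondLevel,
                          lossFalsePoor, 0))
    mean(loss)   # equal weight per observation
  })
  tauVec[which.min(expectedLoss)]
}

# Toy data: probabilities that separate the two classes reasonably well
truth <- factor(c(rep("good", 6), rep("poor", 4)), levels = c("good", "poor"))
prob  <- c(0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.70, 0.85, 0.95)
pickTau(prob, truth)
```

In the full procedure this minimization happens jointly with the (α, λ) search inside each cross validation replicate, and the per-replicate winners are then summarized by their medians.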
Value

An object of class glmnetLRC, which inherits from classes lognet and
glmnet. It contains the object returned by glmnet that has been fit to
all the data using the optimal parameters (α, λ, τ). It also contains
the following additional elements:

- The loss matrix used as the criterion for selecting the optimal
  tuning parameters.
- parms: a data frame that contains the tuning parameter estimates for
  (α, λ, τ) that minimize the expected loss for each cross validation
  replicate. Used by the plot method.
- optimalParms: a named vector that contains the final estimates of
  (α, λ, τ), calculated as the element-wise median of parms.
- If estimateLoss = TRUE, a data frame with the expected loss for each
  cross validation replicate.
Methods (by generic)

print
  Displays the overall optimized values of (α, λ, τ), with the
  corresponding degrees of freedom and deviance for the model fit to
  all the data using the optimized parameters. If estimateLoss = TRUE
  when glmnetLRC() was called, the mean and standard deviation of the
  expected loss are also shown. In addition, all of this same
  information is returned invisibly as a matrix. Display of the
  information can be suppressed by setting verbose = FALSE in the call
  to print.

plot
  Produces a pairs plot of the tuning parameters (α, λ, τ), along with
  their univariate histograms, that were identified as optimal for each
  of the cross validation replicates. This can provide a sense of the
  stability of the estimates of the tuning parameters.

coef
  Calls the predict method in glmnet on the fitted glmnet object and
  returns a named vector of the non-zero elastic-net logistic
  regression coefficients using the optimal values of α and λ.

predict
  Predicts (or classifies) new data from a glmnetLRC object. Returns an
  object of class LRCpred (which inherits from data.frame) that
  contains the predicted probabilities (Prob) and class (predictClass)
  for each observation. The Prob column corresponds to the predicted
  probability that an observation belongs to the second level of
  truthLabels. The columns indicated by truthCol and keepCols are
  included if they were requested. The LRCpred class has two methods:
  summary.LRCpred and plot.LRCpred.

missingpreds
  Identifies the set of predictors in a glmnetLRC object that are not
  present in newdata. Returns a character vector of the missing
  predictor names. If no predictors are missing, it returns
  character(0).

extract
  Extracts the glmnet object that was fit using the optimal parameter
  estimates of (α, λ). Returns an object of class c("lognet", "glmnet")
  that can be passed to the various methods available in the glmnet
  package.
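The missingpreds check is conceptually a set difference between the model's predictor names and the columns of newdata. In base-R terms (illustrative only; the predictor names below are hypothetical stand-ins, not necessarily columns of the package's datasets):

```r
# Illustrative sketch: which predictors used by the model are absent
# from the new data? This mirrors what missingpreds reports.
modelPredictors <- c("XIC_WideFrac", "XIC_FWHM_Q1", "MS1_Count")  # hypothetical names
newdata <- data.frame(XIC_WideFrac = 0.3, MS1_Count = 1200)

setdiff(modelPredictors, colnames(newdata))
```

When no predictors are missing, the set difference is character(0), matching the documented return value of missingpreds.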
Author(s)

Landon Sego, Alex Venzin
References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM,
Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass
Spectrometry Data Quality. Journal of Proteome Research, 13(4),
2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Friedman J, Hastie T, Tibshirani R. 2010. Regularization Paths for
Generalized Linear Models via Coordinate Descent. Journal of
Statistical Software, 33(1), 1-22.
See Also

summary.LRCpred, a summary method for objects of class LRCpred,
produced by the predict method.
Examples

# Load the VOrbitrap Shewanella QC data from Amidan et al.
data(traindata)

# Select the predictor variables
predictors <- as.matrix(traindata[, 9:96])

# The logistic regression model requires a binary response
# variable. We will create a factor variable from the
# Curated Quality measurements. Note how we put "poor" as the
# second level in the factor. This is because the principal
# objective of the classifier is to detect "poor" datasets.
response <- factor(traindata$Curated_Quality,
                   levels = c("good", "poor"),
                   labels = c("good", "poor"))

# Specify the loss matrix. The "poor" class is the target of interest.
# Misclassifying a "poor" item as "good" results in a loss of 5.
lM <- lossMatrix(c("good", "good", "poor", "poor"),
                 c("good", "poor", "good", "poor"),
                 c(    0,      1,      5,      0))

# Display the loss matrix
lM

# Train the elastic-net classifier (not run here because it takes a long time)
## Not run:
glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM,
                           estimateLoss = TRUE,
                           nJobs = parallel::detectCores())
## End(Not run)

# Load the precalculated model fit instead
data(glmnetLRC_fit)

# Show the optimal parameter values
print(glmnetLRC_fit)

# Show the coefficients of the optimal model
coef(glmnetLRC_fit)

# Plot the optimal parameter values for each cross validation replicate
plot(glmnetLRC_fit)

# Extract the 'glmnet' object from the glmnetLRC fit
glmnetObject <- extract(glmnetLRC_fit)

# See how the glmnet methods operate on the object
plot(glmnetObject)

# Look at the coefficients for the optimal lambda
coef(glmnetObject, s = glmnetLRC_fit$optimalParms["lambda"])

# Load the new observations
data(testdata)

# Use the trained model to make predictions about
# new observations of the response variable
new <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality",
               keepCols = 1:2)
head(new)

# Summarize the performance of the model
summary(new)

# Plot the probability predictions of the model
plot(new, scale = 0.5, legendArgs = list(x = "topright"))

# If predictions are made without an indication of the ground truth,
# the summary is necessarily simpler:
summary(predict(glmnetLRC_fit, testdata))