glmnetLRC: Construct a lasso or elastic-net logistic regression...

Description

This function extends the glmnet and cv.glmnet functions from the glmnet package. It uses cross validation to identify optimal elastic-net parameters and a threshold parameter for binary classification, where optimality is defined by minimizing an arbitrary, user-specified discrete loss function.

Usage

glmnetLRC(truthLabels, predictors, lossMat = "0-1", lossWeight = rep(1,
  NROW(predictors)), alphaVec = seq(0, 1, by = 0.2), tauVec = seq(0.1, 0.9,
  by = 0.05), cvFolds = 5, cvReps = 100, stratify = FALSE,
  masterSeed = 1, nJobs = 1, estimateLoss = FALSE, verbose = FALSE, ...)

## S3 method for class 'glmnetLRC'
print(x, verbose = TRUE, ...)

## S3 method for class 'glmnetLRC'
plot(x, ...)

## S3 method for class 'glmnetLRC'
coef(object, tol = 1e-10, ...)

## S3 method for class 'glmnetLRC'
predict(object, newdata, truthCol = NULL,
  keepCols = NULL, ...)

## S3 method for class 'glmnetLRC'
missingpreds(object, newdata, ...)

## S3 method for class 'glmnetLRC'
extract(object, ...)

Arguments

truthLabels

A factor with two levels containing the true labels for each observation. If it is more desirable to correctly predict one of the two classes over the other, the second level of this factor should be the class you are most interested in predicting correctly.

predictors

A matrix whose columns are the explanatory regression variables. Note: factors are not currently supported. To include a factor variable with n levels, it must be represented as n-1 dummy variables in the matrix.

lossMat

Either the character string "0-1", indicating 0-1 loss, or a loss matrix of class lossMat, produced by lossMatrix, that specifies the penalties for classification errors.

lossWeight

A vector of non-negative weights used to calculate the expected loss. The default value is 1 for each observation.
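The role of these weights can be sketched in a few lines of base R (the loss and weight values below are illustrative, not package internals): the expected loss is the weighted mean of the per-observation losses.

```r
# Illustrative per-observation losses and weights (not package internals)
loss    <- c(0, 1, 5, 0)   # loss incurred by each observation
weights <- c(1, 1, 1, 1)   # lossWeight default: equal weight per observation

# Expected loss (risk) is the weighted mean of the losses
expectedLoss <- sum(weights * loss) / sum(weights)
expectedLoss  # 1.5
```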

alphaVec

A sequence in [0, 1] designating possible values for the elastic-net mixing parameter, α. A value of α = 1 is the lasso penalty, α = 0 is the ridge penalty. Refer to glmnet for further information.

tauVec

A sequence of τ threshold values in (0, 1) for the logistic regression classifier. For a new observation, if the predicted probability that the observation belongs to the second level of truthLabels exceeds τ, the observation is classified as belonging to the second level.
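The thresholding rule can be sketched as follows (the probabilities, τ value, and class labels below are illustrative):

```r
# Hypothetical predicted probabilities of belonging to the second level ("poor")
probs <- c(0.15, 0.62, 0.48, 0.91)
tau <- 0.5

# Classify as the second level when the probability exceeds tau
pred <- ifelse(probs > tau, "poor", "good")
pred  # "good" "poor" "good" "poor"
```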

cvFolds

The number of cross validation folds. cvFolds = length(truthLabels) gives leave-one-out (L.O.O.) cross validation, in which case cvReps is set to 1 and stratify is set to FALSE.

cvReps

The number of cross validation replicates, i.e., the number of times to repeat the cross validation by randomly repartitioning the data into folds and estimating the tuning parameters. For L.O.O. cross validation, this argument is set to 1 as there can only be one possible partition of the data.

stratify

A logical indicating whether stratified sampling should be used to ensure that observations from both levels of truthLabels are proportionally present in the cross validation folds. In other words, stratification attempts to ensure there are sufficient observations of each level of truthLabels in each training set to fit the model. Stratification may be required for small or imbalanced data sets. Note that stratification is not performed for L.O.O (when cvFolds = length(truthLabels)).
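A minimal base-R sketch of stratified fold assignment follows; it is illustrative only (the stratifiedFolds helper is hypothetical, not the package's internal implementation). The idea is to assign fold labels separately within each class so that both levels appear in every fold.

```r
# Hypothetical helper: assign fold labels within each class level so that
# every fold contains observations from both levels of the factor
stratifiedFolds <- function(truthLabels, cvFolds, seed = 1) {
  set.seed(seed)
  folds <- integer(length(truthLabels))
  for (lev in levels(truthLabels)) {
    idx <- which(truthLabels == lev)
    # Distribute this level's observations evenly across the folds, at random
    folds[idx] <- sample(rep_len(seq_len(cvFolds), length(idx)))
  }
  folds
}

# An imbalanced example: 40 "good" vs 10 "poor" observations
y <- factor(rep(c("good", "poor"), times = c(40, 10)))
f <- stratifiedFolds(y, cvFolds = 5)

# Each of the 5 folds contains both "good" and "poor" observations
table(f, y)
```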

masterSeed

The random seed used to generate unique (and repeatable) seeds for each cross validation replicate.

nJobs

The number of cores on the local host to use in parallelizing the training. Parallelization takes place at the cvReps level, i.e., if cvReps = 1, parallelization provides no benefit, whereas if cvReps = 2 and nJobs = 2, each cross validation replicate runs in its own thread. Parallelization is executed using parLapplyW() from the Smisc package.

estimateLoss

A logical, set to TRUE to calculate the average loss estimated via cross validation using the optimized parameters (α, λ, τ) to fit the elastic net model for each cross validation fold. This can be computationally expensive, as it requires another cross validation pass through the same partitions of the data, but using only the optimal parameters to estimate the loss for each cross validation replicate.

verbose

For glmnetLRC, a logical to turn on (or off) messages regarding the progress of the training algorithm. For the print method, if set to FALSE, it will suppress printing information about the glmnetLRC object and only invisibly return the results.

x

For the print and plot methods: an object of class glmnetLRC (returned by glmnetLRC()), which contains the optimally-trained elastic-net logistic regression classifier.

object

For the coef, predict, and extract methods: an object of class glmnetLRC (returned by glmnetLRC()) which contains the optimally-trained elastic-net logistic regression classifier.

tol

A small positive number, such that coefficients with an absolute value smaller than tol are not returned.

newdata

A dataframe or matrix containing the new set of observations to be predicted, as well as an optional column of true labels. newdata should contain all of the column names that were used to fit the elastic-net logistic regression classifier.

truthCol

The column number or column name in newdata that contains the true labels, which should be a factor (and this implies newdata should be a dataframe if truthCol is provided). Optional.

keepCols

A numeric vector of column numbers (or a character vector of column names) in newdata that will be 'kept' and returned with the predictions. Optional.

...

For glmnetLRC(), these are additional arguments to glmnet in the glmnet package. Certain arguments of glmnet are reserved by the glmnetLRC package and an error message will make that clear if they are used. In particular, arguments that control the behavior of α and λ are reserved. For the plot method, the "..." are additional arguments to the default S3 method pairs. And for the print, coef, predict, missingpreds, and extract methods, the "..." are ignored.

Details

For a given partition of the training data, cross validation is performed to estimate the optimal values of α (the mixing parameter of the ridge and lasso penalties) and λ (the regularization parameter), as well as the optimal threshold, τ, which is used to dichotomize the probability predictions of the elastic-net logistic regression model into binary outcomes. (Specifically, if the probability an observation belongs to the second level of truthLabels exceeds τ, it is classified as belonging to that second level.) In this case, optimality is defined as the set of parameters that minimize the risk, or expected loss, where the loss function is created using lossMatrix. The expected loss is calculated such that each observation in the data receives equal weight, unless non-uniform weights are supplied via lossWeight.

glmnetLRC() searches for the optimal values of α and τ by fitting the elastic-net model at the points of the two-dimensional grid defined by alphaVec and tauVec. For each value of α, the vector of λ values is selected automatically by glmnet according to its default arguments. The expected loss is calculated for each (α,λ,τ) triple, and the triple giving rise to the lowest risk designates the optimal model for a given cross validation partition, or cross validation replicate, of the data.

This process is repeated cvReps times, where each time a different random partition of the data is created using its own seed, resulting in another 'optimal' estimate of (α,λ,τ). The final estimate of (α,λ,τ) is given by the respective medians of those estimates. The final elastic-net logistic regression classifier is given by fitting the regression coefficients to all the training data using the optimal (α,λ,τ).
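The median aggregation across replicates can be sketched in base R (the parms values below are illustrative, not real cross validation output):

```r
# Illustrative tuning parameter estimates from three cross validation replicates
parms <- data.frame(alpha  = c(0.4,  0.6,  0.4),
                    lambda = c(0.01, 0.02, 0.015),
                    tau    = c(0.45, 0.50, 0.55))

# The final estimate of (alpha, lambda, tau) is the element-wise median
optimalParms <- apply(parms, 2, median)
optimalParms  # alpha = 0.4, lambda = 0.015, tau = 0.5
```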

The methodology is discussed in detail in the online package documentation.

Value

An object of class glmnetLRC, which inherits from classes lognet and glmnet. It contains the object returned by glmnet that has been fit to all the data using the optimal parameters (α, λ, τ). It also contains the following additional elements:

lossMat

The loss matrix used as the criteria for selecting optimal tuning parameters

parms

A data frame that contains the tuning parameter estimates for (α, λ, τ) that minimize the expected loss for each cross validation replicate. Used by the plot method.

optimalParms

A named vector that contains the final estimates of (α, λ, τ), calculated as the element-wise median of parms

lossEstimates

If estimateLoss = TRUE, this element is a data frame with the expected loss for each cross validation replicate

Author(s)

Landon Sego, Alex Venzin

References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Friedman J, Hastie T, Tibshirani R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 33(1), 1-22.

See Also

summary.LRCpred, a summary method for objects of class LRCpred, produced by the predict method.

Examples

# Load the VOrbitrap Shewanella QC data from Amidan et al.
data(traindata)

# Here we select the predictor variables
predictors <- as.matrix(traindata[,9:96])

# The logistic regression model requires a binary response
# variable. We will create a factor variable from the
# Curated Quality measurements. Note how we put "poor" as the
# second level in the factor.  This is because the principal
# objective of the classifier is to detect "poor" datasets
response <- factor(traindata$Curated_Quality,
                   levels = c("good", "poor"),
                   labels = c("good", "poor"))

# Specify the loss matrix. The "poor" class is the target of interest.
# The penalty for misclassifying a "poor" item as "good" results in a
# loss of 5.
lM <- lossMatrix(c("good","good","poor","poor"),
                 c("good","poor","good","poor"),
                 c(     0,     1,     5,     0))

# Display the loss matrix
lM

# Train the elastic-net classifier (we don't run it here because it takes a long time)
## Not run: 
glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE,
                           nJobs = parallel::detectCores())

## End(Not run)

# We'll load the precalculated model fit instead
data(glmnetLRC_fit)

# Show the optimal parameter values
print(glmnetLRC_fit)

# Show the coefficients of the optimal model
coef(glmnetLRC_fit)

# Show the plot of all the optimal parameter values for each cross validation replicate
plot(glmnetLRC_fit)

# Extract the 'glmnet' object from the glmnetLRC fit
glmnetObject <- extract(glmnetLRC_fit)

# See how the glmnet methods operate on the object
plot(glmnetObject)

# Look at the coefficients for the optimal lambda
coef(glmnetObject, s = glmnetLRC_fit$optimalParms["lambda"] )

# Load the new observations
data(testdata)

# Use the trained model to make predictions about
# new observations for the response variable.
new <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality", keepCols = 1:2)
head(new)

# Now summarize the performance of the model
summary(new)

# And plot the probability predictions of the model
plot(new, scale = 0.5, legendArgs = list(x = "topright"))

# If predictions are made without an indication of the ground truth,
# the summary is necessarily simpler:
summary(predict(glmnetLRC_fit, testdata))

pnnl/glmnetLRC documentation built on May 25, 2019, 10:22 a.m.