glmnetLRC: Construct a lasso or elastic-net logistic regression...

Description

This function extends the glmnet and cv.glmnet functions from the glmnet package. It uses cross validation to identify optimal elastic-net parameters and a threshold parameter for binary classification, where optimality is defined by minimizing an arbitrary, user-specified discrete loss function.

Usage

glmnetLRC(truthLabels, predictors, lossMat = "0-1", lossWeight = rep(1,
  NROW(predictors)), alphaVec = seq(0, 1, by = 0.2), tauVec = seq(0.1, 0.9,
  by = 0.05), cvFolds = 5, cvReps = 100, stratify = FALSE,
  masterSeed = 1, nJobs = 1, estimateLoss = FALSE, verbose = FALSE, ...)

## S3 method for class 'glmnetLRC'
print(x, verbose = TRUE, ...)

## S3 method for class 'glmnetLRC'
plot(x, ...)

## S3 method for class 'glmnetLRC'
coef(object, tol = 1e-10, ...)

## S3 method for class 'glmnetLRC'
predict(object, newdata, truthCol = NULL,
  keepCols = NULL, ...)

## S3 method for class 'glmnetLRC'
missingpreds(object, newdata, ...)

## S3 method for class 'glmnetLRC'
extract(object, ...)

Arguments

truthLabels

A factor with two levels containing the true labels for each observation. If it is more desirable to correctly predict one of the two classes over the other, the second level of this factor should be the class you are most interested in predicting correctly.

predictors

A matrix whose columns are the explanatory regression variables. Note: factors are not currently supported. To include a factor variable with n levels, it must be represented as n-1 dummy variables in the matrix.

lossMat

Either the character string "0-1", indicating 0-1 loss, or a loss matrix of class lossMat, produced by lossMatrix, that specifies the penalties for classification errors.

lossWeight

A vector of non-negative weights used to calculate the expected loss. The default value is 1 for each observation.
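The role of these weights can be sketched in a few lines of base R (the loss and weight values below are illustrative, not package internals): the expected loss is the weighted mean of the per-observation losses.

```r
# Illustrative per-observation losses and weights (not package internals)
loss    <- c(0, 1, 5, 0)   # loss incurred by each observation
weights <- c(1, 1, 1, 1)   # lossWeight default: equal weight per observation

# Expected loss (risk) is the weighted mean of the losses
expectedLoss <- sum(weights * loss) / sum(weights)
expectedLoss  # 1.5
```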

alphaVec

A sequence in [0, 1] designating possible values for the elastic-net mixing parameter, α. A value of α = 1 is the lasso penalty, α = 0 is the ridge penalty. Refer to glmnet for further information.

tauVec

A sequence of τ threshold values in (0, 1) for the logistic regression classifier. For a new observation, if the predicted probability that the observation belongs to the second level of truthLabels exceeds τ, the observation is classified as belonging to the second level.
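The thresholding rule can be sketched as follows (the probabilities, τ value, and class labels below are illustrative):

```r
# Hypothetical predicted probabilities of belonging to the second level ("poor")
probs <- c(0.15, 0.62, 0.48, 0.91)
tau <- 0.5

# Classify as the second level when the probability exceeds tau
pred <- ifelse(probs > tau, "poor", "good")
pred  # "good" "poor" "good" "poor"
```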

cvFolds

The number of cross validation folds. cvFolds = length(truthLabels) gives leave-one-out (L.O.O.) cross validation, in which case cvReps is set to 1 and stratify is set to FALSE.

cvReps

The number of cross validation replicates, i.e., the number of times to repeat the cross validation by randomly repartitioning the data into folds and estimating the tuning parameters. For L.O.O. cross validation, this argument is set to 1 as there can only be one possible partition of the data.

stratify

A logical indicating whether stratified sampling should be used to ensure that observations from both levels of truthLabels are proportionally present in the cross validation folds. In other words, stratification attempts to ensure there are sufficient observations of each level of truthLabels in each training set to fit the model. Stratification may be required for small or imbalanced data sets. Note that stratification is not performed for L.O.O (when cvFolds = length(truthLabels)).
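A minimal base-R sketch of stratified fold assignment follows; it is illustrative only (the stratifiedFolds helper is hypothetical, not the package's internal implementation). The idea is to assign fold labels separately within each class so that both levels appear in every fold.

```r
# Hypothetical helper: assign fold labels within each class level so that
# every fold contains observations from both levels of the factor
stratifiedFolds <- function(truthLabels, cvFolds, seed = 1) {
  set.seed(seed)
  folds <- integer(length(truthLabels))
  for (lev in levels(truthLabels)) {
    idx <- which(truthLabels == lev)
    # Distribute this level's observations evenly across the folds, at random
    folds[idx] <- sample(rep_len(seq_len(cvFolds), length(idx)))
  }
  folds
}

# An imbalanced example: 40 "good" vs 10 "poor" observations
y <- factor(rep(c("good", "poor"), times = c(40, 10)))
f <- stratifiedFolds(y, cvFolds = 5)

# Each of the 5 folds contains both "good" and "poor" observations
table(f, y)
```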

masterSeed

The random seed used to generate unique (and repeatable) seeds for each cross validation replicate.

nJobs

The number of cores on the local host to use in parallelizing the training. Parallelization takes place at the cvReps level, i.e., if cvReps = 1, parallelization provides no benefit, whereas if cvReps = 2 and nJobs = 2, each cross validation replicate runs in its own thread. Parallelization is executed using parLapplyW() from the Smisc package.

estimateLoss

A logical, set to TRUE to calculate the average loss estimated via cross validation using the optimized parameters (α, λ, τ) to fit the elastic net model for each cross validation fold. This can be computationally expensive, as it requires another cross validation pass through the same partitions of the data, but using only the optimal parameters to estimate the loss for each cross validation replicate.

verbose

For glmnetLRC, a logical to turn on (or off) messages regarding the progress of the training algorithm. For the print method, if set to FALSE, it will suppress printing information about the glmnetLRC object and only invisibly return the results.

x

For the print and plot methods: an object of class glmnetLRC (returned by glmnetLRC()), which contains the optimally-trained elastic-net logistic regression classifier.

object

For the coef, predict, and extract methods: an object of class glmnetLRC (returned by glmnetLRC()) which contains the optimally-trained elastic-net logistic regression classifier.

tol

A small positive number, such that coefficients with an absolute value smaller than tol are not returned.

newdata

A dataframe or matrix containing the new set of observations to be predicted, as well as an optional column of true labels. newdata should contain all of the column names that were used to fit the elastic-net logistic regression classifier.

truthCol

The column number or column name in newdata that contains the true labels, which should be a factor (and this implies newdata should be a dataframe if truthCol is provided). Optional.

keepCols

A numeric vector of column numbers (or a character vector of column names) in newdata that will be 'kept' and returned with the predictions. Optional.

...

For glmnetLRC(), these are additional arguments to glmnet in the glmnet package. Certain arguments of glmnet are reserved by the glmnetLRC package and an error message will make that clear if they are used. In particular, arguments that control the behavior of α and λ are reserved. For the plot method, the "..." are additional arguments to the default S3 method pairs. And for the print, coef, predict, missingpreds, and extract methods, the "..." are ignored.

Details

For a given partition of the training data, cross validation is performed to estimate the optimal values of α (the mixing parameter of the ridge and lasso penalties) and λ (the regularization parameter), as well as the optimal threshold, τ, which is used to dichotomize the probability predictions of the elastic-net logistic regression model into binary outcomes. (Specifically, if the probability an observation belongs to the second level of truthLabels exceeds τ, it is classified as belonging to that second level.) In this case, optimality is defined as the set of parameters that minimize the risk, or expected loss, where the loss function is created using lossMatrix. The expected loss is calculated such that each observation in the data receives equal weight, unless non-uniform weights are supplied via lossWeight.

glmnetLRC() searches for the optimal values of α and τ by fitting the elastic-net model at the points of the two-dimensional grid defined by alphaVec and tauVec. For each value of α, the vector of λ values is selected automatically by glmnet according to its default arguments. The expected loss is calculated for each (α,λ,τ) triple, and the triple giving rise to the lowest risk designates the optimal model for a given cross validation partition, or cross validation replicate, of the data.

This process is repeated cvReps times, where each time a different random partition of the data is created using its own seed, resulting in another 'optimal' estimate of (α,λ,τ). The final estimate of (α,λ,τ) is given by the respective medians of those estimates. The final elastic-net logistic regression classifier is given by fitting the regression coefficients to all the training data using the optimal (α,λ,τ).
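The median aggregation across replicates can be sketched in base R (the parms values below are illustrative, not real cross validation output):

```r
# Illustrative tuning parameter estimates from three cross validation replicates
parms <- data.frame(alpha  = c(0.4,  0.6,  0.4),
                    lambda = c(0.01, 0.02, 0.015),
                    tau    = c(0.45, 0.50, 0.55))

# The final estimate of (alpha, lambda, tau) is the element-wise median
optimalParms <- apply(parms, 2, median)
optimalParms  # alpha = 0.4, lambda = 0.015, tau = 0.5
```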

The methodology is discussed in detail in the online package documentation.

Value

An object of class glmnetLRC, which inherits from classes lognet and glmnet. It contains the object returned by glmnet that has been fit to all the data using the optimal parameters (α, λ, τ). It also contains the following additional elements:

lossMat

The loss matrix used as the criteria for selecting optimal tuning parameters

parms

A data frame that contains the tuning parameter estimates for (α, λ, τ) that minimize the expected loss for each cross validation replicate. Used by the plot method.

optimalParms

A named vector that contains the final estimates of (α, λ, τ), calculated as the element-wise median of parms

lossEstimates

If estimateLoss = TRUE, this element is a data frame with the expected loss for each cross validation replicate

Author(s)

Landon Sego, Alex Venzin

References

Amidan BG, Orton DJ, LaMarche BL, Monroe ME, Moore RJ, Venzin AM, Smith RD, Sego LH, Tardiff MF, Payne SH. 2014. Signatures for Mass Spectrometry Data Quality. Journal of Proteome Research. 13(4), 2215-2222. http://pubs.acs.org/doi/abs/10.1021/pr401143e

Friedman J, Hastie T, Tibshirani R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software. 33(1), 1-22.

See Also

summary.LRCpred, a summary method for objects of class LRCpred, produced by the predict method.

Examples

# Load the VOrbitrap Shewanella QC data from Amidan et al.
data(traindata)

# Here we select the predictor variables
predictors <- as.matrix(traindata[,9:96])

# The logistic regression model requires a binary response
# variable. We will create a factor variable from the
# Curated Quality measurements. Note how we put "poor" as the
# second level in the factor.  This is because the principal
# objective of the classifier is to detect "poor" datasets
response <- factor(traindata$Curated_Quality,
                   levels = c("good", "poor"),
                   labels = c("good", "poor"))

# Specify the loss matrix. The "poor" class is the target of interest.
# The penalty for misclassifying a "poor" item as "good" results in a
# loss of 5.
lM <- lossMatrix(c("good","good","poor","poor"),
                 c("good","poor","good","poor"),
                 c(     0,     1,     5,     0))

# Display the loss matrix
lM

# Train the elastic-net classifier (we don't run it here because it takes a long time)
## Not run: 
glmnetLRC_fit <- glmnetLRC(response, predictors, lossMat = lM, estimateLoss = TRUE,
                           nJobs = parallel::detectCores())

## End(Not run)

# We'll load the precalculated model fit instead
data(glmnetLRC_fit)

# Show the optimal parameter values
print(glmnetLRC_fit)

# Show the coefficients of the optimal model
coef(glmnetLRC_fit)

# Show the plot of all the optimal parameter values for each cross validation replicate
plot(glmnetLRC_fit)

# Extract the 'glmnet' object from the glmnetLRC fit
glmnetObject <- extract(glmnetLRC_fit)

# See how the glmnet methods operate on the object
plot(glmnetObject)

# Look at the coefficients for the optimal lambda
coef(glmnetObject, s = glmnetLRC_fit$optimalParms["lambda"] )

# Load the new observations
data(testdata)

# Use the trained model to make predictions about
# new observations for the response variable.
new <- predict(glmnetLRC_fit, testdata, truthCol = "Curated_Quality", keepCols = 1:2)
head(new)

# Now summarize the performance of the model
summary(new)

# And plot the probability predictions of the model
plot(new, scale = 0.5, legendArgs = list(x = "topright"))

# If predictions are made without an indication of the ground truth,
# the summary is necessarily simpler:
summary(predict(glmnetLRC_fit, testdata))

pnnl/glmnetLRC documentation built on May 25, 2019, 10:22 a.m.