Fits a path of Generalised Linear Models with LASSO (or L1) penalties, and finds the best model by corss-validation.

Share:

Description

Fits a sequence (path) of generalised linear models with LASSO penalties, using an iteratively reweighted local linearisation approach. The whole path of models is returned, as well as the one that minimises predictive log-likelihood on random test observations. Can handle negative binomial family, even with overdispersion parameter unknown, as well as other GLM families.

Usage

1
2
cv.glm1path(object, block = NULL, best="min", plot=TRUE, prop.test=0.2, n.split = 10,
    seed=NULL, show.progress=FALSE, ...)

Arguments

object

Output from a glm1path fit.

block

A factor specifying a blocking variable, where training/test splits randomly assign blocks of observations to different groups rather than breaking up observations within blocks. Default (NULL) will randomly split rows into test and training groups.

best

How should the best-fitting model be determined? "1se" uses the one standard error rule, "min" (or any other value) will return the model with best predictive performance. WARNING: David needs to check se calculatios...

plot

Logical value indicating whether to plot the predictive log-likelihood as a function of model complexity.

prop.test

The proportion of observations (or blocks) to assign as test observations. Default value of 0.2 gives a 80:20 training:test split.

n.split

The number of random training/test splits to use. Default is 10 but the more the merrier (and the slower).

seed

A vector of seeds to use for the random test/training splits. This is useful if you want to be able to exactly replicate analyses, without Monte Carlo variation in the splits. Default will not used fixed seeds.

show.progress

Logical argument, if TRUE, console will report when a run for a seed has been completed. This option has been included because this function can take yonks to run on large datasets.

...

Further arguments passed through to glm1path.

Details

This function fits a series of LASSO-penalised generalised linear models, with different values for the LASSO penalty, as for glm1path. The main difference is that the best fitting model is selected by cross-validation, using n.test different random training/test splits to estimate predictive performance on new (test) data. Mean predictive log-likelihood (per test observation) is used as the criterion for choosing the best model, which has connections with the Kullback-Leibler distance. The best argument controls whether to select the model that maximises predictive log-likelihood, or the smallest model within 1se of the maximum (the '1 standard error rule').

All other details of this function are as for glm1path.

Value

coefficients

Vector of model coefficients for the best-fitting model (as judged by predictive log-likelihood)

lambda

The value of the LASOS penalty parameter, lambda, for the best-fitting model (as judged by predictive log-likelihood)

glm1.best

The glm1 fit for the best-fitting model (as judged by predictive log-likelihood). For what this contains see glm1.

all.coefficients

A matrix where each column represents the model coefficients for a fit along the path specified by lambdas.

lambdas

A vector specifying the path of values for the LASSO penalty, arranged from largest (strongest penalty, smallest fitted model) to smallest (giving the largest fitted model).

logL

A vector of log-likelihood values for each model along the path.

df

A vector giving the number of non-zero parameter estimates (a crude measure of degrees of freedom) for each model along the path.

bics

A vector of BIC values for each model along the path. Calculated using a penalty on model complexity as specified by input argument k.

counter

A vector counting how many iterations until convergence, for each model along the path.

check

A vector of logical values specifying whether or not Karush-Kuhn-Tucker conditions are satisfied at the solution.

phis

For negative binomial regression - a vector of overdispersion parameters, for each model along the path.

y

The vector of values for the response variable specified as an input argument.

X

The design matrix of p explanatory variables specified as an input argument.

penalty

The vector to be multiplied by each lambda to make the penalty for each fitted model.

family

The family argument specified as input.

ll.cv

The mean predictive log-likelihood, averaged over all observations and then over all training/test splits.

se

Estimated standard error of the mean predictive log-likelihood.

Author(s)

David I. Warton <David.Warton@unsw.edu.au>

References

Osborne, M.R., Presnell, B. and Turlach, B.A. (2000) On the LASSO and its dual. Journal of Computational and Graphical Statistics, 9, 319-337.

See Also

glm1path, \codeglm1, glm, family

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
data(spider)
Alopacce <- spider$abund[,1]
X <- cbind(1,spider$x)

# fit a LASSO-penalised negative binomial regression:
ft = glm1path(Alopacce,X,lam.min=0.1)
coef(ft)

# now estimate the best-fitting model by cross-validation:
cvft = cv.glm1path(ft)
coef(cvft)

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.