cvdglars: Cross-Validation Method for dgLARS

In package dglars: Differential Geometric Least Angle Regression

Description

Uses the k-fold cross-validation deviance to estimate the solution point of the dgLARS solution curve.

Usage

cvdglars(formula, family = gaussian, g, unpenalized, b_wght, data, subset,
         contrast = NULL, control = list())

cvdglars.fit(X, y, family = gaussian, g, unpenalized, b_wght, control = list())

Arguments

formula:

an object of class “formula”: a symbolic description of the model to be fitted. When the binomial family is used, the response can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where the first column is the number of “successes” and the second column is the number of “failures”.

family:

a description of the error distribution and link function used to specify the model. This can be a character string naming a family function or the result of a call to a family function (see family for details). By default the gaussian family with identity link function is used.

g:

argument available only for the ccd algorithm. When the ccd algorithm is used to fit the dgLARS model, this argument can be used to specify the values of the tuning parameter.

unpenalized:

a vector used to specify the unpenalized estimators; unpenalized can be a vector of integers or characters specifying the names of the predictors with unpenalized estimators.

b_wght:

a vector, with length equal to the number of columns of the matrix X, used to compute the weights used in the adaptive dgLARS method. b_wght is used to specify the initial estimates of the parameter vector.

data:

an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’.

subset:

an optional vector specifying a subset of observations to be used in the fitting process.

contrast:

an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’.

control:

a list of control parameters. See ‘Details’.

X:

design matrix of dimension n x p.

y:

response vector. When the binomial family is used, this argument can be a vector with entries 0 (failure) or 1 (success). Alternatively, the response can be a matrix where the first column is the number of “successes” and the second column is the number of “failures”.

Details

The cvdglars function runs dglars nfold+1 times. The deviance is stored, and its average and standard deviation over the folds are computed.
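The aggregation step described above can be illustrated with a minimal base R sketch. It is not the package's internal code: the matrix dev_fold and its simulated entries are hypothetical placeholders standing in for per-fold deviances, with one row per fold and one column per value of the tuning parameter.

```r
# Hypothetical per-fold deviances (simulated placeholders, not dglars output):
# one row per fold, one column per value of the tuning parameter g.
set.seed(1)
nfold <- 10
ng <- 5
dev_fold <- matrix(rexp(nfold * ng), nrow = nfold, ncol = ng)

# Average and standard deviation of the deviance over the folds.
dev_mean <- colMeans(dev_fold)
dev_sd <- apply(dev_fold, 2, sd)

# Index of the tuning parameter value with the smallest mean deviance.
best <- which.min(dev_mean)
```

The selected index plays the role of the point on the solution curve that cross-validation picks; in the package this choice is reported through the returned object.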

cvdglars.fit is the workhorse function: it is more efficient when the design matrix has already been calculated. For this reason we suggest using this function when the dgLARS method is applied in a high-dimensional setting, i.e. when p>n.

The control argument is a list that can supply any of the following components:

algorithm:

a string specifying the algorithm used to compute the solution curve. The predictor-corrector algorithm is used when algorithm = "pc" (default), while the cyclic coordinate descent method is used when algorithm = "ccd";

method:

a string used to specify the kind of solution curve. If method = "dgLASSO" (default), the algorithm computes the solution curve defined by the differential geometric generalization of the LASSO estimator; otherwise, if method = "dgLARS", the differential geometric generalization of the least angle regression method is used;

nfold:

a non-negative integer used to specify the number of folds. Although nfold can be as large as the sample size (leave-one-out CV), this is not recommended for large datasets. Default is nfold = 10;

foldid:

an n-dimensional vector of integers, between 1 and n, used to define the folds for the cross-validation. By default foldid is randomly generated;

ng:

number of values of the tuning parameter used to compute the cross-validation deviance. Default is ng = 100;

nv:

control parameter for the pc algorithm. An integer value belonging to the interval [1, min(n, p)] (default is nv = min(n-1, p)) used to specify the maximum number of variables included in the final model;

np:

control parameter for the pc/ccd algorithm. A non-negative integer used to define the maximum number of points of the solution curve. For the predictor-corrector algorithm np is set to 50 * min(n-1, p) (default), while for the cyclic coordinate descent method it is set to 100 (default), i.e. the number of values of the tuning parameter g;

g0:

control parameter for the pc/ccd algorithm. Sets the smallest value of the tuning parameter g. Default is g0 = ifelse(p<n, 1.0e-06, 0.05);

dg_max:

control parameter for the pc algorithm. A non-negative value used to specify the maximum length of the step size. With dg_max = 0 (default), the predictor-corrector algorithm uses the optimal step size (see Augugliaro et al. (2013) for more details) to approximate the value of the tuning parameter corresponding to the inclusion/exclusion of a variable from the model;

nNR:

control parameter for the pc algorithm. A non negative integer used to specify the maximum number of iterations of the Newton-Raphson algorithm used in the corrector step. Default is nNR = 200;

NReps:

control parameter for the pc algorithm. A non negative value used to define the convergence criterion of the Newton-Raphson algorithm. Default is NReps = 1.0e-06;

ncrct:

control parameter for the pc algorithm. When the Newton-Raphson algorithm does not converge, the step size (dg) is reduced by setting dg = cf * dg and the corrector step is repeated. ncrct is a non-negative integer used to specify the maximum number of trials for the corrector step. Default is ncrct = 50;

cf:

control parameter for the pc algorithm. The contraction factor is a real value belonging to the interval [0,1] used to reduce the step size as previously described. Default is cf = 0.5;

nccd:

control parameter for the ccd algorithm. A non-negative integer used to specify the maximum number of steps of the cyclic coordinate descent algorithm. Default is nccd = 1.0e+05;

eps:

control parameter for the pc/ccd algorithm. The meaning of this parameter depends on the algorithm used to estimate the solution curve:

i.

if algorithm = "pc", then eps is used:

a.

to identify a variable that will be included in the active set (absolute value of the corresponding Rao's score test statistic belongs to [g - eps, g + eps]);

b.

to establish if the corrector step must be repeated;

c.

to define the convergence of the algorithm, i.e., the actual value of the tuning parameter belongs to the interval [g0 - eps, g0 + eps];

ii.

if algorithm = "ccd", then eps is used to define the convergence for a single solution point, i.e., each inner coordinate-descent loop continues until the maximum change in the Rao's score test statistic, after any coefficient update, is less than eps.

Default is eps = 1.0e-05.
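As a hedged illustration of how the control list might be supplied, the following sketch overrides a few of the components listed above and leaves the rest at their defaults. The simulated data and the object names ctrl and fit_ctrl are assumptions made for this example only, and the fit is attempted only if the dglars package is installed.

```r
# Control components taken from the list above; all others keep their defaults.
ctrl <- list(algorithm = "pc",      # predictor-corrector algorithm
             method = "dgLASSO",    # kind of solution curve
             nfold = 5,             # number of folds
             ng = 50)               # number of tuning parameter values

# Simulated data, used only for illustration.
set.seed(123)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))

# Run the cross-validation only when the package is available.
if (requireNamespace("dglars", quietly = TRUE)) {
  library(dglars)
  fit_ctrl <- cvdglars.fit(X, y, family = binomial, control = ctrl)
  print(fit_ctrl)
}
```

Setting algorithm = "ccd" instead would switch to the cyclic coordinate descent method, in which case the components specific to the pc algorithm do not apply.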

Value

cvdglars returns an object with S3 class “cvdglars”, i.e. a list containing the following components:

call:

the call that produced this object;

formula_cv:

if the model is fitted by cvdglars, the formula used is returned;

family:

a description of the error distribution used in the model;

var_cv:

a character vector with the names of the variables selected by cross-validation;

beta:

the vector of the coefficients estimated by cross-validation;

phi:

the cross-validation estimate of the dispersion parameter;

dev_m:

a vector of length ng used to store the mean cross-validation deviance;

dev_v:

a vector of length ng used to store the variance of the mean cross-validation deviance;

g:

the value of the tuning parameter corresponding to the minimum of the cross-validation deviance;

g0:

the smallest value for the tuning parameter;

g_max:

the value of the tuning parameter corresponding to the starting point of the dgLARS solution curve;

X:

the design matrix used;

y:

the response vector used;

w:

the vector of weights used to compute the adaptive dgLARS method;

conv:

an integer value used to encode the warnings and the errors related to the algorithm used to fit the dgLARS solution curve. The values returned are:

0: convergence of the algorithm has been achieved,

1: problems related to the predictor-corrector method: error in predictor step,

2: problems related to the predictor-corrector method: error in corrector step,

3: maximum number of iterations has been reached,

4: error in dynamic memory allocation;

control:

the list of control parameters used to compute the cross-validation deviance.
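The components listed above can be read directly from the returned list. The sketch below is one possible way to inspect a fit; the simulated data and the object name fit_val are assumptions made for this example, and the code runs only if the dglars package is installed.

```r
# Simulated data, used only for illustration.
set.seed(123)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, binomial()$linkinv(1 + 2 * X[, 1]))

if (requireNamespace("dglars", quietly = TRUE)) {
  library(dglars)
  fit_val <- cvdglars.fit(X, y, family = binomial)

  # Tuning parameter value chosen by cross-validation.
  print(fit_val$g)

  # Names of the variables selected by cross-validation.
  print(fit_val$var_cv)

  # Mean cross-validation deviance along the grid of tuning parameter values.
  print(range(fit_val$dev_m))

  # Convergence code: 0 means the algorithm converged.
  print(fit_val$conv)

  # Coefficients via the coef method for objects of class cvdglars.
  print(coef(fit_val))
}
```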

Author(s)

Luigi Augugliaro
Maintainer: Luigi Augugliaro

References

Augugliaro L., Mineo A.M. and Wit E.C. (2014) dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. http://www.jstatsoft.org/v59/i08/.

Augugliaro L., Mineo A.M. and Wit E.C. (2013) dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.

Augugliaro L., Mineo A.M. and Wit E.C. (2012) Differential geometric LARS via cyclic coordinate descent method, in Proceedings of COMPSTAT 2012, pp. 67-79. Limassol, Cyprus.

See Also

coef.cvdglars, print.cvdglars and plot.cvdglars methods.
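Examples

###########################
# Logistic regression model
# y ~ Binomial
library(dglars)
set.seed(123)
n <- 100
p <- 100
# simulated design matrix with n observations and p predictors
X <- matrix(rnorm(n * p), n, p)
# intercept and coefficient of the first predictor
b <- 1:2
eta <- b[1] + X[, 1] * b[2]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
# cross-validation with the default control settings
fit_cv <- cvdglars.fit(X, y, family = binomial)
fit_cv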