dglars | R Documentation |
dglars
function is used to estimate the solution curve defined by dgLARS method.
dglars(formula, family = gaussian, g, unpenalized,
b_wght, data, subset, contrasts = NULL, control = list())
dglars.fit(X, y, family = gaussian, g, unpenalized,
b_wght, control = list())
formula |
an object of class “ |
family |
a description of the error distribution and link
function used to specify the model. This can be a character string
naming a family function or the result of a call to a family function
(see |
g |
argument available only for |
unpenalized |
a vector used to specify the unpenalized estimators;
|
b_wght |
a |
data |
an optional data frame, list or environment (or object coercible by ‘as.data.frame’ to a data frame) containing the variables in the model. If not found in ‘data’, the variables are taken from ‘environment(formula)’. |
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
contrasts |
an optional list. See the ‘contrasts.arg’ of ‘model.matrix.default’. |
control |
a list of control parameters. See ‘Details’. |
X |
design matrix of dimension |
y |
response vector. When the |
dglars
function implements the differential geometric generalization
of the least angle regression method (Efron et al., 2004) proposed in
Augugliaro et al. (2013) and Pazira et al. (2017).
As in “glm
”, the user can specify family and link function using
the argument family
. When the binomial
family is used, the responce
can be a vector with entries 0/1 (failure/success) or, alternatively, a matrix where
the first column is the number of “successes” and the second column is the number
of “failures”. Starting with the version 2.0.0, the model can be specified combining
family and link functions as describted in the following table:
Family | Link |
gaussian | ‘identity ’, ‘log ’ and ‘inverse ’ |
binomial | ‘logit ’, ‘probit ’, ‘cauchit ’, ‘log ’ and ‘cloglog ’ |
poisson | ‘log ’, ‘identity ’, and ‘sqrt ’ |
Gamma | ‘inverse ’, ‘identity ’ and ‘log ’ |
inverse.gaussian | ‘1/mu^2 ’, ‘inverse ’, ‘identity ’, and ‘log ’
|
The R
code for binomial, Gamma and inverse gaussian families is due to
Hassan Pazira while the fortran version is due to Luigi Augugliaro.
dglars.fit
is a workhorse function: it is more efficient when the design
matrix does not require manipulations. For this reason we suggest to use this function
when the dgLARS method is applied in a high-dimensional setting, i.e., when p>n
.
When gaussian, gamma or inverse.gaussian is used to model the error distribution, dglars
returns the vector of the estimates of the dispersion parameter \phi
; by
default, the generalized Pearson statistic is used as estimator but the user can use
the function phihat
to specify other estimators (see phihat
for
more details).
The dgLARS solution curve can be estimated using two different algorithms, i.e. the
predictor-corrector method and the cyclic coordinate descent method (see below for
more details about the argument algorithm
). The first algorithm is
based on two steps. In the first step, called predictor step, an approximation of
the point that lies on the solution curve is computed. If the control parameter
dg_max
is equal to zero, in this step it is also computed an approximation
of the optimal step size using a generalization of the method proposed in Efron
et al. (2004). The optimal step size is defined as the reduction of the tuning parameter,
denoted by d\gamma
, such that at \gamma-d\gamma
there is a change in the
active set. In the second step, called corrector step, a Newton-Raphson algorithm is used to
correct the approximation previously computed. The main problem of this algorithm is that the
number of arithmetic operations required to compute the approximation scales as the cube
of the variables, this means that such algorithm is cumbersome in a high dimensional setting.
To overcome this problem, the second algorithm compute the dgLARS solution curve using an
adaptive version of the cyclic coordinate descent method proposed in Friedman et al. (2010).
The argument control
is a list that can supply any of the following components:
algorithm
:a string specifying the algorithm used to
compute the solution curve. The predictor-corrector algorithm is used
when algorithm = ''pc''
(default), while the cyclic coordinate
descent method is used setting algorithm = ''ccd''
;
method
:a string by means of to specify the kind of solution curve.
If method = ''dgLASSO''
(default) the algorithm computes the solution
curve defined by the differential geometric generalization of the LASSO
estimator; otherwise (method = ''dgLARS''
) the differential geometric
generalization of the least angle regression method is used;
nv
:control parameter for the pc
algorithm. An integer value
between 1 and \min(n,p)
used to specify the maximum number of
variables in the final model. Default is nv = min(n - 1, p)
;
np
:control parameter for the pc/ccd
algorithm. A non negative
integer used to define the maximum number of solution points. For the predictor-corrector
algorithm np
is set to 50\times\min(n - 1, p)
(default); for
the cyclic coordinate descent method, if g
is not specified, this argument is
set equal to 100 (default);
g0
:control parameter for the pc/ccd
algorithm. This parameter is
used to set the smallest value for the tuning parameter \gamma
. Default is
g0 = ifelse(p < n, 1.0e-04, 0.05)
; this argument is not required when g
is
used with the cyclic coordinate descent algorithm;
dg_max
:control parameter for the pc
algorithm. A non negative
value used to specify the largest value for the step size. Setting dg_max = 0
(default) the predictor-corrector algorithm computes an approximation of the optimal
step size (see Augugliaro et al. (2013) for more details);
nNR
:control criterion parameter for the pc
algorithm. A non
negative integer used to specify the maximum number of iterations of the Newton-Raphson
algorithm. Default is nNR = 50
;
NReps
:control parameter for the pc
algorithm. A non negative
value used to define the convergence of the Newton-Raphson algorithm. Default is
NReps = 1.0e-06
;
ncrct
:control parameter for the pc
algorithm. When the Newton-Raphson
algorithm does not converge, the step size (d\gamma
) is reduced by
d\gamma = cf \cdot d\gamma
and the corrector step is repeated. ncrct
is a non negative integer used to specify the maximum number of trials for the corrector step.
Default is ncrct = 50
;
cf
:control parameter for the pc
algorithm. The contractor factor
is a real value belonging to the interval [0,1]
used to reduce the step size
as previously described. Default is cf = 0.5
;
nccd
:control parameter for the ccd
algorithm. A non negative integer
used to specify the maximum number for steps of the cyclic coordinate descent algorithm.
Default is 1.0e+05
.
eps
control parameter for the pc/ccd
algorithm. The meaning of
this parameter is related to the algorithm used to estimate the solution curve:
i.
if algorithm = ''pc''
then eps
is used
a.
to identify a variable that will be included in the active
set (absolute value of the corresponding Rao's score test
statistic belongs to
[\gamma - \code{eps}, \gamma + \code{eps}]
);
b.
to establish if the corrector step must be repeated;
c.
to define the convergence of the algorithm, i.e., the
actual value of the tuning parameter belongs to the interval
[\code{g0 - eps},\code{g0 + eps}]
;
ii.
if algorithm = ''ccd''
then eps
is used to define the
convergence for a single solution point, i.e., each inner coordinate-descent loop
continues until the maximum change in the Rao's score test statistic, after any
coefficient update, is less than eps
.
Default is eps = 1.0e-05.
dglars
returns an object with S3 class “dglars
”, i.e., a list containing the
following components:
call |
the call that produced this object; |
formula |
if the model is fitted by |
family |
a description of the error distribution used in the model; |
unpenalized |
the vector used to specify the unpenalized estimators; |
np |
the number of points of the dgLARS solution curve; |
beta |
the |
phi |
the |
ru |
the matrix of the Rao's score test statistics of the variables included in the final model. This component is reported only if the predictor-corrector algorithm is used to fit the model; |
dev |
the |
nnonzero |
the sequence of number of nonzero coefficients for each value of the
tuning parameter |
g |
the sequence of |
X |
the used design matrix; |
y |
the used response vector; |
w |
the vector of weights used to compute the adaptive dglars method; |
action |
a |
conv |
an integer value used to encode the warnings and the errors related to the algorithm used to fit the model. The values returned are:
|
control |
the list of control parameters used to compute the dgLARS solution curve. |
Luigi Augugliaro and Hassan Pazira
Maintainer: Luigi Augugliaro luigi.augugliaro@unipa.it
Augugliaro L., Mineo A.M. and Wit E.C. (2016) <doi:10.1093/biomet/asw023> A differential-geometric approach to generalized linear models with grouped predictors, Biometrika, Vol 103(3), 563-577.
Augugliaro L., Mineo A.M. and Wit E.C. (2014) <doi:10.18637/jss.v059.i08> dglars: An R Package to Estimate Sparse Generalized Linear Models, Journal of Statistical Software, Vol 59(8), 1-40. https://www.jstatsoft.org/v59/i08/.
Augugliaro L., Mineo A.M. and Wit E.C. (2013) <doi:10.1111/rssb.12000> dgLARS: a differential geometric approach to sparse generalized linear models, Journal of the Royal Statistical Society. Series B., Vol 75(3), 471-498.
Efron B., Hastie T., Johnstone I. and Tibshirani R. (2004) <doi:10.1214/009053604000000067> Least Angle Regression, The Annals of Statistics, Vol. 32(2), 407-499.
Friedman J., Hastie T. and Tibshirani R. (2010) <doi:10.18637/jss.v033.i01> Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22.
Pazira H., Augugliaro L. and Wit E.C. (2018) <doi:10.1007/s11222-017-9761-7> Extended di erential geometric LARS for high-dimensional GLMs with general dispersion parameter, Statistics and Computing, Vol. 28(4), 753-774.
coef.dglars
, phihat
, plot.dglars
, print.dglars
and summary.dglars
methods.
set.seed(123)
#############################
# y ~ Binomial
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 1 + 2 * X[,1]
mu <- binomial()$linkinv(eta)
y <- rbinom(n, 1, mu)
fit <- dglars(y ~ X, family = binomial)
fit
# adaptive dglars method
b_wght <- coef(fit)$beta[, 20]
fit <- dglars(y ~ X, family = binomial, b_wght = b_wght)
fit
# the first three coefficients are not penalized
fit <- dglars(y ~ X, family = binomial, unpenalized = 1:3)
fit
# 'probit' link function
fit <- dglars(y ~ X, family = binomial("probit"))
fit
############################
# y ~ Poisson
n <- 100
p <- 100
X <- matrix(rnorm(n * p), n, p)
eta <- 2 + 2 * X[,1]
mu <- poisson()$linkinv(eta)
y <- rpois(n, mu)
fit <- dglars(y ~ X, family = poisson)
fit
############################
# y ~ Gamma
n <- 100
p <- 100
X <- matrix(abs(rnorm(n*p)),n,p)
eta <- 1 + 2 * X[,1]
mu <- drop(Gamma()$linkinv(eta))
shape <- 0.5
phi <- 1 / shape
y <- rgamma(n, scale = mu / shape, shape = shape)
fit <- dglars(y ~ X, Gamma("log"))
fit
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.