hdlm: Fitting High Dimensional Linear Models


Description

hdlm is used to fit high dimensional linear models when the number of predictors is greater than (or close to the same order of magnitude as) the number of observations.

Usage

hdlm(formula, data, subset,
      method = c('mc+', 'root-lasso', 'scad', 'lasso', 'dantzig'),
      p.value.method = c('one-split', 'bootstrap', 'none'),
      model = TRUE, x = FALSE, y = FALSE, N = NULL, C = NULL,
      sigma.hat = NULL, ...)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted. The details of model specification are given under ‘Details’.

data

an optional data frame, list or environment (or object coercible by as.data.frame to a data frame) containing the variables in the model. If not found in data, the variables are taken from environment(formula), typically the environment from which hdlm is called.

subset

an optional vector specifying a subset of observations to be used in the fitting process.

method

method for performing the optimization. See below for details.

p.value.method

method for calculating p-values. The default ('one-split') splits the data into two equal halves, uses method for model selection on one half, and then fits ordinary least squares on the other half, provided the selected model is small enough for the least squares fit to run. Otherwise, users can choose to bootstrap p-values ('bootstrap') or to not calculate them at all ('none'). A short sketch of the one-split procedure is given after this argument list.

model, x, y

logicals. If TRUE the corresponding components of the fit (the model frame, the model matrix, the response) are returned.

N

Number of bootstrap replicates used to estimate the standard error and bias. If set to NULL (the default), bootstrap estimates will not be calculated. Note that even when p.value.method is set to 'bootstrap', no bootstrap standard errors are produced if N is NULL.

C

Size of subset used to determine an upper bound on the noise level. If set to NULL, a non-positive integer, or an integer greater than n, C is assumed to be zero. Not used for root-lasso, which does not need to estimate the noise.

sigma.hat

Allows the user to supply an estimate of the noise variance; this overrides the option C above. Ignored when method='root-lasso', as the noise level is not needed for the square-root lasso method.

...

additional arguments to be passed to the low level regression fitting functions (see below).
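
The following is a minimal sketch of the one-split procedure described under p.value.method, written directly in terms of lars and lm rather than hdlm's internal code. The use of lars for the selection step and the fixed point on the lasso path are assumptions made for the illustration only.

## Sketch of the one-split idea (illustrative; hdlm's internals may differ)
library(lars)
set.seed(1)
x <- matrix(rnorm(40*30), ncol = 40, nrow = 30)
y <- x[,1] + x[,2] + rnorm(30, sd = 0.5)

half <- sample(seq_len(nrow(x)), nrow(x) %/% 2)
fit  <- lars(x[half, ], y[half], type = "lasso")
beta <- coef(fit, s = 0.5, mode = "fraction")   # selection on half one
keep <- which(beta != 0)

## ordinary least squares on the held-out half, if the selected model
## is small enough to fit
if (length(keep) > 0 && length(keep) < length(y) - length(half)) {
  ols <- lm(y[-half] ~ x[-half, keep, drop = FALSE] - 1)
  summary(ols)$coefficients[, "Pr(>|t|)"]       # split-sample p-values
}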

Details

Models for hdlm are specified symbolically. A typical model has the form response ~ terms where response is the (numeric) response vector and terms is a series of terms which specifies a linear predictor for response. A terms specification of the form first + second indicates all the terms in first together with all the terms in second with duplicates removed. A specification of the form first:second indicates the set of terms obtained by taking the interactions of all terms in first with all terms in second. The specification first*second indicates the cross of first and second. This is the same as first + second + first:second.
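
For a concrete sense of the expansion, the following base R snippet (independent of hdlm) shows the columns that model.matrix produces for each form:

d <- data.frame(a = rnorm(5), b = rnorm(5))
colnames(model.matrix(~ a + b, d))   # "(Intercept)" "a" "b"
colnames(model.matrix(~ a:b,   d))   # "(Intercept)" "a:b"
colnames(model.matrix(~ a*b,   d))   # "(Intercept)" "a" "b" "a:b"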

If the formula includes an offset, this is evaluated and subtracted from the response.

See model.matrix for some further details. The terms in the formula will be re-ordered so that main effects come first, followed by the interactions, all second-order, all third-order and so on: to avoid this pass a terms object as the formula (see aov and demo(glm.vr) for an example).

A formula has an implied intercept term. To remove this use either y ~ x - 1 or y ~ 0 + x. See formula for more details of allowed formulae. Note that the intercept term will be penalized along with other terms; for this reason it is recommended that an intercept be excluded from the model.
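
A quick base R check confirms that the two intercept-free forms are equivalent:

d <- data.frame(x1 = rnorm(5), x2 = rnorm(5))
identical(model.matrix(~ x1 + x2 - 1, d),
          model.matrix(~ 0 + x1 + x2, d))   # TRUE: both drop the intercept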

Each of the five available methods for fitting high dimensional regressions has a corresponding reference below. The options 'mc+' and 'scad' call lower level routines from the package plus. Similarly, 'lasso' calls a function from the package lars. The code for the 'root-lasso' method was adapted from MATLAB code written by Alex Belloni. For 'dantzig', the quantreg package is used, as suggested by Roger Koenker.
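
As a rough illustration of this delegation, the sketch below calls lars directly on data shaped like the Examples; the exact call and tuning used inside hdlm may differ.

library(lars)
set.seed(1)
x <- matrix(rnorm(40*30), ncol = 40, nrow = 30)
y <- x[,1] + x[,2] + rnorm(30, sd = 0.5)

fit  <- lars(x, y, type = "lasso")              # full lasso path
beta <- coef(fit, s = 0.5, mode = "fraction")   # coefficients at one
                                                # point on the path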

Value

hdlm returns an object of class "hdlm".

The function summary is used to obtain and print a summary of the results. The generic accessor functions coefficients, effects, fitted.values and residuals extract various useful features of the value returned by hdlm.
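
For example, assuming x and y as constructed in the Examples section below:

out <- hdlm(y ~ x - 1, method = 'lasso')
coef(out)        # point estimates
fitted(out)      # fitted values
residuals(out)   # residuals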

Future Additions

In the future, we hope to focus on optimizing the code for large datasets (a common situation in high-dimensional linear models). This will include parallel computing for the bootstrap method, as well as computing expensive quantities (the Gram matrix X'X, for instance) no more often than necessary.

Additionally, we hope to include the p-value methodology of Meinshausen, Meier, and Buhlmann, which, while more computationally demanding, provides a better estimate than the single-split method.

Note

The reported bootstrap bias and standard errors give the estimated bias and errors for the estimator, NOT for the underlying parameters. The distinction is necessary since all of these estimators are biased. The method 'mc+' is chosen as the default because it is 'nearly unbiased', and hence the distinction is negligible in most situations. If the user is only interested in point estimates and not standard errors, the 'root-lasso' method will generally be preferred, as it does not require explicit estimation of the noise variance. The Dantzig selector has the fastest execution time among our set of methods, and theory suggests that it is consistent even under very high collinearity in the model matrix.
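
The sketch below illustrates the point using lars directly rather than hdlm's internal bootstrap: because the lasso shrinks, the bootstrap mean centers near the (biased) estimate rather than near the true coefficient (here 1). The number of replicates and the fixed point on the path are assumptions made for the illustration.

library(lars)
set.seed(1)
x <- matrix(rnorm(40*30), ncol = 40, nrow = 30)
y <- x[,1] + x[,2] + rnorm(30, sd = 0.5)

est1 <- function(xs, ys) {
  ## first coefficient of a lasso fit at a fixed point on the path
  coef(lars(xs, ys, type = "lasso"), s = 0.5, mode = "fraction")[1]
}
boot <- replicate(100, {
  i <- sample(nrow(x), replace = TRUE)   # resample rows with replacement
  est1(x[i, ], y[i])
})
c(truth = 1, estimate = est1(x, y),
  boot.mean = mean(boot), boot.se = sd(boot))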

This package focuses on methods which both (i) produce sparse estimates and (ii) result from a global optimization problem. Users who do not require sparse estimates are directed to other methods such as ridge regression, elastic net, and the Bayesian lasso. For algorithms which do not fall under point (ii), popular choices are forward stagewise regression and OMP.

Author(s)

Created by Taylor B. Arnold for point estimation and confidence intervals in high-dimensional regression. The design of the function was inspired by the S/R function lm described in Chambers (1992).

References

Belloni, A., V. Chernozhukov, and L. Wang (2011) "Square-Root Lasso: Pivotal Recovery of Sparse Signals Via Conic Programming". arXiv preprint arXiv:1009.5689.

Bickel, P.J., Y. Ritov, and A.B. Tsybakov (2009) "Simultaneous analysis of Lasso and Dantzig selector". The Annals of Statistics 37.4, pp. 1705–1732.

Buhlmann, P. and S. Van De Geer (2011) Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer-Verlag New York Inc.

Candes, E. and T. Tao (2007) "The Dantzig selector: Statistical estimation when p is much larger than n". The Annals of Statistics 35.6, pp. 2313–2351.

Chambers, J. M. (1992) Linear models. Chapter 4 of Statistical Models in S eds J. M. Chambers and T. J. Hastie, Wadsworth & Brooks/Cole.

Efron, Hastie, Johnstone and Tibshirani (2003) "Least Angle Regression" (with discussion) Annals of Statistics; see also http://www-stat.stanford.edu/~hastie/Papers/LARS/LeastAngle_2002.pdf.

Fan, J., Y. Feng, and Y. Wu (2009) "Network exploration via the adaptive LASSO and SCAD penalties". Annals of Applied Statistics 3.2, pp. 521–541.

Hastie, Tibshirani and Friedman (2002) Elements of Statistical Learning, Springer, NY.

Wasserman, L. and K. Roeder (2009) "High Dimensional Variable Selection". The Annals of Statistics 37, pp. 2178–2201.

Zhang, C.H. (2010) "Nearly unbiased variable selection under minimax concave penalty". The Annals of Statistics 38.2, pp. 894–942.

See Also

lm for fitting traditional low dimensional models.

lars, dselector, plus, and rootlasso for lower level functions for the various methods.

Examples

set.seed(1)
x <- matrix(rnorm(40*30), ncol=40, nrow=30)
y <- x[,1] + x[,2] + rnorm(30,sd=0.5)

out <- hdlm(y ~ x - 1, N = 100, method='lasso')
summary(out)

out <- hdlm(y ~ x - 1, method='root-lasso')
summary(out)

out <- hdlm(y ~ x - 1, N = 100, method='mc+')
summary(out)
