lassosig-methods: High-dimensional significance test for LASSO using the multi-sample splitting method



Description

Provides p-values for lasso regression. This method implements the multi-sample splitting method for significance testing in a high-dimensional regression context. The basic idea is to split the sample in two, perform variable selection using LASSO on one half, and derive p-values using ordinary least squares (OLS) on the other half.

Note that the penalisation parameter λ and the argument s are identical; s is named this way for consistency with the glmnet package.

This method is only implemented for a single response variable, since general lasso regression requires the same set of parameters to be selected for every response variable, which is overly restrictive in some cases.


Usage

LassoSig(y, X, B = 100, s = c("lambda.min", "lambda.1se", "halfway",
  "usefixed"), gamma.min = 0.05, fixedP = NULL, include = NULL,
  nfolds = 10, intercept = TRUE)



Arguments

y: Response vector

X: Design matrix

B: Number of times to partition the sample

s: The value of lambda to use in the lasso. Can be:

  • An actual value of lambda you have determined

  • lambda.min The lambda that minimises the mean squared error in the glmnet cross-validation (recommended)

  • lambda.1se The largest lambda whose cross-validation error is within 1 standard error of that of lambda.min, and which therefore selects the fewest parameters

  • halfway The lambda halfway between lambda.min and lambda.1se

  • usefixed The smallest lambda that selects a given number of parameters (set by fixedP)

include: Set of predictors to be force-included in the OLS analysis

gamma.min: Lower bound for gamma in the adaptive search for the best p-value (default 0.05)

fixedP: The fixed number of parameters to use (if s == "usefixed")

nfolds: Number of folds to use in the glmnet cross-validation

intercept: Whether to include an intercept in the OLS regression (default TRUE)


Details

The method works by randomly partitioning the dataset into two halves. Lasso regression is performed on one half and, for a particular value of the penalisation parameter lambda, a subset of the predictor variables is selected. Ordinary least squares regression is then performed on the other half of the data using the selected predictors. If S variables are selected for a given split, each resulting p-value p is Bonferroni-adjusted to min(S·p, 1), and the p-values of all variables not selected in that split are set to 1. This process is repeated B times, yielding B sets of p-values, which are then aggregated across splits to give a single p-value for each predictor variable. For full details see the original paper. The aggregation requires an extra parameter γ_min, which is recommended to be 0.05 (and set by the argument gamma.min).
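The aggregation step above can be sketched in language-agnostic terms. The following Python snippet (an illustration only, not part of this R package; the function name aggregate_pvalues is hypothetical) implements the adaptive quantile aggregation of Meinshausen, Meier and Bühlmann on a matrix of per-split Bonferroni-adjusted p-values:

```python
import numpy as np

def aggregate_pvalues(P, gamma_min=0.05):
    """Aggregate a (B x p) matrix of per-split, Bonferroni-adjusted
    p-values into one p-value per predictor, via adaptive quantile
    aggregation (Meinshausen, Meier & Buhlmann, 2009)."""
    B, p = P.shape
    # Empirical grid of gamma values in [gamma_min, 1]
    gammas = np.arange(np.ceil(gamma_min * B), B + 1) / B
    Q = np.empty((len(gammas), p))
    for i, g in enumerate(gammas):
        # Q_j(gamma) = min(1, empirical gamma-quantile of p_j^(b) / gamma)
        Q[i] = np.minimum(1.0, np.quantile(P / g, g, axis=0))
    # Adaptive search over gamma with (1 - log gamma_min) multiplicity correction
    return np.minimum(1.0, (1 - np.log(gamma_min)) * Q.min(axis=0))
```

Predictors never selected (per-split p-values all 1) aggregate to exactly 1, while predictors with consistently tiny per-split p-values remain highly significant after the correction.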

Care must be taken with regard to the number of measurements (length(y)) and the number of folds for cross-validation (nfolds). The package glmnet requires at least 3 samples in each cross-validation fold when finding the optimal λ (not to be confused with a multi-sample split). Therefore, if we start with N samples, glmnet receives at least floor(N/2) of them, which it then splits into nfolds folds for cross-validation. As such we necessarily need

floor(floor(N/2)/nfolds) ≥ 3

which is safely satisfied provided N > 6*nfolds + 3.
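As a quick sanity check, the constraint can be expressed directly; this is a standalone sketch (enough_samples is a hypothetical helper, not part of the package):

```python
def enough_samples(N, nfolds=10):
    """Check that each glmnet CV fold on the training half of a
    multi-sample split receives at least 3 observations."""
    per_fold = (N // 2) // nfolds   # floor(floor(N/2) / nfolds)
    return per_fold >= 3
```

With the default nfolds = 10, the rule N > 6*nfolds + 3 requires N ≥ 64.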

When choosing B, there is a trade-off between bias and computational efficiency. A larger B leads to a less biased result (i.e. one less sensitive to the particular random splits) but can require significantly more computation time. A heuristically 'good' value is B = 50.

The choice of λ (passed as s) is detailed in the glmnet package documentation. For standard problems the best choice is usually lambda.min, though if you specifically want to minimise the number of selected parameters, lambda.1se (a one, not an L) is a good choice. Alternatively, it may be advantageous to select a fixed number of parameters on every split; this can be done by setting s = "usefixed" and fixedP to the desired number of parameters.

Occasionally it is necessary to force the inclusion of predictors into the OLS significance testing. These can be included by setting include to the numeric indices (i.e. the column numbers) of the predictors to force-include.

Note that forcing the exclusion of an intercept in the OLS step (by setting intercept = FALSE) can seriously bias results: only do this if you are sure that y = 0 whenever x_i = 0 for all i, and that all relationships are perfectly linear.
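To see why dropping the intercept can bias results, consider a response with nonzero mean and no true dependence on the predictor: the no-intercept fit manufactures a spurious slope. A minimal numpy illustration (not package code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 0.5, 200)        # predictor not centred at zero
y = 5.0 + rng.normal(0, 0.1, 200)    # flat response with nonzero mean

# With an intercept the estimated slope is close to the true value, 0
X1 = np.column_stack([np.ones_like(x), x])
b_with, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Without an intercept the slope must absorb the mean of y,
# producing a large spurious "effect" of x
b_without, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
```

Any significance test on the no-intercept slope would then confidently report an association that does not exist.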


Value

A vector of p-values, where the ith entry corresponds to the p-value for the predictor defined by the ith column of X.


Author(s)

Kieran Campbell [email protected]


References

Meinshausen, Nicolai, Lukas Meier, and Peter Bühlmann. "P-values for high-dimensional regression." Journal of the American Statistical Association 104.488 (2009).


Examples

## Not run: 
X <- matrix(rnorm(300), ncol = 3)
colnames(X) <- paste("predictor", 1:3, sep = "")
y <- rnorm(100, X[,1] - 2*X[,3], 1)
pvals <- LassoSig(y, X, B = 50, s = "lambda.min")

## Output:
## predictor1   predictor2   predictor3
## 4.567217e-07 1.000000e+00 1.187136e-17

## End(Not run)

kieranrcampbell/SpatialStats documentation built on May 18, 2017, 7:44 p.m.