Provides p-values for lasso regression. This method implements multi-sample splitting for significance testing in a high-dimensional regression context. The basic idea is to split the sample in two, perform variable selection using the lasso on one half, and derive p-values using ordinary least squares (OLS) on the other half.
Note that the penalisation parameter λ and the argument s are identical; they are named this way for consistency with the glmnet package.
This method is only implemented for a single response variable, since general lasso regression requires the same set of parameters to be selected for every response variable, which is overly restrictive in some cases.
B: Number of times to partition the sample.
s: The value of lambda to use in the lasso; can be "lambda.min" or "lambda.1se" (see Details).
include: Set of predictors to be force-included in the OLS analysis.
γ_min: Lower bound for γ in the adaptive search for the best p-value (default 0.05).
fixedP: The fixed number of parameters to select on each split, if a fixed-size selection is desired.
nfolds: Number of folds in the glmnet n-fold cross-validation.
intercept: Whether to include an intercept in the OLS regression (default TRUE).
The method works by randomly partitioning the dataset into two halves. Lasso regression is performed on one half and, for a particular value of the penalisation parameter λ, a subset of the predictor variables is selected. Ordinary least squares regression is then performed on the other half of the data, restricted to the selected predictors. If S variables are selected on a given split, the resulting p-values are Bonferroni moderated by that count (p becomes min(1, S·p)), and the p-values of all variables not selected on that split are set to 1. This process is repeated B times, generating B sets of p-values, which are then aggregated across splits to give a single p-value for each predictor variable. For full details see the original paper. The aggregation requires an extra parameter γ_min, which is recommended to be 0.05 and set by the corresponding argument.
Care must be taken with regard to the number of measurements (N) and the number of folds for cross-validation (nfolds). The glmnet package requires at least 3 samples in each cross-validation fold when finding the optimal λ (this cross-validation split is not the same as a multi-sample split). Therefore, if we start with N samples, glmnet receives at least floor(N/2) of them, which it then divides into nfolds folds for cross-validation. As such we necessarily need

floor(floor(N/2) / nfolds) >= 3

which is safely satisfied provided N > 6*nfolds + 3.
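The floor arithmetic behind this bound can be checked numerically; a quick sketch (purely illustrative):

```python
# Check that N > 6*nfolds + 3 guarantees at least 3 samples per CV fold
# after the data are first halved (integer division mirrors the floors above).
def samples_per_fold(N, nfolds):
    return (N // 2) // nfolds

for nfolds in (3, 5, 10):
    N = 6 * nfolds + 4            # smallest N satisfying N > 6*nfolds + 3
    assert samples_per_fold(N, nfolds) >= 3
```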
For the number of splits B, there is a trade-off between bias and efficiency. A larger B leads to a less biased result (i.e. one less sensitive to the random sampling of the splits) but can require significantly more computation time. A heuristically 'good' value balances these two considerations.
The choice of λ = s is detailed in the glmnet package. For standard problems the best choice may be "lambda.min", though if you are specifically trying to minimise the number of selected parameters, "lambda.1se" (a one, not an L) is a good choice.
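The two choices can be illustrated in Python with scikit-learn. glmnet reports lambda.min and lambda.1se directly; here the one-standard-error rule is reconstructed by hand from the cross-validation error path, so this is a sketch rather than the package's computation.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Illustrative comparison of the lambda.min / lambda.1se choices.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=120)

fit = LassoCV(cv=5, random_state=0).fit(X, y)
mean_mse = fit.mse_path_.mean(axis=1)            # CV error at each lambda
se_mse = fit.mse_path_.std(axis=1) / np.sqrt(fit.mse_path_.shape[1])

i_min = int(mean_mse.argmin())
lambda_min = fit.alphas_[i_min]                  # analogue of glmnet's lambda.min
# largest lambda with CV error within one SE of the minimum -> sparser model
lambda_1se = fit.alphas_[mean_mse <= mean_mse[i_min] + se_mse[i_min]].max()
```

Because lambda.1se is always at least as large as lambda.min, it yields an equal or smaller set of selected variables.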
Alternatively, it may be advantageous to select a fixed number of parameters on every split. This can be performed by setting fixedP to the desired number of parameters.
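The idea behind a fixed-size selection can be sketched by walking the lasso regularisation path and stopping at the point where exactly k coefficients are non-zero. This is a Python illustration with scikit-learn's `lasso_path`; `select_fixed_k` is a hypothetical helper, not the package's implementation.

```python
import numpy as np
from sklearn.linear_model import lasso_path

def select_fixed_k(X, y, k):
    """Return the indices of the first k variables to enter the lasso path."""
    alphas, coefs, _ = lasso_path(X, y)          # coefs has shape (p, n_alphas)
    nnz = (coefs != 0).sum(axis=0)               # non-zero count at each alpha
    hits = np.flatnonzero(nnz == k)
    if hits.size == 0:
        raise ValueError(f"no point on the path selects exactly {k} variables")
    return np.flatnonzero(coefs[:, hits[0]])

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)
```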
Occasionally it is necessary to force the inclusion of predictors into the OLS significance testing.
These can be included by setting
include to the numeric indices (i.e. the column numbers)
of the predictors to force-include.
Note that forcing the exclusion of an intercept in the OLS (by setting intercept = FALSE) can seriously bias the results. Only do this if you are sure that y = 0 when x_i = 0 for all i, and that all relationships are perfectly linear.
A vector of p-values, where the ith entry corresponds to the p-value for the predictor defined by the ith column of the predictor matrix.
Kieran Campbell [email protected]
Meinshausen, Nicolai, Lukas Meier, and Peter Bühlmann. "P-values for high-dimensional regression." Journal of the American Statistical Association 104.488 (2009).