bestsetNoise: Best Subset Selection Applied to Noise

bestsetNoise    R Documentation

Best Subset Selection Applied to Noise

Description

Best subset selection applied to completely random noise. These functions demonstrate how variable selection techniques in regression can readily include explanatory variables that are indistinguishable from noise.

Usage

bestsetNoise(m = 100, n = 40, method = "exhaustive", nvmax = 3,
              X = NULL, y=NULL, intercept=TRUE,
              print.summary = TRUE, really.big = FALSE, ...)

bestset.noise(m = 100, n = 40, method = "exhaustive", nvmax = 3,
              X = NULL, y=NULL, intercept=TRUE,
              print.summary = TRUE, really.big = FALSE, ...)

bsnCV(m = 100, n = 40, method = "exhaustive", nvmax = 3,
              X = NULL, y=NULL, intercept=TRUE, nfolds = 2,
              print.summary = TRUE, really.big = FALSE)

bsnOpt(X = matrix(rnorm(25 * 10), ncol = 10), y = NULL, method = "exhaustive",
              nvmax = NULL, nbest = 1, intercept = TRUE, criterion = "cp",
              tcrit = NULL, print.summary = TRUE, really.big = FALSE,
         ...)

bsnVaryNvar(m = 100, nvar = nvmax:50, nvmax = 3, method = "exhaustive",
              intercept=TRUE,
              plotit = TRUE, xlab = "# of variables from which to select",
              ylab = "p-values for t-statistics", main = paste("Select 'best'",
                                                  nvmax, "variables"),
              details = FALSE, really.big = TRUE, smooth = TRUE, ...)

Arguments

m

the number of observations to be simulated, ignored if X is supplied.

n

the number of predictor variables in the simulated model, ignored if X is supplied.

method

Search method: "exhaustive" (exhaustive search), "backward" (backward selection), "forward" (forward selection), or "seqrep" (sequential replacement).

nvmax

Number of explanatory variables in the model (for bsnOpt, the maximum number).

X

Use columns from this matrix. Alternatively, X may be a data frame, in which case a model matrix will be formed from it. If not NULL, m and n are ignored.

y

If not supplied, random normal noise will be generated.

nbest

Number of models, for each choice of number of columns of explanatory variables, to return (bsnOpt). If tcrit is non-NULL, it may be important to set this greater than one, in order to have a good chance of finding models with minimum absolute t-statistic greater than tcrit.

intercept

Should an intercept be added?

nvar

Range of numbers of candidate variables (bsnVaryNvar).

nfolds

The number of folds into which the data are split to form training and test sets (bsnCV).

criterion

Criterion to use in choosing between models with different numbers of explanatory variables (bsnOpt). The default is “cp”; alternatives are “bic” and “adjr2”.

tcrit

Consider only those models for which the minimum absolute t-statistic is greater than tcrit.

print.summary

Should summary information be printed?

plotit

Plot a graph? (bsnVaryNvar)

xlab

x-label for graph (bsnVaryNvar)

ylab

y-label for graph (bsnVaryNvar)

main

main title for graph (bsnVaryNvar)

details

Return detailed output list? (bsnVaryNvar)

really.big

Set to TRUE to allow (currently) for more than 50 explanatory variables.

smooth

Fit smooth to graph? (bsnVaryNvar)

...

Additional arguments, to be passed through to regsubsets().

Details

If X is not supplied, and in any case for bsnVaryNvar, a set of n predictor variables is simulated as independent standard normal, i.e. N(0,1), variates. Additionally, a N(0,1) response variable is simulated. The function bsnOpt selects the ‘best’ model with nvmax or fewer explanatory variables, where the argument criterion specifies the criterion that will be used to choose between models with different numbers of explanatory columns. The other functions select the ‘best’ model with exactly nvmax explanatory columns. In every case, the selection is made using the regsubsets() function from the leaps package. (The leaps package must be installed for these functions to work.)
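The simulate-then-select step described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it replaces regsubsets() with a brute-force search over combn(), uses a smaller n than the default to keep that search cheap, and the seed and variable names are arbitrary.

```r
set.seed(1)                      # arbitrary seed, for reproducibility
m <- 100; n <- 10; nvmax <- 3    # n is smaller than the default 40 (assumption)
X <- matrix(rnorm(m * n), ncol = n,
            dimnames = list(NULL, paste0("V", 1:n)))
y <- rnorm(m)                    # the response is pure noise, unrelated to X

# Brute-force exhaustive search: residual sum of squares for every
# subset of nvmax columns (a stand-in for leaps::regsubsets)
subsets <- combn(n, nvmax)
rss <- apply(subsets, 2, function(cols) {
  sum(lm.fit(cbind(1, X[, cols]), y)$residuals^2)
})
best <- subsets[, which.min(rss)]

# Refit the selected model; these p-values take no account of the selection
fit <- lm(y ~ ., data = data.frame(y = y, X[, best, drop = FALSE]))
pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
print(pvals)
```

Because the three columns are chosen to minimise the residual sum of squares, their p-values will tend to be smaller than those for three columns fixed in advance, even though every column is noise.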

The function bsnCV splits the data (randomly) into nfolds (2 or more) parts. Each part is set aside in turn and used to fit the model (effectively, as test data), while the remaining data are used to select the variables that appear in that fit. One model fit is returned for each of the nfolds parts.
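The fold logic can be sketched as follows. This is a rough base-R illustration, not the code of bsnCV: the helper select_best is hypothetical, and it uses a brute-force search in place of regsubsets().

```r
set.seed(2)                      # arbitrary seed
m <- 60; n <- 8; nvmax <- 3; nfolds <- 2
X <- matrix(rnorm(m * n), ncol = n,
            dimnames = list(NULL, paste0("V", 1:n)))
y <- rnorm(m)
fold <- sample(rep(seq_len(nfolds), length.out = m))  # random fold labels

# Hypothetical helper: the nvmax columns giving the smallest RSS
select_best <- function(X, y, nvmax) {
  subsets <- combn(ncol(X), nvmax)
  rss <- apply(subsets, 2, function(cols)
    sum(lm.fit(cbind(1, X[, cols]), y)$residuals^2))
  subsets[, which.min(rss)]
}

fits <- lapply(seq_len(nfolds), function(k) {
  held_out <- fold == k
  # Select variables using the data that are NOT set aside ...
  cols <- select_best(X[!held_out, ], y[!held_out], nvmax)
  # ... then fit the chosen model to the part that was set aside
  dat <- data.frame(y = y[held_out], X[held_out, cols, drop = FALSE])
  lm(y ~ ., data = dat)
})
lapply(fits, function(f) round(summary(f)$coefficients, 3))
```

Since selection and fitting use different observations, the p-values from these fits should behave much more like those for variables chosen in advance.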

The function bsnVaryNvar makes repeated calls to bestsetNoise.
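A minimal sketch of that repeated-call pattern, assuming a brute-force search in place of the call to bestsetNoise and a narrower nvar range than the default so that it runs quickly:

```r
set.seed(4)                      # arbitrary seed
m <- 100; nvmax <- 3
nvar <- nvmax:10                 # narrower than the default nvmax:50 (assumption)

# One row of p-values per number of candidate variables
pval <- t(sapply(nvar, function(n) {
  X <- matrix(rnorm(m * n), ncol = n,
              dimnames = list(NULL, paste0("V", 1:n)))
  y <- rnorm(m)
  subsets <- combn(n, nvmax)
  rss <- apply(subsets, 2, function(cols)
    sum(lm.fit(cbind(1, X[, cols]), y)$residuals^2))
  cols <- subsets[, which.min(rss)]
  fit <- lm(y ~ ., data = data.frame(y = y, X[, cols, drop = FALSE]))
  unname(summary(fit)$coefficients[-1, "Pr(>|t|)"])
}))
rownames(pval) <- nvar
pval
```

Plotting these p-values against nvar would give a picture similar in spirit to the graph that plotit = TRUE requests.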

Value

bestsetNoise returns the lm model object for the "best" model with nvmax explanatory columns.

bsnCV returns as many models as there are folds.

bsnVaryNvar silently returns either (details=FALSE) a matrix that has p-values of the coefficients for the ‘best’ choice of model for each different number of candidate variables, or (details=TRUE) a list with elements:

coef

A matrix of sets of regression coefficients

SE

A matrix of standard errors

pval

A matrix of p-values

Matrices have one row for each choice of nvar. The statistics returned are for the ‘best’ model with nvmax explanatory variables.

bsnOpt silently returns a list with elements:

u1

‘best’ model (lm object) with nvmax or fewer columns of predictors. If tcrit is non-NULL, and there is no model for which all coefficients have t-statistics greater than tcrit in absolute value, u1 will be NULL.

tcrit

For each model, the minimum of the absolute values of the t-statistics.

regsubsets_obj

The object returned by the call to regsubsets.

Note

These functions are primarily designed to demonstrate the biases that can be expected, relative to theoretical estimates of standard errors of parameters and other fitted model statistics, when there is prior selection of the columns that are to be included in the model. With the exception of bsnVaryNvar, they can also be used with an X and y for actual data. In that case, the p-values should be compared with those obtained from repeated use of the function where y is random noise, as a check on the extent of selection effects.
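The comparison suggested here might look like the following sketch. Everything in it is an assumption made for illustration: the "actual" data are themselves simulated, the helper max_pval_best is hypothetical, the largest p-value in the selected model is just one possible summary, and a brute-force search stands in for regsubsets().

```r
set.seed(3)                      # arbitrary seed
m <- 50; n <- 8; nvmax <- 3
X <- matrix(rnorm(m * n), ncol = n,
            dimnames = list(NULL, paste0("V", 1:n)))
y <- 0.5 * X[, 1] + rnorm(m)     # stand-in for an observed response (assumption)

# Hypothetical helper: largest p-value in the 'best' nvmax-variable model
max_pval_best <- function(X, y, nvmax) {
  subsets <- combn(ncol(X), nvmax)
  rss <- apply(subsets, 2, function(cols)
    sum(lm.fit(cbind(1, X[, cols]), y)$residuals^2))
  cols <- subsets[, which.min(rss)]
  fit <- lm(y ~ ., data = data.frame(y = y, X[, cols, drop = FALSE]))
  max(summary(fit)$coefficients[-1, "Pr(>|t|)"])
}

observed <- max_pval_best(X, y, nvmax)
# Reference distribution: same X, but y replaced by fresh random noise
noise <- replicate(200, max_pval_best(X, rnorm(m), nvmax))

# Proportion of pure-noise runs that look at least as 'significant'
mean(noise <= observed)
```

A large proportion would suggest that the p-values from the selected model are no more impressive than selection from noise alone can produce.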

Author(s)

J.H. Maindonald

See Also

lm

Examples

## Run the examples only if the leaps package is available
if (require(leaps, quietly = TRUE)) {
  ## 'best' 3-variable regression for 20 simulated observations
  ## on 7 unrelated variables (including the response)
  bestsetNoise(20, 6)
  ## 'best' 3-variable regressions (one for each fold) for 20
  ## simulated observations on 7 unrelated variables
  ## (including the response)
  bsnCV(20, 6)
  bsnVaryNvar(m = 50, nvar = 3:6, nvmax = 3, method = "exhaustive",
              plotit = FALSE, details = TRUE)
  bsnOpt()
}

DAAG documentation built on May 29, 2024, 9:13 a.m.