mcsSubset: All-Subsets Regression

Description

All-subsets regression for ordinary linear models.

Usage

mcsSubset(object, ...)

## S3 method for class 'formula'
mcsSubset(formula, ..., lm = FALSE)

## S3 method for class 'lm'
mcsSubset(object, ..., penalty = 0)

## Default S3 method:
mcsSubset(object, y, include = NULL, exclude = NULL,
          size = NULL, penalty = 0, tolerance = 0, pradius = NULL, nbest = 1,
          ..., .algo = "hbba")

Arguments

formula, object

An object of class lm, formula or matrix.

y

The response variable.

include, exclude

Index vectors designating variables that are forced in or out of the model, respectively. The vectors may consist of (integer) indexes, (character) names, or (logical) bits selecting the desired columns. The integer indexes correspond to the position of the variables in the model matrix; the intercept, if any, has index 1. By default, all variables are included.

size

Vector of subset sizes (not counting the intercept, if any). By default, the best subsets are computed for each subset size (as determined by way of include and exclude). Ignored if penalty != 0.

penalty

Penalty per parameter (see AIC). If penalty == 0, determine subsets with lowest RSS for each subset size; otherwise, determine subset(s) with overall lowest AIC.

tolerance

If penalty == 0, a numeric vector (expanded if necessary), where tolerance[n] is the tolerance employed for subsets of size n; otherwise, a single value indicating the overall tolerance.

pradius

Preordering radius.

nbest

Number of best subsets to report.

...

Ignored.

lm

If TRUE, the lm component is computed and returned.

.algo

Internal use.

Details

The function mcsSubset computes all-subsets regressions for ordinary linear models. The function is generic and provides various methods to conveniently specify the regressor and response variables. The standard formula interface (see lm) can be used, or the information can be extracted from an already fitted lm object. The regressor matrix and response variable can also be passed in directly.
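
For illustration, the three interfaces can be used with the AirPollution data shipped with the package (a minimal sketch; the last call assumes the default method accepts a model matrix whose first column is the intercept):

data("AirPollution", package = "mcsSubset")

## formula interface
xs <- mcsSubset(mortality ~ ., data = AirPollution)

## extract the problem from an already fitted lm object
fm <- lm(mortality ~ ., data = AirPollution)
xs <- mcsSubset(fm)

## default method: pass regressor matrix and response directly
x <- model.matrix(mortality ~ ., data = AirPollution)
y <- AirPollution$mortality
xs <- mcsSubset(x, y)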

By default (i.e. penalty == 0), the method computes the nbest best subset models for every subset size, where the "best" models are the models with the lowest residual sum of squares (RSS). The scope of the search can be limited to certain subset sizes by setting size. A tolerance vector (expanded if necessary) may be specified to speed up the algorithm, where tolerance[n] is the tolerance applied to subset models of size n.
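
As a sketch (same data as above), the scope and speed of the default RSS search can be controlled like this; the tolerance value shown is arbitrary:

## 3 best subsets for each of the sizes 2 to 5 only
xs <- mcsSubset(mortality ~ ., data = AirPollution, size = 2:5, nbest = 3)

## a single tolerance value is expanded over all subset sizes
xs <- mcsSubset(mortality ~ ., data = AirPollution, tolerance = 0.1)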

Alternatively (penalty > 0), the overall (i.e. over all sizes) nbest best subset models may be computed according to an information criterion of the AIC family. A single tolerance value may be specified to speed up the search.
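
For example (a sketch; a penalty of 2 per parameter corresponds to the classical AIC, and log(n) to a BIC-type criterion, cf. AIC):

## 5 overall best subsets by AIC
xs <- mcsSubset(mortality ~ ., data = AirPollution, penalty = 2, nbest = 5)

## BIC-type search: log(n) penalty per parameter
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                penalty = log(nrow(AirPollution)), nbest = 5)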

By way of include and exclude, variables may be forced into or out of the regression, respectively.

The function will preorder the variables to reduce execution time if pradius > 0. Good execution times are usually attained for approximately pradius = n/3 (the default value), where n is the number of regressors after evaluating include and exclude.
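
For instance (a sketch; the value chosen is arbitrary and merely illustrates setting the radius explicitly):

## explicit preordering radius
xs <- mcsSubset(mortality ~ ., data = AirPollution, pradius = 5)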

A set of standard extractor functions for fitted model objects is available for objects of class "mcsSubset". See methods for more details.

Value

An object of class "mcsSubset", i.e. a list with the following components:

weights

Weights.

offset

Offset.

nobs

Number of observations.

nvar

Number of variables (not including intercept, if any).

x.names

Names of all design variables.

y.name

Name of response variable.

include

Indexes of variables forced in.

exclude

Indexes of variables forced out.

intercept

TRUE if regression has an intercept term; FALSE otherwise.

penalty

AIC penalty.

nbest

Number of best subsets.

When penalty == 0:

size

Subset sizes.

tolerance

Tolerance vector.

rss

A two-dimensional numeric array of dimension nbest x nvar.

which

A three-dimensional logical array of dimension nvar x nbest x nvar.

The entry rss[i, n] corresponds to the RSS of the i-th best subset model of size n. The entry which[j, i, n] has value TRUE if the i-th best subset model of size n contains the j-th variable.
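
For example, the best subset of size 3 found in an RSS search can be inspected as follows (a sketch using only the components described above; it assumes subsets of size 3 were searched):

xs <- mcsSubset(mortality ~ ., data = AirPollution, nbest = 2)
xs$rss[1, 3]       ## RSS of the best subset model of size 3
xs$which[, 1, 3]   ## logical vector marking the variables in that model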

When penalty != 0:

tolerance

Tolerance value.

rss

A one-dimensional numeric array of length nbest.

aic

A one-dimensional numeric array of length nbest.

which

A two-dimensional logical array of dimension nvar x nbest.

The entries rss[i] and aic[i] correspond to the RSS and AIC of the i-th best subset model, respectively. The entry which[j, i] is TRUE if the i-th best subset model contains variable j.
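
Analogously, for a criterion-based search (a sketch; penalty = 2 is used here as a plain AIC penalty):

xs <- mcsSubset(mortality ~ ., data = AirPollution, penalty = 2, nbest = 5)
xs$aic[1]        ## AIC of the overall best subset model
xs$which[, 1]    ## logical vector marking its variables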

References

Hofmann, M., Gatu, C., and Kontoghiorghes, E. J. (2007). Efficient Algorithms for Computing the Best Subset Regression Models for Large-Scale Problems. Computational Statistics & Data Analysis, 52, 16–29.

Gatu, C. and Kontoghiorghes, E. J. (2006). Branch-and-Bound Algorithms for Computing the Best Subset Regression Models. Journal of Computational and Graphical Statistics, 15, 139–156.

See Also

summary, methods.

Examples

## load data (with logs for relative pollution potentials)
data("AirPollution", package = "mcsSubset")

#################
## basic usage ##
#################

## canonical example: fit best subsets
xs <- mcsSubset(mortality ~ ., data = AirPollution)

## visualize RSS
plot(xs)

## summarize
summary(xs)

## plot summary
plot(summary(xs))

## forced inclusion/exclusion of variables
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                include = "noncauc", exclude = "whitecollar")

## or equivalently
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                include = 10, exclude = 11)
summary(xs)

##########################
## find best BIC models ##
##########################

## find 10 best subset models
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                penalty = "BIC", nbest = 10)

## summarize
summary(xs)

## visualize BIC and RSS
plot(summary(xs))
