mcsSubset: All-Subsets Regression

Description

All-subsets regression for ordinary linear models.

Usage

mcsSubset(object, ...)

## S3 method for class 'formula'
mcsSubset(formula, ..., lm = FALSE)

## S3 method for class 'lm'
mcsSubset(object, ..., penalty = 0)

## Default S3 method:
mcsSubset(object, y, include = NULL, exclude = NULL,
          size = NULL, penalty = 0, tolerance = 0, pradius = NULL, nbest = 1,
          ..., .algo = "hbba")

Arguments

formula, object

An object of class lm, formula or matrix.

y

The response variable.

include, exclude

Index vectors designating variables that are forced in or out of the model, respectively. The vectors may consist of (integer) indexes, (character) names, or (logical) bits selecting the desired columns. The integer indexes correspond to the position of the variables in the model matrix; the intercept, if any, has index 1. By default, all variables are included.

size

Vector of subset sizes (not counting the intercept, if any). By default, the best subsets are computed for each subset size (as determined by way of include and exclude). Ignored if penalty != 0.

penalty

Penalty per parameter (see AIC). If penalty == 0, determine subsets with lowest RSS for each subset size; otherwise, determine subset(s) with overall lowest AIC.

tolerance

If penalty == 0, a numeric vector (expanded if necessary), where tolerance[n] is the tolerance employed for subsets of size n; otherwise, a single value indicating the overall tolerance.

pradius

Preordering radius.

nbest

Number of best subsets to report.

...

Ignored.

lm

If TRUE, the lm component is computed and returned.

.algo

Internal use.

Details

The function mcsSubset computes all-subsets regressions for ordinary linear models. The function is generic and provides various methods to conveniently specify the regressor and response variables. The standard formula interface (see lm) can be used, or the information can be extracted from an already fitted lm object. The regressor matrix and response variable can also be passed in directly.
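
For illustration, the three interfaces can be used with the AirPollution data shipped with the package (a minimal sketch; the last call assumes the default method accepts a model matrix whose first column is the intercept):

data("AirPollution", package = "mcsSubset")

## formula interface
xs <- mcsSubset(mortality ~ ., data = AirPollution)

## extract the problem from an already fitted lm object
fm <- lm(mortality ~ ., data = AirPollution)
xs <- mcsSubset(fm)

## default method: pass regressor matrix and response directly
x <- model.matrix(mortality ~ ., data = AirPollution)
y <- AirPollution$mortality
xs <- mcsSubset(x, y)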

By default (i.e. penalty == 0), the method computes the nbest best subset models for every subset size, where the "best" models are the models with the lowest residual sum of squares (RSS). The scope of the search can be limited to certain subset sizes by setting size. A tolerance vector (expanded if necessary) may be specified to speed up the algorithm, where tolerance[n] is the tolerance applied to subset models of size n.
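
As a sketch (same data as above), the scope and speed of the default RSS search can be controlled like this; the tolerance value shown is arbitrary:

## 3 best subsets for each of the sizes 2 to 5 only
xs <- mcsSubset(mortality ~ ., data = AirPollution, size = 2:5, nbest = 3)

## a single tolerance value is expanded over all subset sizes
xs <- mcsSubset(mortality ~ ., data = AirPollution, tolerance = 0.1)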

Alternatively (penalty > 0), the overall (i.e. over all sizes) nbest best subset models may be computed according to an information criterion of the AIC family. A single tolerance value may be specified to speed up the search.
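
For example (a sketch; a penalty of 2 per parameter corresponds to the classical AIC, and log(n) to a BIC-type criterion, cf. AIC):

## 5 overall best subsets by AIC
xs <- mcsSubset(mortality ~ ., data = AirPollution, penalty = 2, nbest = 5)

## BIC-type search: log(n) penalty per parameter
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                penalty = log(nrow(AirPollution)), nbest = 5)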

By way of include and exclude, variables may be forced into or out of the regression, respectively.

The function will preorder the variables to reduce execution time if pradius > 0. Good execution times are usually attained for approximately pradius = n/3 (the default value), where n is the number of regressors after evaluating include and exclude.
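
For instance (a sketch; the value chosen is arbitrary and merely illustrates setting the radius explicitly):

## explicit preordering radius
xs <- mcsSubset(mortality ~ ., data = AirPollution, pradius = 5)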

A set of standard extractor functions for fitted model objects is available for objects of class "mcsSubset". See methods for more details.

Value

An object of class "mcsSubset", i.e. a list with the following components:

weights

Weights.

offset

Offset.

nobs

Number of observations.

nvar

Number of variables (not including intercept, if any).

x.names

Names of all design variables.

y.name

Name of response variable.

include

Indexes of variables forced in.

exclude

Indexes of variables forced out.

intercept

TRUE if regression has an intercept term; FALSE otherwise.

penalty

AIC penalty.

nbest

Number of best subsets.

When penalty == 0:

size

Subset sizes.

tolerance

Tolerance vector.

rss

A two-dimensional numeric array of dimension nbest x nvar.

which

A three-dimensional logical array of dimension nvar x nbest x nvar.

The entry rss[i, n] corresponds to the RSS of the i-th best subset model of size n. The entry which[j, i, n] has value TRUE if the i-th best subset model of size n contains the j-th variable.
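
For example, the best subset of size 3 found in an RSS search can be inspected as follows (a sketch using only the components described above; it assumes subsets of size 3 were searched):

xs <- mcsSubset(mortality ~ ., data = AirPollution, nbest = 2)
xs$rss[1, 3]       ## RSS of the best subset model of size 3
xs$which[, 1, 3]   ## logical vector marking the variables in that model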

When penalty != 0:

tolerance

Tolerance value.

rss

A one-dimensional numeric array of length nbest.

aic

A one-dimensional numeric array of length nbest.

which

A two-dimensional logical array of dimension nvar x nbest.

The entries rss[i] and aic[i] correspond to the RSS and AIC of the i-th best subset model, respectively. The entry which[j, i] is TRUE if the i-th best subset model contains variable j.
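
Analogously, for a criterion-based search (a sketch; penalty = 2 is used here as a plain AIC penalty):

xs <- mcsSubset(mortality ~ ., data = AirPollution, penalty = 2, nbest = 5)
xs$aic[1]        ## AIC of the overall best subset model
xs$which[, 1]    ## logical vector marking its variables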

References

Hofmann, M., Gatu, C., and Kontoghiorghes, E. J. (2007). Efficient Algorithms for Computing the Best Subset Regression Models for Large-Scale Problems. Computational Statistics & Data Analysis, 52, 16–29.

Gatu, C. and Kontoghiorghes, E. J. (2006). Branch-and-Bound Algorithms for Computing the Best Subset Regression Models. Journal of Computational and Graphical Statistics, 15, 139–156.

See Also

summary, methods.

Examples

## load data (with logs for relative pollution potentials)
data("AirPollution", package = "mcsSubset")

#################
## basic usage ##
#################

## canonical example: fit best subsets
xs <- mcsSubset(mortality ~ ., data = AirPollution)

## visualize RSS
plot(xs)

## summarize
summary(xs)

## plot summary
plot(summary(xs))

## forced inclusion/exclusion of variables
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                include = "noncauc", exclude = "whitecollar")

## or equivalently
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                include = 10, exclude = 11)
summary(xs)

##########################
## find best BIC models ##
##########################

## find 10 best subset models
xs <- mcsSubset(mortality ~ ., data = AirPollution,
                penalty = "BIC", nbest = 10)

## summarize
summary(xs)

## visualize BIC and RSS
plot(summary(xs))
