selection: Selecting a subset of 'q' variables

Description

Main function for selecting the best subset of q variables. Note that the selection procedure can be used with lm, glm or gam functions.

Usage

selection(x, y, q, prevar = NULL, criterion = "deviance", method = "lm",
  family = "gaussian", seconds = FALSE, nmodels = 1, nfolds = 5,
  cluster = TRUE, ncores = NULL)

Arguments

x

A data frame containing all the covariates.

y

A vector with the response values.

q

An integer specifying the size of the subset of variables to be selected.

prevar

A vector containing the numbers of the variables in the best subset of size q-1. NULL by default.

criterion

The information criterion to be used. The default is the deviance ("deviance"). The other criteria provided are the coefficient of determination ("R2"), the residual variance ("variance"), the Akaike information criterion ("aic"), AIC with a correction for finite sample sizes ("aicc") and the Bayesian information criterion ("bic"). The deviance, coefficient of determination and variance are calculated by cross-validation.
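As an illustration of the three likelihood-based criteria, the following is a minimal sketch in base R on the built-in mtcars data. It uses the textbook small-sample correction AICc = AIC + 2k(k+1)/(n-k-1); whether selection() computes "aicc" in exactly this way is an assumption, not something this page states.

```r
# Illustrative linear model on a built-in data set (not part of FWDselect)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)                 # number of observations
k <- attr(logLik(fit), "df")   # estimated parameters: coefficients + error variance

aic  <- AIC(fit)               # -2 * logLik + 2 * k
bic  <- BIC(fit)               # -2 * logLik + log(n) * k
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)   # textbook correction (assumed form)

c(aic = aic, aicc = aicc, bic = bic)
```

For all three criteria, lower values indicate a preferable model; AICc approaches AIC as n grows relative to k.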

method

A character string specifying which regression method is used, i.e., linear models ("lm"), generalized linear models ("glm") or generalized additive models ("gam").

family

A description of the error distribution and link function to be used in the model: "gaussian", "binomial" or "poisson".

seconds

A logical value. FALSE by default. If TRUE, rather than returning only the single best model, the function returns a few of the best (equivalent) models.

nmodels

Number of secondary models to be returned.

nfolds

Number of folds for the cross-validation procedure, used with the deviance, R2 or variance criterion.
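To make the role of nfolds concrete, here is a minimal base R sketch of a k-fold cross-validated residual variance for a simple linear model. The fold assignment, the mtcars data and the use of the mean squared prediction error as the "variance" are assumptions made for illustration; the internal computation in selection() may differ.

```r
set.seed(1)                      # reproducible fold assignment
nfolds <- 5
n <- nrow(mtcars)

# Randomly assign each observation to one of nfolds roughly equal folds
fold <- sample(rep(seq_len(nfolds), length.out = n))

sq_err <- numeric(n)
for (k in seq_len(nfolds)) {
  test <- fold == k
  fit  <- lm(mpg ~ wt, data = mtcars[!test, ])       # fit on the other folds
  pred <- predict(fit, newdata = mtcars[test, ])     # predict the held-out fold
  sq_err[test] <- (mtcars$mpg[test] - pred)^2
}

cv_variance <- mean(sq_err)      # cross-validated mean squared prediction error
cv_variance
```

Each observation is predicted exactly once by a model that did not see it, so the resulting value is less optimistic than the in-sample residual variance.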

cluster

A logical value. If TRUE (default), the procedure is parallelized. Note that in some cases without enough repetitions (e.g., a low number of initial variables), R will perform better with serial computation. R takes time to distribute tasks across the processors, and it also needs time to bind the results together afterwards. Therefore, if the time needed to distribute the tasks and gather the pieces is greater than the time needed for single-thread computing, parallelizing is not worthwhile.

ncores

An integer value specifying the number of cores to be used in the parallelized procedure. If NULL (default), the number of cores used is one less than the number of cores on the machine.
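The default described above can be sketched with the base parallel package as follows. This is an assumed reconstruction of the rule "cores of the machine minus one", not the package's actual code; the guard for an undetectable core count and the lower bound of one core are additions made here for safety.

```r
library(parallel)

ncores <- NULL                   # mimics the default argument value

if (is.null(ncores)) {
  detected <- detectCores()      # may return NA on some platforms
  # Leave one core free, but never go below a single core
  ncores <- if (is.na(detected)) 1L else max(1L, detected - 1L)
}

ncores
```

On a single-core machine this resolves to 1, in which case setting cluster = FALSE avoids the overhead of setting up a cluster at all.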

Value

Best model

The best model. If seconds = TRUE, the best alternative models are also returned.

Variable name

Names of the selected variables.

Variable number

Numbers of the selected variables.

Information criterion

Information criterion used and its value.

Prediction

The prediction of the best model.

Author(s)

Marta Sestelo, Nora M. Villanueva and Javier Roca-Pardinas.

Examples

library(FWDselect)
data(diabetes)
x = diabetes[ ,2:11]
y = diabetes[ ,1]
obj1 = selection(x, y, q = 1, method = "lm", criterion = "variance", cluster = FALSE)
obj1

# secondary models: also return the 2 best alternative models
obj11 = selection(x, y, q = 1, method = "lm", criterion = "variance",
                  seconds = TRUE, nmodels = 2, cluster = FALSE)
obj11

# prevar argument: reuse the best subset of size 2 when selecting 3 variables
obj2 = selection(x, y, q = 2, method = "lm", criterion = "variance", cluster = FALSE)
obj2
obj3 = selection(x, y, q = 3, prevar = obj2$Variable_numbers,
                 method = "lm", criterion = "variance", cluster = FALSE)

FWDselect documentation built on May 2, 2019, 1:21 p.m.