choicemod: Choice of the regression model


Description

This function estimates the mean squared error (MSE) of parametric, semi-parametric or nonparametric regression models (possibly including covariate selection) using a repeated learning/test samples approach. The models are estimated with different methods (chosen by the user) for comparison purposes. The following methods (with and without variable selection) are available: multiple linear regression (linreg), sliced inverse regression combined with kernel regression (sir), random forest regression (rf), principal components regression (pcr), partial least squares regression (plsr), and ridge regression (ridge). The covariate selection procedure is the same for all estimation methods and is based on variable importance (VI) obtained via repeated random permutations of the covariates.

Usage

choicemod(X, Y, method = c("linreg", "sir", "rf"), N = 20,
  prop_train = 0.8, nperm = 50, cutoff = TRUE, nbsel = NULL)

Arguments

X

a numerical matrix containing the p variables in the model.

Y

a numerical response vector.

method

a vector with the names of the chosen regression methods ("linreg", "sir", "rf", "pcr", "plsr", "ridge").

N

the number of replications (the number of random learning/test samples) used to estimate the MSE values.

prop_train

a value between 0 and 1 giving the proportion of observations in the training samples.

nperm

the number of random permutations used to compute the variable importance (VI) of the covariates.

cutoff

if TRUE, the covariates are selected automatically and the number of selected variables is not fixed in advance. If cutoff=FALSE, the nbsel best variables are selected.

nbsel

the number of selected covariates. Used only if cutoff=FALSE.
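The permutation-based variable importance underlying the selection can be sketched in base R as follows. This is an illustrative toy example with lm(), not the package's internal implementation: VI is measured as the average increase in prediction error when one covariate is randomly permuted.

```r
# Hedged sketch of permutation-based variable importance (VI), not
# modvarsel's internal code: permuting an informative covariate should
# increase the prediction error more than permuting a noise covariate.
set.seed(2)
n <- 100; p <- 4
X <- matrix(rnorm(n * p), n, p)
Y <- 3 * X[, 1] + rnorm(n)          # only X1 is informative
dat <- data.frame(Y = Y, X)
fit <- lm(Y ~ ., data = dat)
base_mse <- mean(residuals(fit)^2)

nperm <- 50
vi <- sapply(seq_len(p), function(j) {
  mean(replicate(nperm, {
    Xperm <- dat
    Xperm[, j + 1] <- sample(Xperm[, j + 1])  # permute covariate j
    mean((dat$Y - predict(fit, newdata = Xperm))^2) - base_mse
  }))
})
which.max(vi)  # the informative covariate X1 should rank first
```

Covariates with a VI below a data-driven cutoff (cutoff=TRUE) or outside the nbsel best (cutoff=FALSE) are then dropped from the reduced model.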

Details

The only method with no parameter to tune is "linreg". The parameters of the sir, pcr, plsr and ridge methods are tuned on the training samples. The bandwidth of the kernel regression smoother is tuned by leave-one-out cross-validation. The number of components for pcr and plsr is tuned as follows: for each possible number of components, the root mean squared error (RMSE) is computed via 5-fold cross-validation, and the number of components is selected by detecting a change-point position (in mean and variance). For random forest regression, the parameter mtry is not tuned and is fixed to p/3, and the number of trees is fixed to ntree=300.
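The repeated learning/test scheme used to estimate the MSE can be sketched as follows. This is a minimal base-R illustration (again with lm() standing in for "linreg"), not the package's internal code; the data and the values of N and prop_train are made up for the example.

```r
# Illustrative sketch of the repeated learning/test MSE estimation:
# draw N random training samples, fit on each, and compute the MSE
# on the corresponding test sample.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + 2 * X[, 2] + rnorm(n)
dat <- data.frame(Y = Y, X)

N <- 20; prop_train <- 0.8
mse <- numeric(N)
for (i in seq_len(N)) {
  train <- sample(n, floor(prop_train * n))   # random learning sample
  fit <- lm(Y ~ ., data = dat[train, ])
  pred <- predict(fit, newdata = dat[-train, ])
  mse[i] <- mean((dat$Y[-train] - pred)^2)    # MSE on the test sample
}
mean(mse)  # averaged test MSE over the N replications
```

In the package, this loop is run for each chosen method, once on the complete model (mse_all) and once on the reduced model after covariate selection (mse).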

Value

An object with S3 class "choicemod" and the following components:

mse

a matrix of dimension N times length(method) with the values of MSE calculated with the N test samples (in rows) and each regression method (in columns), estimating the reduced models (with covariate selection) on the training samples.

mse_all

a matrix of dimension N times length(method) with the values of MSE calculated with the N test samples (in rows) and each regression method (in columns), estimating the complete models (no covariate selection) on the training samples.

sizemod

a matrix of dimension N times length(method) with the number of covariates selected in the reduced model for each replication (in rows) and each regression method (in columns).

pvarsel

a matrix of dimension p times length(method) with the selection frequencies (in percent) of each covariate (in rows) for each regression method (in columns).

See Also

boxplot.choicemod, barplot.choicemod, varimportance

Examples

data(simus)
X <- simus$X
Y <- simus$Y1
# res <- choicemod(X, Y, method = c("linreg", "sir"), N = 50, nperm = 100)
# The computation time is a bit long, so the results have been stored:
res <- simus$res1
boxplot(res)

chavent/modvarsel documentation built on May 22, 2019, 2:22 p.m.