gcv: Estimate EPE Using Delete-d Cross-Validation


Description

This is a general-purpose function for estimating the expected prediction error (EPE) under a specified cost function in regression and classification problems. For regression, the default cost function is the mean-square error; for classification, it is the misclassification rate. The package includes direct support for elastic penalty regression, LASSO, PCR, PLSR, nearest neighbour and Random Forest regression. For classification, built-in support functions are provided for LDA, QDA, Naive Bayes, kNN, CART, C5.0, Random Forest and SVM. The vignettes include examples for SCAD, MCP and best subset regression. Illustrative example datasets and data generation models are also provided.

Usage

gcv(X, y, MaxIter = 1000, d = ceiling(length(y)/10), NCores = 1,
  cost = mse,  yhat = yhat_lm, libs = character(0), seed = "default",
  ...)

Arguments

X

inputs, matrix or dataframe

y

output vector

MaxIter

Number of iterations of the CV procedure

d

Number of observations for the hold-out sample

NCores

Number of cores to use. The default, 1, does not use the parallel package. Otherwise, set it to the number of cores available. If unsure, experiment with a few values.

cost

Function that computes the average cost. See, for example, mse, mae and mape.

yhat

In general it must be a function with arguments dfTrain and dfTest. See examples below.

libs

Libraries required by the predictor.

seed

The default uses R's own seeding, which is based on the current time. Otherwise, set it to an integer value. See Details.

...

Additional arguments that are passed to yhat.
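
To make the roles of MaxIter, d, cost and yhat concrete, here is a minimal base-R sketch of delete-d cross-validation. It is not the package's implementation. The names mse_sketch, yhat_lm_sketch and delete_d_cv_sketch are hypothetical, and the sketch assumes that the cost function takes the observed and predicted values in that order and that the response is stored in a column named y.

```r
# Hypothetical helpers for illustration only; none of these are part of gencve.

# Cost function: assumed to take observed values first, predictions second.
mse_sketch <- function(y, yhat) mean((y - yhat)^2)

# Predictor with arguments dfTrain and dfTest.
# Assumes the response column is named "y" in both data frames.
yhat_lm_sketch <- function(dfTrain, dfTest) {
  fit <- lm(y ~ ., data = dfTrain)
  predict(fit, newdata = dfTest)
}

# Delete-d cross-validation: on each iteration, hold out d randomly chosen
# observations, fit on the rest, and record the cost on the held-out set.
delete_d_cv_sketch <- function(X, y, MaxIter = 100,
                               d = ceiling(length(y) / 10),
                               cost = mse_sketch, yhat = yhat_lm_sketch) {
  df <- data.frame(X, y = y)
  n <- nrow(df)
  costs <- numeric(MaxIter)
  for (i in seq_len(MaxIter)) {
    test_idx <- sample.int(n, d)
    dfTrain <- df[-test_idx, , drop = FALSE]
    dfTest <- df[test_idx, , drop = FALSE]
    costs[i] <- cost(dfTest$y, yhat(dfTrain, dfTest))
  }
  # Mean cost over the random splits and its spread across splits.
  # The spread is not necessarily how the package defines sd_epe.
  c(epe = mean(costs), sd_across_splits = sd(costs))
}

# Simulated data: three inputs, of which the third is irrelevant.
set.seed(1)
n <- 60
X <- matrix(rnorm(n * 3), ncol = 3)
y <- drop(X %*% c(1, 0.5, 0)) + rnorm(n)
res <- delete_d_cv_sketch(X, y, MaxIter = 50, d = 6)
print(res)
```

Swapping in another predictor only requires a function with the same two arguments that returns predictions for the rows of dfTest.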

Details

If only serial evaluation were implemented, I would have used set.seed to control the random number generator. I have included seed as an argument because it can also be used to set the seed of the parallel random number generator, which is sometimes useful for replicating simulations. If the argument seed is supplied, it also sets the seed when only serial computation is done.
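
As an illustration of why a seed for the parallel generator is useful, the following sketch seeds the workers of a small cluster with clusterSetRNGStream from the parallel package and checks that two runs give the same draws. Whether gcv uses this exact mechanism internally is an assumption. The helper draw_on_cluster is hypothetical and not part of gencve.

```r
library(parallel)

# Hypothetical helper, not part of gencve: seed the RNG streams of the
# workers, then draw one uniform random number per task.
draw_on_cluster <- function(seed, n_workers = 2, n_tasks = 4) {
  cl <- makeCluster(n_workers)
  on.exit(stopCluster(cl))
  # Gives each worker its own reproducible L'Ecuyer-CMRG stream.
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, seq_len(n_tasks), function(i) runif(1))
}

a <- draw_on_cluster(123)
b <- draw_on_cluster(123)
same_draws <- identical(a, b)
print(same_draws)
```

Calling set.seed in the master session alone would not control the draws made on the workers, which is why the seed has to be passed to the cluster.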

Value

Matrix with one row and four columns: epe, sd_epe, snr, pcorr. These are respectively the estimated EPE, standard deviation of this estimate, an estimate of the snr (signal-to-noise ratio) out-of-sample and an out-of-sample estimate of the correlation between the prediction and the true value.

Note

When the argument outAllQ is set to TRUE, the distribution of the EPE estimates is often positively skewed. This may be of interest in applications.

Author(s)

A. I. McLeod

References

ESL

See Also

mse, mae, mape, misclassificationrate, logloss, yhat_lm, yhat_nn, yhat_lars, yhat_plus, yhat_gel, yhat_step, yh_lda, yh_qda, yh_svm, yh_NB, yh_RF, yh_CART, yh_C50, yh_kNN, featureSelect, cv.glm

Examples

# Simple example; in general, MaxIter >= 1000 is recommended.
library(gencve)
Xy <- ShaoReg()
gcv(Xy[,1:8], Xy[,9], MaxIter=25, d=5)

gencve documentation built on May 2, 2019, 6:08 a.m.