holdout: Computes indexes for holdout data split into training and...

View source: R/estimate.R

holdoutR Documentation

Computes indexes for holdout data split into training and test sets.

Description

Computes indexes for holdout data split into training and test sets.

Usage

holdout(y, ratio = 2/3, internalsplit = FALSE, mode = "stratified", iter = 1, 
               seed = NULL, window=10, increment=1)

Arguments

y

desired target: numeric vector; or factor – then a stratified holdout is applied (i.e. the proportions of the classes are the same for each set).

ratio

split ratio (in percentage – sets the training set size; or in total number of examples – sets the test set size).

internalsplit

if TRUE then the training data is further split into training and validation sets. The same ratio parameter is used for the internal split.

mode

sampling mode. Options are:

  • stratified – stratified randomized holdout if y is a factor; else it behaves as standard randomized holdout;

  • random – standard randomized holdout;

  • order – static mode, where the first examples are used for training and the later ones for testing (useful for time series data);

  • rolling – rolling window, also known as sliding window (e.g. useful for stock market prediction), similar to order except that window is the window size, iter is the rolling iteration and increment is the number of samples slided at each iteration. In each iteration, the training set size is fixed to window, while the test set size is equal to ratio except for the last iteration (where it may be smaller).

  • incremental – incremental retraining mode, also known as growing windows, similar to order except that window is the initial window size, iter is the incremental iteration and increment is the number of samples added at each iteration. In each iteration, the training set size grows (+increment), while the test set size is equal to ratio except for the last iteration (where it may be smaller).

iter

iteration of the incremental retraining mode (only used when mode="rolling" or "incremental", typically iter is set within a cycle, see the example below).

seed

if NULL then no seed is used and the current R randomness is assumed; else a fixed seed is adopted to generate local random sample sequences, returning always the same result for the same seed (local means that it does not affect the state of other random number generations called after this function, see example).

window

training window size (if mode="rolling") or initial training window size (if mode="incremental").

increment

number of samples added to the training window at each iteration (if mode="incremental" or mode="rolling").

Details

Computes indexes for holdout data split into training and test sets.

Value

A list with the components:

  • $tr – numeric vector with the training examples indexes;

  • $ts – numeric vector with the test examples indexes;

  • $itr – numeric vector with the internal training examples indexes;

  • $val – numeric vector with the internal validation examples indexes;

Author(s)

Paulo Cortez http://www3.dsi.uminho.pt/pcortez/

References

See fit.

See Also

fit, predict.fit, mining, mgraph, mmetric, savemining, Importance.

Examples

### simple examples:
# preserves order, last two elements go into test set
H=holdout(1:10,ratio=2,internal=TRUE,mode="order")
print(H)
# no seed or NULL returns different splits:
H=holdout(1:10,ratio=2/3,mode="random")
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=NULL)
print(H)
# same seed returns identical split:
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)
H=holdout(1:10,ratio=2/3,mode="random",seed=12345)
print(H)

### classification example
## Not run: 
data(iris)
# random stratified holdout
H=holdout(iris$Species,ratio=2/3,mode="stratified") 
print(table(iris[H$tr,]$Species))
print(table(iris[H$ts,]$Species))
M=fit(Species~.,iris[H$tr,],model="rpart") # training data only
P=predict(M,iris[H$ts,]) # test data
print(mmetric(iris$Species[H$ts],P,"CONF"))

## End(Not run)

### regression example with incremental and rolling window holdout:
## Not run: 
ts=c(1,4,7,2,5,8,3,6,9,4,7,10,5,8,11,6,9)
d=CasesSeries(ts,c(1,2,3))
print(d) # with 14 examples
# incremental holdout example (growing window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="incremental",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }
# rolling holdout example (sliding window)
for(b in 1:4) # iterations
  {
   H=holdout(d$y,ratio=4,mode="rolling",iter=b,window=5,increment=2)
   M=fit(y~.,d[H$tr,],model="mlpe",search=2)
   P=predict(M,d[H$ts,])
   cat("batch :",b,"TR from:",H$tr[1],"to:",H$tr[length(H$tr)],"size:",length(H$tr),
       "TS from:",H$ts[1],"to:",H$ts[length(H$ts)],"size:",length(H$ts),
       "mae:",mmetric(d$y[H$ts],P,"MAE"),"\n")
  }

## End(Not run)

### local seed simple example
## Not run: 
# seed is defined, same sequence for N1 and N2:
# s2 generation sequence is not affected by the holdout call
set.seed(1); s1=sample(1:10,3)
set.seed(1);
N1=holdout(1:10,seed=123) # local seed
N2=holdout(1:10,seed=123) # local seed
print(N1$tr)
print(N2$tr)
s2=sample(1:10,3)
cat("s1:",s1,"\n") 
cat("s2:",s2,"\n") # s2 is equal to s1

## End(Not run)


rminer documentation built on Oct. 29, 2024, 9:06 a.m.

Related to holdout in rminer...