cv_split_temporal: Special resampling strategy for K-fold cross-validation on...


Description

Special resampling strategy for K-fold cross-validation on time series data with stratification by target variable.

Usage

cv_split_temporal(data, y, id, time, nfolds = 5L, probs = seq(0, 1,
  length.out = 11))

Arguments

data

data.table containing the columns named by y, id and time.

y

Target variable name (character).

id

Identifier of each time series (character).

time

Time variable name (character).

nfolds

Number of folds (min 2, max 20).

probs

Numeric vector of probabilities in the [0, 1] range, used as cut points for quantile binning.

Details

Numeric target: quantile binning is used for stratification.
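As a rough illustration of stratifying by quantile bins, the base R sketch below cuts a numeric target at the quantiles given by probs and then spreads fold numbers evenly within each bin. It is not the package's internal code: the variable names (breaks, bin, fold) and the within-bin assignment are assumptions made for the example.

```r
# Minimal sketch of stratification by quantile bins (assumed logic, not package internals).
set.seed(42)
y <- rnorm(500)
probs <- seq(0, 1, length.out = 11)           # default probs: decile boundaries
breaks <- unique(quantile(y, probs = probs))  # unique() guards against tied boundaries
bin <- cut(y, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Assign fold numbers 1..nfolds within each bin, so every fold covers every stratum.
nfolds <- 5L
fold <- integer(length(y))
for (b in unique(bin)) {
  idx <- which(bin == b)
  fold[idx] <- sample(rep_len(seq_len(nfolds), length(idx)))
}

# Cross-tabulate strata against folds to check the balance.
table(bin, fold)
```

With a continuous target and the default probs, each bin holds the same number of observations, so each fold receives an equal share of every bin.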

Character/categorical target: resampling is performed within categories.

probs can be a vector such as c(0, seq(0.99, 1, length.out = 10)) for a target with a very skewed distribution, e.g. financial data where 99% of the values are 0.
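The sketch below shows why a custom probs vector helps for such a target. It uses a hypothetical vector with 99% zeros and compares the quantile boundaries produced by the default probs with those produced by the skewed-target probs. The data and variable names are invented for illustration, and the use of unique() to drop tied boundaries is an assumption rather than documented package behaviour.

```r
# Hypothetical skewed target: 9900 zeros and 100 positive values.
set.seed(1)
y <- c(rep(0, 9900), rexp(100, rate = 0.01))

# Default probs: every decile boundary except the maximum is 0,
# so only two distinct boundaries (a single bin) remain.
breaks_default <- unique(quantile(y, probs = seq(0, 1, length.out = 11)))

# Skewed-target probs: one bin for the bottom 99%, nine bins for the top 1%.
probs_skewed <- c(0, seq(0.99, 1, length.out = 10))
breaks_skewed <- unique(quantile(y, probs = probs_skewed))

bin <- cut(y, breaks = breaks_skewed, include.lowest = TRUE, labels = FALSE)
table(bin)  # the first bin collects all the zeros
```

Placing most cut points in the upper tail keeps the rare non-zero values spread over several strata instead of lumping them together with the zeros.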

When some observations from a time series fall into a validation fold, the train/validation indices for that time series are reassigned so that only its last observation is in the validation fold. This ensures that training is performed on past data and predictions are made for future observations.
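The reassignment described above can be pictured with a small data.table sketch. The raw_val column is a hypothetical fold indicator before reassignment, and val is the indicator afterwards; both names, and the one-line rule itself, are assumptions for illustration rather than the package's actual implementation.

```r
library(data.table)

dt <- data.table(
  user = rep(1:3, each = 4),
  date = rep(as.Date("2019-01-01") + 0:3, times = 3),
  # Hypothetical raw indicator before reassignment (1 = validation).
  raw_val = c(0, 1, 0, 0,   0, 0, 0, 0,   1, 0, 1, 0)
)

# For every series that has at least one validation row, keep only its
# latest observation in validation; all earlier rows go back to training.
dt[, val := as.integer(any(raw_val == 1) & date == max(date)), by = user]

dt[val == 1]  # one row per affected series: the most recent date
```

Here user 2 has no validation rows in this fold, so all of its observations stay in the training set.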

TODO: allow specifying an arbitrary number of observations for the validation set.

Value

data.table with nfolds columns. Each column is an indicator variable in which 1 marks the observations that belong to the validation set of that fold (stratified by target).
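One possible way to consume a result of this shape is to loop over the indicator columns and split the data for each fold. In the sketch below, folds is a hand-made stand-in for the returned table: the column names fold_1 and fold_2 are hypothetical, and treating every row not marked 1 as training data is an assumption.

```r
library(data.table)

dt <- data.table(user = rep(1:4, each = 3), target = 1:12)

# Hypothetical stand-in for the returned indicator table; real column names may differ.
folds <- data.table(
  fold_1 = c(0, 0, 1,   0, 0, 0,   0, 0, 1,   0, 0, 0),
  fold_2 = c(0, 0, 0,   0, 0, 1,   0, 0, 0,   0, 0, 1)
)

# For each fold, rows marked 1 form the validation set, the rest the training set.
sizes <- sapply(names(folds), function(col) {
  is_val <- folds[[col]] == 1
  is_train <- !is_val
  train <- dt[is_train]
  valid <- dt[is_val]
  c(train = nrow(train), valid = nrow(valid))
})
sizes
```

A model would normally be fitted on train and scored on valid inside the loop; only the set sizes are collected here to keep the example short.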

Examples

library(data.table)
library(resampleR)

set.seed(1)  # reproducible target values
# 100 users with 5 daily observations each
dt <- data.table(
    user = rep(1:100, each = 5),
    date = as.POSIXct(rep(seq(1.8e9, 1.8e9 + 4 * 86400, by = 86400), 100),
                      origin = "1960-01-01"),
    target = rnorm(5e2)
)
cv_split_temporal(dt, "target", "user", "date")

statist-bhfz/resampleR documentation built on Sept. 2, 2019, 8:14 p.m.