rSplit: Stratified Random Split Sampling

Description

Random split sampling into training and test sets, optionally stratified according to the type of the response.

Usage

rSplit(y, nsplit, stratify = TRUE, s_ratio = 0.8, ...)

Arguments

y

response y: a double vector, a logical vector, a factor, or a Surv object

nsplit

positive integer scalar, number of replicates of random splits to be performed

stratify

logical scalar, whether to stratify the split by response y, default TRUE

s_ratio

double scalar between 0 and 1, split ratio, i.e., proportion of training subjects p, default 0.8

...

additional parameters, currently not in use

Details

Function rSplit performs random split sampling, with or without stratification. Specifically,

  • If stratify = FALSE, or if the response y is a double vector, split the sample into a training set and a test set by odds p/(1-p), without stratification.

  • Otherwise, split a Surv response y, stratified by its censoring status. Specifically, split subjects with observed event into a training and a test set by odds p/(1-p), and split the censored subjects into a training and a test set by odds p/(1-p). Then combine the training sets from subjects with observed events and censored subjects, and combine the test sets from subjects with observed events and censored subjects.

  • Otherwise, split a logical response y, stratified by itself. Specifically, split the subjects with TRUE response into a training and a test set by odds p/(1-p), and split the subjects with FALSE response into a training and a test set by odds p/(1-p). Then combine the training sets, and the test sets, in a similar fashion as described above.

  • Otherwise, split a factor response y, stratified by its levels. Specifically, split the subjects in each level of y into a training and a test set by odds p/(1-p). Then combine the training sets, and the test sets, from all levels of y.
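The stratified cases above share one pattern: split each stratum separately by odds p/(1-p), then combine. A minimal base-R sketch of that pattern follows. The helper stratified_split_sketch is hypothetical and not part of Qindex, and the rounding of the per-stratum training size is an assumption; the actual rSplit implementation may differ.

```r
# Hypothetical helper, not part of Qindex: one stratified split.
# `strata` is any vector of stratum labels (e.g., censoring status, a logical
# response, or factor levels); `s_ratio` is the training proportion p.
stratified_split_sketch <- function(strata, s_ratio = 0.8) {
  train <- logical(length(strata))            # all FALSE, i.e., test by default
  for (idx in split(seq_along(strata), strata)) {
    n_train <- round(length(idx) * s_ratio)   # assumed rounding rule
    # sample positions within the stratum, so a length-1 stratum is handled safely
    train[idx[sample.int(length(idx), size = n_train)]] <- TRUE
  }
  train
}

set.seed(1)
y <- rep(c(TRUE, FALSE), times = c(20, 30))
train <- stratified_split_sketch(y)
table(y, train)  # cross-tabulate response against training membership
```

With p = 0.8, each stratum contributes 80% of its subjects to the training set, so the TRUE/FALSE proportions of y are preserved in both the training and the test set.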

Value

Function rSplit returns a length-nsplit list of logical vectors. In each logical vector, the TRUE elements indicate training subjects and the FALSE elements indicate test subjects.
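A sketch of how such a list of logical vectors might be used to subset a data set. The objects dat and splits below are hypothetical; splits is built by hand as a stand-in for the value of rSplit, so that the sketch runs without Qindex installed.

```r
# Stand-in for the value of rSplit(y, nsplit = 2L): a list of logical vectors.
# It is constructed by hand here, not by rSplit itself.
set.seed(1)
dat <- data.frame(x = rnorm(10), y = rep(c(TRUE, FALSE), times = 5L))  # hypothetical data
splits <- replicate(2L, sample(rep(c(TRUE, FALSE), times = c(8L, 2L))), simplify = FALSE)

# For each replicate, TRUE rows form the training set and FALSE rows the test set
train_sets <- lapply(splits, function(id) dat[id, , drop = FALSE])
test_sets  <- lapply(splits, function(id) dat[!id, , drop = FALSE])

vapply(train_sets, nrow, integer(1))  # training-set size per replicate
```

Because each element is a logical vector as long as y, negating it with ! yields the test subjects directly.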

Note

caTools::sample.split is not what we need.

See Also

split, caret::createDataPartition

Examples

rSplit(y = rep(c(TRUE, FALSE), times = c(20, 30)), nsplit = 3L)


Qindex documentation built on April 4, 2025, 2:14 a.m.