View source: R/patient-level_modeling.R

tof_split_data R Documentation

Split high-dimensional cytometry data into a training and test set

Description

Split high-dimensional cytometry data into a training and test set

Usage

tof_split_data(
  feature_tibble,
  split_method = c("k-fold", "bootstrap", "simple"),
  split_col,
  simple_prop = 3/4,
  num_cv_folds = 10,
  num_cv_repeats = 1L,
  num_bootstraps = 10,
  strata = NULL,
  ...
)

Arguments

feature_tibble

A tibble in which each row represents a sample- or patient-level observation, such as those produced by tof_extract_features.

split_method

Either a string or a logical vector specifying how to perform the split. If a string, valid options are k-fold cross validation ("k-fold"; the default), bootstrapping ("bootstrap"), or a single binary split ("simple"). If a logical vector, it should contain one entry for each row in 'feature_tibble' indicating whether that row should be included in the training set (TRUE) or held out for the validation/test set (FALSE). Ignored entirely if 'split_col' is specified.

split_col

The unquoted column name of the logical column in 'feature_tibble' indicating whether each row should be included in the training set (TRUE) or held out for the validation/test set (FALSE).

simple_prop

A numeric value between 0 and 1 indicating what proportion of the data should be used for training. Defaults to 3/4. Ignored if split_method is not "simple".

num_cv_folds

An integer indicating how many cross-validation folds should be used. Defaults to 10. Ignored if split_method is not "k-fold".

num_cv_repeats

An integer indicating how many independent cross-validation replicates should be used (i.e. how many times the num_cv_folds split should be repeated). Defaults to 1. Ignored if split_method is not "k-fold".

num_bootstraps

An integer indicating how many independent bootstrap replicates should be used. Defaults to 10. Ignored if split_method is not "bootstrap".

strata

An unquoted column name representing the column in feature_tibble that should be used to stratify the data splitting. Defaults to NULL (no stratification).

...

Optional additional arguments to pass to vfold_cv for k-fold cross validation, bootstraps for bootstrapping, or initial_split for simple splitting.

Value

For k-fold cross validation and bootstrapping, an "rset" object; for simple splitting, an "rsplit" object. For details, see rsample.

See Also

Other modeling functions: tof_assess_model(), tof_create_grid(), tof_predict(), tof_train_model()

Examples

feature_tibble <-
    dplyr::tibble(
        sample = as.character(1:100),
        cd45 = runif(n = 100),
        pstat5 = runif(n = 100),
        cd34 = runif(n = 100),
        outcome = (3 * cd45) + (4 * pstat5) + rnorm(100),
        class =
            as.factor(
                dplyr::if_else(outcome > median(outcome), "class1", "class2")
            ),
        multiclass =
            as.factor(
                c(rep("class1", 30), rep("class2", 30), rep("class3", 40))
            ),
        event = c(rep(0, times = 50), rep(1, times = 50)),
        time_to_event = rnorm(n = 100, mean = 10, sd = 2)
    )

# split the dataset into 10 CV folds
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "k-fold"
)
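
# a sketch of repeated cross-validation using the documented
# num_cv_folds and num_cv_repeats arguments: 5 folds, repeated twice
# (the values 5 and 2 are arbitrary choices for illustration)
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "k-fold",
    num_cv_folds = 5,
    num_cv_repeats = 2
)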

# split the dataset into 10 bootstrap resamplings
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "bootstrap"
)

# split the dataset into a single training/test set
# stratified by the "class" column
tof_split_data(
    feature_tibble = feature_tibble,
    split_method = "simple",
    strata = class
)
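
# a sketch of a pre-specified split, assuming the behavior documented
# for split_col: rows marked TRUE go to the training set and rows
# marked FALSE to the validation/test set.
# The column name "is_training" is hypothetical and is added here
# only for illustration.
feature_tibble_with_split <-
    dplyr::mutate(
        feature_tibble,
        is_training = rep(c(TRUE, FALSE), times = c(75, 25))
    )

tof_split_data(
    feature_tibble = feature_tibble_with_split,
    split_col = is_training
)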


keyes-timothy/tidytof documentation built on Aug. 28, 2024, 8:37 a.m.