partition: Create balanced partitions

View source: R/partition.R

partition    R Documentation

Create balanced partitions

Description

Lifecycle: stable

Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.

Usage

partition(
  data,
  p = 0.2,
  cat_col = NULL,
  num_col = NULL,
  id_col = NULL,
  id_aggregation_fn = sum,
  extreme_pairing_levels = 1,
  force_equal = FALSE,
  list_out = TRUE
)

Arguments

data

data.frame. Can be grouped, in which case the function is applied group-wise.

p

List or vector of partition sizes. Given as whole number(s) and/or percentage(s) (0 < `p` < 1).

E.g. c(0.2, 3, 0.1).

cat_col

Name of categorical variable to balance between partitions.

E.g. when training and testing a model for predicting a binary variable (a or b), we usually want both classes represented in both the training set and the test set.

N.B. If also passing an `id_col`, `cat_col` should be constant within each ID.

num_col

Name of numerical variable to balance between partitions.

N.B. When used with `id_col`, values in `num_col` for each ID are aggregated using `id_aggregation_fn` before being balanced.

id_col

Name of factor with IDs. Used to keep all rows that share an ID in the same partition (if possible).

E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same partition.

N.B. When `data` is a grouped data.frame (see dplyr::group_by()), IDs that appear in multiple groupings might end up in different partitions in those groupings.

id_aggregation_fn

Function for aggregating values in `num_col` for each ID, before balancing `num_col`.

N.B. Only used when `num_col` and `id_col` are both specified.

extreme_pairing_levels

How many levels of extreme pairing to do when balancing partitions by a numerical column (i.e. `num_col` is specified).

Extreme pairing: Rows/pairs are ordered as smallest, largest, second smallest, second largest, etc. If `extreme_pairing_levels` > 1, this is done "recursively" on the extreme pairs. See `Details/num_col` for more.

N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the partitions more balanced. See `Examples`.

force_equal

Whether to discard excess data. (Logical)

list_out

Whether to return partitions in a list. (Logical)

N.B. When `data` is a grouped data.frame, the output is always a data.frame with partition identifiers.

Details

cat_col

  1. `data` is subset by `cat_col`.

  2. Subsets are partitioned and merged.
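The two steps above can be sketched in base R. This is only an illustration under simplifying assumptions, not the package's internal code: the helper name `split_by_category()` is hypothetical, a single proportion `p` is used, and the size of the first partition in each subset is assumed to be `floor(p * n)`.

```r
# Hypothetical helper (not part of groupdata2): partition each level of a
# categorical column separately, then merge the per-level results.
split_by_category <- function(data, cat_col, p = 0.5) {
  # 1. Subset the data by the categorical column
  subsets <- split(data, data[[cat_col]], drop = TRUE)

  # 2. Partition each subset, then merge
  parts <- lapply(subsets, function(sub) {
    n_first <- floor(p * nrow(sub)) # assumed rounding rule
    shuffled <- sub[sample(nrow(sub)), , drop = FALSE]
    shuffled$.partitions <- factor(
      c(rep(1, n_first), rep(2, nrow(sub) - n_first)),
      levels = c(1, 2)
    )
    shuffled
  })
  merged <- do.call(rbind, parts)
  rownames(merged) <- NULL
  merged
}

set.seed(1)
toy <- data.frame(
  diagnosis = factor(rep(c("a", "b"), each = 6)),
  score = 1:12
)
res <- split_by_category(toy, "diagnosis", p = 0.5)

# Each diagnosis contributes the same number of rows to each partition
table(res$diagnosis, res$.partitions)
```

Because every level is split on its own, both classes end up represented in both partitions, which is the point of `cat_col`.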

id_col

  1. Partitions are created from unique IDs.
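As a rough base R sketch of this idea (again not the package's own implementation), the unique IDs can be partitioned first and the result mapped back to the rows. The helper name `split_by_id()` and the `floor(p * n)` rounding are assumptions made for the example.

```r
# Hypothetical helper (not part of groupdata2): assign partitions to the
# unique IDs, then transfer each ID's partition to all of its rows.
split_by_id <- function(data, id_col, p = 0.5) {
  ids <- unique(as.character(data[[id_col]]))
  ids <- ids[sample(length(ids))] # shuffle the unique IDs
  n_first <- floor(p * length(ids)) # assumed rounding rule
  id_partition <- c(rep(1, n_first), rep(2, length(ids) - n_first))
  names(id_partition) <- ids

  data$.partitions <- factor(
    unname(id_partition[as.character(data[[id_col]])]),
    levels = c(1, 2)
  )
  data
}

set.seed(1)
toy <- data.frame(
  participant = factor(rep(1:6, each = 3)),
  session = rep(1:3, times = 6)
)
res <- split_by_id(toy, "participant", p = 0.5)

# Number of distinct partitions per participant (always 1 in this sketch)
tapply(res$.partitions, res$participant, function(x) length(unique(x)))
```

Note that `p` here applies to the number of IDs, so partition sizes in rows depend on how many rows each ID has.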

num_col

  1. Rows are shuffled. Note that this will only affect rows with the same value in `num_col`.

  2. Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each pair gets a group identifier.

  3. If `extreme_pairing_levels` > 1: The group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of `num_col` in the represented rows. These pairs (of pairs) get a new set of group identifiers, and the process is repeated `extreme_pairing_levels`-2 times. Note that the group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, which is why you should be careful when choosing that setting.

  4. The final group identifiers are shuffled, and their order is applied to the full dataset.

  5. The ordered dataset is split by the sizes in `p`.

N.B. When doing extreme pairing of an odd number of rows, the row with the largest value is placed in a group by itself, and the order is instead: smallest, second largest, second smallest, third largest, ... , largest.
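The pairing itself can be illustrated with a small base R sketch. The helper `extreme_pair_ids()` is hypothetical and only mirrors the ordering described above (including the case where the largest value gets a group of its own); it leaves out the shuffling steps and is not the package's internal code.

```r
# Hypothetical helper (not part of groupdata2): pair the k-th smallest value
# with the k-th largest and return one group identifier per element.
extreme_pair_ids <- function(x) {
  n <- length(x)
  ord <- order(x) # indices from smallest to largest value
  ids <- integer(n)
  if (n %% 2 == 1) {
    # Odd number of values: the largest one gets a group of its own
    ids[ord[n]] <- (n + 1L) %/% 2L
    ord <- ord[-n]
  }
  half <- length(ord) / 2
  ids[ord[seq_len(half)]] <- seq_len(half) # smallest, second smallest, ...
  ids[rev(ord)[seq_len(half)]] <- seq_len(half) # largest, second largest, ...
  ids
}

scores <- c(10, 80, 35, 60, 20, 95)
pair_id <- extreme_pair_ids(scores)

# Rows sorted by pair: (10, 95), (20, 80), (35, 60)
data.frame(score = scores, pair = pair_id)[order(pair_id, scores), ]

# Level 2: pair the pairs by their summed scores (105, 100, 95).
# With three pairs, the one with the largest sum ends up alone.
pair_sums <- as.numeric(tapply(scores, pair_id, sum))
level_2_id <- extreme_pair_ids(pair_sums)
```

Each pair has a similar sum, so splitting the data pair by pair keeps the partitions numerically close. At level 2, every group already represents up to four rows, which shows how quickly the groups grow with `extreme_pairing_levels`.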

cat_col AND id_col

  1. `data` is subset by `cat_col`.

  2. Partitions are created from unique IDs in each subset.

  3. Subsets are merged.

cat_col AND num_col

  1. `data` is subset by `cat_col`.

  2. Subsets are partitioned by `num_col`.

  3. Subsets are merged.

num_col AND id_col

  1. Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.

  2. The IDs are partitioned, using the aggregated values as "num_col".

  3. The partition identifiers are transferred to the rows of the IDs.
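A minimal base R sketch of these three steps follows. It assumes `sum` as the aggregation function (the default `id_aggregation_fn`) and, as a stand-in for the balancing step, simply puts the IDs with the smallest and largest totals in one partition and the rest in the other. The toy data and that stand-in rule are made up for the illustration.

```r
toy <- data.frame(
  participant = factor(rep(c("1", "2", "3", "4"), each = 3)),
  score = c(1, 2, 7, 5, 5, 10, 10, 10, 10, 10, 10, 20)
)

# 1. Aggregate `score` for each participant
id_totals <- aggregate(score ~ participant, data = toy, FUN = sum)

# 2. Partition the IDs on the aggregated values
#    (stand-in rule: extremes together, middle values together)
id_totals <- id_totals[order(id_totals$score), ]
n_ids <- nrow(id_totals)
id_totals$.partitions <- factor(
  ifelse(seq_len(n_ids) %in% c(1, n_ids), 1, 2)
)

# 3. Transfer the partition identifiers to the rows of the IDs
res <- merge(
  toy,
  id_totals[, c("participant", ".partitions")],
  by = "participant"
)

# Summed score per partition
tapply(res$score, res$.partitions, sum)
```

The balancing is done on one value per ID, so all rows of a participant stay together while the partitions still end up with comparable totals.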

cat_col AND num_col AND id_col

  1. Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.

  2. IDs are subset by `cat_col`.

  3. The IDs for each subset are partitioned, by using the aggregated values as "num_col".

  4. The partition identifiers are transferred to the rows of the IDs.

Value

If `list_out` is TRUE:

A list of partitions where partitions are data.frames.

If `list_out` is FALSE:

A data.frame with a grouping factor for subsetting.

N.B. When `data` is a grouped data.frame, the output is always a data.frame with a grouping factor.

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

See Also

Other grouping functions: all_groups_identical(), collapse_groups(), collapse_groups_by, fold(), group(), group_factor(), splt()

Examples

# Attach packages
library(groupdata2)
library(dplyr)

# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)

# Using partition()

# Without balancing
partitions <- partition(data = df, p = c(0.2, 0.3))

# With cat_col
partitions <- partition(data = df, p = 0.5, cat_col = "diagnosis")

# With id_col
partitions <- partition(data = df, p = 0.5, id_col = "participant")

# With num_col
partitions <- partition(data = df, p = 0.5, num_col = "score")

# With cat_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  id_col = "participant"
)

# With cat_col, num_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  num_col = "score",
  id_col = "participant"
)

# Return data frame with grouping factor
# with list_out = FALSE
partitions <- partition(df, c(0.5), list_out = FALSE)

# Check if additional extreme_pairing_levels
# improve the numerical balance
set.seed(2) # try with seed 1 as well
partitions_1 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 1,
  list_out = FALSE
)
partitions_1 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )
set.seed(2) # try with seed 1 as well
partitions_2 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 2,
  list_out = FALSE
)
partitions_2 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )

LudvigOlsen/R-splitters documentation built on March 7, 2024, 6:59 p.m.