partition | R Documentation |
Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.
partition(
  data,
  p = 0.2,
  cat_col = NULL,
  num_col = NULL,
  id_col = NULL,
  id_aggregation_fn = sum,
  extreme_pairing_levels = 1,
  force_equal = FALSE,
  list_out = TRUE
)
data |
|
p |
List or vector of partition sizes.
Given as whole number(s) and/or percentage(s) ( E.g. |
cat_col |
Name of categorical variable to balance between partitions. E.g. when training and testing a model for predicting a binary variable (a or b), we usually want both classes represented in both the training set and the test set. N.B. If also passing an |
num_col |
Name of numerical variable to balance between partitions. N.B. When used with |
id_col |
Name of factor with IDs. Used to keep all rows that share an ID in the same partition (if possible). E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same partition. N.B. When |
id_aggregation_fn |
Function for aggregating values in N.B. Only used when |
extreme_pairing_levels |
How many levels of extreme pairing to do
when balancing partitions by a numerical column (i.e. Extreme pairing: Rows/pairs are ordered as smallest, largest,
second smallest, second largest, etc. If N.B. Larger values work best with large datasets. If set too high,
the result might not be stochastic. Always check if an increase
actually makes the partitions more balanced. See |
force_equal |
Whether to discard excess data. (Logical) |
list_out |
Whether to return partitions in a N.B. When |
When only `cat_col` is specified:
`data` is subset by `cat_col`.
Subsets are partitioned and merged.
When only `id_col` is specified:
Partitions are created from unique IDs.
When only `num_col` is specified:
Rows are shuffled. Note that this will only affect rows with the same value in `num_col`.
Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each pair gets a group identifier.
If `extreme_pairing_levels` > 1: The group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of `num_col` in the represented rows. These pairs (of pairs) get a new set of group identifiers, and the process is repeated `extreme_pairing_levels` - 2 times. Note that the group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, so be careful when choosing that setting.
The final group identifiers are shuffled, and their order is applied to the full dataset.
The ordered dataset is split by the sizes in `p`.
N.B. When doing extreme pairing of an odd number of rows, the row with the largest value is placed in a group by itself, and the order is instead: smallest, second largest, second smallest, third largest, ... , largest.
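The pairing steps above can be sketched in base R. This is only an illustration of the described ordering, not the groupdata2 implementation; the helper name `extreme_pair_ids` is hypothetical, and the initial row shuffle and the final split by `p` are left out to keep the output reproducible.

```r
# Hypothetical helper (not part of groupdata2): give each value the
# identifier of its extreme pair (smallest with largest, and so on).
extreme_pair_ids <- function(x) {
  n <- length(x)
  ord <- order(x)                 # row positions, smallest to largest value
  ids <- integer(n)
  if (n %% 2 == 1) {
    # Odd count: the largest value is placed in a group by itself
    ids[ord[n]] <- (n + 1L) %/% 2L
    ord <- ord[-n]
  }
  half <- length(ord) %/% 2L
  if (half > 0) {
    lo <- ord[seq_len(half)]      # smallest, second smallest, ...
    hi <- rev(ord)[seq_len(half)] # largest remaining, next largest, ...
    ids[lo] <- seq_len(half)
    ids[hi] <- seq_len(half)
  }
  ids
}

x <- c(10, 3, 8, 1, 7, 5)

# Level 1: pair the rows
level1 <- extreme_pair_ids(x)

# Level 2: pair the pairs by the sum of their values
pair_sums <- tapply(x, level1, sum)
level2 <- extreme_pair_ids(as.numeric(pair_sums))[level1]
```

Here `level1` pairs 1 with 10, 3 with 8, and 5 with 7. At level 2 there are three pairs, an odd number, so the pair with the largest sum ends up in a group by itself.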
When `cat_col` and `id_col` are specified:
`data` is subset by `cat_col`.
Partitions are created from unique IDs in each subset.
Subsets are merged.
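As a rough base-R sketch of partitioning unique IDs within each category: the toy data, the `floor()` rounding of partition sizes, and the column name `.partitions` used here are assumptions for illustration and may differ from what partition() does internally.

```r
# Toy data: 6 participants, 3 rows each, diagnosis constant within participant
df <- data.frame(
  participant = rep(as.character(1:6), each = 3),
  diagnosis = rep(c("a", "b", "a", "a", "b", "b"), each = 3)
)

set.seed(1)
p <- 0.5  # share of IDs placed in the first partition (assumed rounding: floor)

# One row per unique ID, with its category
ids <- unique(df[, c("participant", "diagnosis")])
ids$.partitions <- NA_integer_

# Partition the IDs separately within each category
for (level in unique(ids$diagnosis)) {
  in_level <- which(ids$diagnosis == level)
  shuffled <- in_level[sample.int(length(in_level))]
  n_first <- floor(p * length(shuffled))
  ids$.partitions[shuffled] <- 2L
  ids$.partitions[shuffled[seq_len(n_first)]] <- 1L
}

# Merge: every row inherits the partition of its ID
df$.partitions <- ids$.partitions[match(df$participant, ids$participant)]
```

Because partitions are assigned to IDs rather than rows, all rows of a participant land in the same partition, and both diagnoses are represented in both partitions.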
When `cat_col` and `num_col` are specified:
`data` is subset by `cat_col`.
Subsets are partitioned by `num_col`.
Subsets are merged.
When `num_col` and `id_col` are specified:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.
The IDs are partitioned, using the aggregated values as "`num_col`".
The partition identifiers are transferred to the rows of the IDs.
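The three steps above might look like this in base R. This is a simplified sketch, not the package's code: it assumes an even number of IDs, uses `sum` as the aggregation function, does a single level of extreme pairing, and places one pair in each of two partitions. The toy data and the `.partitions` column name are illustrative.

```r
# Toy data: 4 participants with 2 scores each
df <- data.frame(
  participant = rep(c("a", "b", "c", "d"), each = 2),
  score = c(1, 2, 10, 20, 3, 4, 14, 20)
)

# Step 1: aggregate the numeric column per ID
id_totals <- aggregate(score ~ participant, data = df, FUN = sum)

# Step 2: partition the IDs, using the aggregated values as the numeric column.
# Extreme pairing (even count assumed): smallest with largest, and so on.
n_ids <- nrow(id_totals)
ord <- order(id_totals$score)
pair_id <- integer(n_ids)
pair_id[ord] <- c(seq_len(n_ids / 2), rev(seq_len(n_ids / 2)))

# Shuffle the pairs; with two pairs and p = 0.5, each pair forms one partition
set.seed(1)
shuffled_pairs <- sample(unique(pair_id))
id_totals$.partitions <- match(pair_id, shuffled_pairs)

# Step 3: transfer the partition identifiers to the rows of each ID
df$.partitions <- id_totals$.partitions[match(df$participant, id_totals$participant)]
```

In this toy data the aggregated scores are 3, 30, 7 and 34, so the pairs (a, d) and (c, b) both sum to 37 and the two partitions are balanced on `score`.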
When `cat_col`, `num_col` and `id_col` are specified:
Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.
IDs are subset by `cat_col`.
The IDs for each subset are partitioned, using the aggregated values as "`num_col`".
The partition identifiers are transferred to the rows of the IDs.
If `list_out` is TRUE: A list of partitions where partitions are data.frames.
If `list_out` is FALSE: A data.frame with grouping factor for subsetting.
N.B. When `data` is a grouped data.frame, the output is always a data.frame with a grouping factor.
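To relate the two output forms, a data.frame with a grouping factor can be turned into a list of data.frames with base R's split(). The data below is made up to resemble the `list_out = FALSE` output, with the grouping factor in a `.partitions` column as in the examples; whether the list returned by partition() itself keeps that column is not covered here.

```r
# Made-up data resembling partition(..., list_out = FALSE) output
partitioned <- data.frame(
  participant = c("1", "2", "3", "4"),
  score = c(10, 20, 30, 40),
  .partitions = factor(c(1, 2, 1, 2))
)

# One data.frame per level of the grouping factor
partition_list <- split(partitioned, partitioned$.partitions)

# Subsetting a single partition directly
first_partition <- partitioned[partitioned$.partitions == "1", ]
```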
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
Other grouping functions: all_groups_identical(), collapse_groups_by, collapse_groups(), fold(), group_factor(), group(), splt()
# Attach packages
library(groupdata2)
library(dplyr)
# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)
# Using partition()
# Without balancing
partitions <- partition(data = df, p = c(0.2, 0.3))
# With cat_col
partitions <- partition(data = df, p = 0.5, cat_col = "diagnosis")
# With id_col
partitions <- partition(data = df, p = 0.5, id_col = "participant")
# With num_col
partitions <- partition(data = df, p = 0.5, num_col = "score")
# With cat_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  id_col = "participant"
)
# With cat_col, num_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  num_col = "score",
  id_col = "participant"
)
# Return data frame with grouping factor
# with list_out = FALSE
partitions <- partition(df, c(0.5), list_out = FALSE)
# Check if additional extreme_pairing_levels
# improve the numerical balance
set.seed(2) # try with seed 1 as well
partitions_1 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 1,
  list_out = FALSE
)
partitions_1 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )
set.seed(2) # try with seed 1 as well
partitions_2 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 2,
  list_out = FALSE
)
partitions_2 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )