View source: R/collapse_groups.R
collapse_groups | R Documentation |
Collapses a set of groups into a smaller set of groups.
Attempts to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows (size).
Note: The more of these you balance at a time, the less balanced each of them may become. While, on average, the balancing works better than without, this is not guaranteed on every run. Enabling `auto_tune` can yield a much better overall balance in most contexts.
This generates a larger set of group columns using all combinations of the balancing columns and selects the most balanced group column(s).
This is slower, so we recommend enabling parallelization (see `parallel`).
While this balancing algorithm will not be optimal in all cases, it allows balancing a large number of columns at once. Especially with auto-tuning enabled, this can be very powerful.
Tip: Check the balances of the new groups with summarize_balances() and ranked_balances().
Note: The categorical and ID balancing algorithms are different to those in fold() and partition().
collapse_groups(
  data,
  n,
  group_cols,
  cat_cols = NULL,
  cat_levels = NULL,
  num_cols = NULL,
  id_cols = NULL,
  balance_size = TRUE,
  auto_tune = FALSE,
  weights = NULL,
  method = "balance",
  group_aggregation_fn = mean,
  num_new_group_cols = 1,
  unique_new_group_cols_only = TRUE,
  max_iters = 5,
  extreme_pairing_levels = 1,
  combine_method = "avg_standardized",
  col_name = ".coll_groups",
  parallel = FALSE,
  verbose = TRUE
)
data |
|
n |
Number of new groups. When |
group_cols |
Names of factors in Multiple names are treated as in Note: Do not confuse these group columns with potential columns that |
cat_cols |
Names of categorical columns to balance the average frequency of one or more levels of. |
cat_levels |
Names of the levels in the The weights are automatically scaled to sum to Can be When
|
num_cols |
Names of numerical columns to balance between groups. |
id_cols |
Names of factor columns with IDs to balance the counts of between groups. E.g. useful to get a similar number of participants in each group. |
balance_size |
Whether to balance the size of the collapsed groups. (logical) |
auto_tune |
Whether to create a larger set of collapsed group columns from all combinations of the balancing dimensions and select the overall most balanced group column(s). This tends to create much more balanced collapsed group columns. Can be slow, which is why we recommend enabling parallelization (see `parallel`). |
weights |
Named The weights are automatically scaled to sum to Dimensions that are not given a weight are automatically given the weight E.g. |
method |
After calculating a combined balancing column from each of the balancing columns (see
|
group_aggregation_fn |
Function for aggregating values in the Default is When using N.B. Only used when |
num_new_group_cols |
Number of group columns to create. When N.B. When |
unique_new_group_cols_only |
Whether to only return unique new group columns. As the number of column comparisons can be quite time consuming,
we recommend enabling parallelization. See N.B. We can end up with fewer columns than specified in
N.B. Only used when |
max_iters |
Maximum number of attempts at reaching
When only keeping unique new group columns, we risk having fewer columns than expected.
Hence, we repeatedly create the missing columns and remove those that are not unique.
This is done until we have In some cases, it is not possible to create N.B. Only used when |
extreme_pairing_levels |
How many levels of extreme pairing to do
when balancing the groups by the combined balancing column (see Extreme pairing: Rows/pairs are ordered as smallest, largest,
second smallest, second largest, etc. If N.B. Larger values work best with large datasets. If set too high, the result might not be stochastic. Always check if an increase actually makes the groups more balanced. |
combine_method |
Method to combine the balancing columns by.
One of For each balancing column (all columns in The three steps are:
|
col_name |
Name of the new group column. When creating multiple new group columns
( |
parallel |
Whether to parallelize the group column comparisons
when Especially highly recommended when Requires a registered parallel backend.
Like |
verbose |
Whether to print information about the process. May make the function slightly slower. N.B. Currently only used during auto-tuning. |
The goal of collapse_groups() is to combine existing groups into a smaller number of groups while (optionally) balancing one or more numeric, categorical and/or ID columns, along with the group size.
For each of these columns (and size), we calculate a normalized, numeric "balancing column" that, when balanced between the groups, leads to its original column being balanced as well.
To balance multiple columns at once, we combine their balancing columns with weighted averaging (see `combine_method` and `weights`) to a single combined balancing column.
Finally, we create groups where this combined balancing column is balanced between the groups, using the numerical balancing in fold().
This strategy is not guaranteed to produce balanced groups in all contexts, e.g. when the balancing columns cancel out. To increase the probability of balanced groups, we can produce multiple group columns with all combinations of the balancing columns and select the overall most balanced group column(s). We refer to this as auto-tuning (see `auto_tune`).
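As a rough illustration of why auto-tuning is slower, the sketch below enumerates every non-empty combination of a hypothetical set of balancing dimensions. The dimension labels, and the assumption that each non-empty subset yields one candidate collapsing, are ours for illustration and are not taken from the package internals.

```r
# Hypothetical balancing dimensions (labels are illustrative only)
dims <- c("size", "cat:answer", "num:score", "id:participant")

# All non-empty subsets of the dimensions, as a flat list of character vectors
combos <- unlist(
  lapply(seq_along(dims), function(k) combn(dims, k, simplify = FALSE)),
  recursive = FALSE
)

# The number of candidates grows as 2^d - 1 with d dimensions
n_candidates <- length(combos)
n_candidates
```

With four dimensions this already gives 15 candidate combinations, which is one reason a parallel backend can help.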
We find the overall most balanced group column by ranking the across-group standard deviations for each of the balancing columns, as found with summarize_balances().
Example of finding the overall most balanced group column(s):
Given a group column with the following average age per group: `c(16, 18, 25, 21)`, the standard deviation hereof (3.92) is a measure of how balanced the age column is. Another group column can thus have a lower/higher standard deviation and be considered more/less balanced.
We find the rankings of these standard deviations for all the balancing columns and average them (again weighted by `weights`). We select the group column(s) with the, on average, highest rank (i.e. lowest standard deviations).
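The ranking described above can be sketched in base R. The per-group means for `col_2`, `col_3` and the `score` dimension are made up for illustration; only `c(16, 18, 25, 21)` comes from the text. In this sketch rank 1 means the lowest standard deviation, i.e. the most balanced column.

```r
# Hypothetical per-group means for three candidate group columns
age_means <- list(
  col_1 = c(16, 18, 25, 21),  # from the text
  col_2 = c(19, 20, 21, 20),
  col_3 = c(10, 30, 22, 18)
)
score_means <- list(
  col_1 = c(50, 52, 49, 51),
  col_2 = c(40, 60, 55, 45),
  col_3 = c(48, 53, 50, 49)
)

# Across-group standard deviation per candidate and dimension
age_sd <- sapply(age_means, sd)
score_sd <- sapply(score_means, sd)

# Rank the SDs within each dimension (1 = lowest SD = most balanced)
ranks <- cbind(age = rank(age_sd), score = rank(score_sd))

# Weighted average of the ranks (weights scaled to sum to 1)
weights <- c(age = 1, score = 1)
weights <- weights / sum(weights)
avg_rank <- as.vector(ranks %*% weights[colnames(ranks)])
names(avg_rank) <- rownames(ranks)

# The candidate with the best (smallest) average rank
best <- names(avg_rank)[which.min(avg_rank)]
best
```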
We highly recommend using summarize_balances() and ranked_balances() to check how balanced the created groups are on the various dimensions.
When applying ranked_balances() to the output of summarize_balances(), we get a data.frame with the standard deviations for each balancing dimension (lower means more balanced), ordered by the average rank (see Examples).
The following describes the creation of the balancing columns for each of the supported column types:
For each column in `cat_cols`:
Count each level within each group. This creates a data.frame with one count column per level, with one row per group.
Standardize the count columns.
Average the standardized counts rowwise to create one combined column representing the balance of the levels for each group. When `cat_levels` contains weights for each of the levels, we apply weighted averaging.
Example: Consider a factor column with the levels c("A", "B", "C").
We count each level per group, normalize the counts and combine them with weighted averaging:
Group | A | B | C | -> | nA | nB | nC | -> | Combined |
1 | 5 | 57 | 1 | -> | 0.24 | 0.55 | -0.77 | -> | 0.007 |
2 | 7 | 69 | 2 | -> | 0.93 | 0.64 | -0.77 | -> | 0.267 |
3 | 2 | 34 | 14 | -> | -1.42 | 0.29 | 1.34 | -> | 0.07 |
4 | 5 | 0 | 4 | -> | 0.24 | -1.48 | 0.19 | -> | -0.35 |
... | ... | ... | ... | -> | ... | ... | ... | -> | ... |
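The three categorical steps above can be sketched in base R as follows. The toy data and the equal level weights are assumptions for illustration; the package's own implementation may differ in details.

```r
# Toy data: an existing group column and a factor column (hypothetical)
df <- data.frame(
  group = rep(1:4, times = c(5, 4, 6, 5)),
  answer = factor(c(
    "A", "A", "B", "B", "C",
    "A", "B", "B", "B",
    "A", "C", "C", "C", "B", "C",
    "A", "A", "A", "B", "C"
  ), levels = c("A", "B", "C"))
)

# 1) Count each level within each group (one row per group)
counts <- unclass(table(df$group, df$answer))

# 2) Standardize each count column (mean 0, SD 1 across groups)
std_counts <- scale(counts)

# 3) Weighted row-wise average of the standardized counts
level_weights <- c(A = 1, B = 1, C = 1)            # assumed equal weights
level_weights <- level_weights / sum(level_weights)
cat_balance <- as.vector(std_counts %*% level_weights[colnames(std_counts)])

cat_balance  # one value per group
```

Because every standardized column has mean zero, the combined column also averages to zero across the groups.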
For each column in `id_cols`:
Count the unique IDs (levels) within each group. (Note: The same ID can be counted in multiple groups.)
For each column in `num_cols`:
Aggregate the numeric columns by group using the `group_aggregation_fn`.
For size:
Count the number of rows per group.
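The ID, numeric and size balancing columns are each a simple per-group summary. A minimal base R sketch, using made-up data and `mean` as the assumed aggregation function:

```r
# Toy data (hypothetical): existing groups, participant IDs and a score
df <- data.frame(
  group = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  participant = factor(c("p1", "p1", "p2", "p2", "p3", "p1", "p3", "p4", "p4")),
  score = c(10, 20, 30, 40, 50, 60, 70, 80, 90)
)

# ID column: number of unique IDs within each group
# (the same ID, e.g. "p1", can be counted in several groups)
id_counts <- tapply(df$participant, df$group, function(x) length(unique(x)))

# Numeric column: aggregate by group with the chosen function
group_aggregation_fn <- mean
num_agg <- tapply(df$score, df$group, group_aggregation_fn)

# Size: number of rows per group
sizes <- tapply(df$score, df$group, length)

data.frame(group = names(sizes), id = id_counts, num = num_agg, size = sizes)
```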
To combine the balancing columns:
Apply standardization or MinMax scaling to each of the balancing columns (see `combine_method`).
Perform weighted averaging to get a single balancing column (see `weights`).
Example: We apply standardization and perform weighted averaging:
Group | Size | Num | Cat | ID | -> | nSize | nNum | nCat | nID | -> | Combined |
1 | 34 | 1.3 | 0.007 | 3 | -> | -0.33 | -0.82 | 0.03 | -0.46 | -> | -0.395 |
2 | 23 | 4.6 | 0.267 | 4 | -> | -1.12 | 0.34 | 1.04 | 0.0 | -> | 0.065 |
3 | 56 | 7.2 | 0.07 | 7 | -> | 1.27 | 1.26 | 0.28 | 1.39 | -> | 1.05 |
4 | 41 | 1.4 | -0.35 | 2 | -> | 0.18 | -0.79 | -1.35 | -0.93 | -> | -0.723 |
... | ... | ... | ... | ... | -> | ... | ... | ... | ... | -> | ... |
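The combination step can be sketched in base R using the four visible rows of the table above. The helper names (`standardize`, `min_max`, `combine`) and the equal weights are assumptions for illustration. With standardization, the result should land close to the "Combined" column, with small differences because the table averages rounded values.

```r
# The four visible rows of the example table
balancing <- data.frame(
  size = c(34, 23, 56, 41),
  num  = c(1.3, 4.6, 7.2, 1.4),
  cat  = c(0.007, 0.267, 0.07, -0.35),
  id   = c(3, 4, 7, 2)
)

# Two scaling options (hypothetical helper names)
standardize <- function(x) (x - mean(x)) / sd(x)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Weights are scaled to sum to 1 before averaging
weights <- c(size = 1, num = 1, cat = 1, id = 1)
weights <- weights / sum(weights)

# Scale every balancing column, then take the weighted row-wise average
combine <- function(cols, scale_fn, weights) {
  scaled <- as.matrix(as.data.frame(lapply(cols, scale_fn)))
  as.vector(scaled %*% weights[colnames(scaled)])
}

combined_std <- combine(balancing, standardize, weights)
combined_minmax <- combine(balancing, min_max, weights)

round(combined_std, 3)
round(combined_minmax, 3)
```

Raising the weight of one dimension in `weights` pulls the combined column towards that dimension's scaled values.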
Finally, we get to the group creation. There are three methods for creating groups based on the combined balancing column: "balance" (default), "ascending", and "descending".
method is "balance"
To create groups that are balanced by the combined balancing column, we use the numerical balancing in fold().
The following describes the numerical balancing in broad terms:
Rows are shuffled. Note that this will only affect rows with the same value in the combined balancing column.
Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each small+large pair gets an extreme-group identifier. (See rearrr::pair_extremes())
If `extreme_pairing_levels` > 1: These extreme-group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of the combined balancing column in the represented rows. These pairs (of pairs) get a new set of extreme-group identifiers, and the process is repeated `extreme_pairing_levels`-2 times. Note that the extreme-group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, which is why you should be careful when choosing a larger setting.
The extreme-group identifiers from the last pairing are randomly divided into the final groups and these final identifiers are transferred to the original rows.
N.B. When doing extreme pairing of an odd number of rows, the row with the smallest value is placed in a group by itself, and the order is instead: (smallest), (second smallest, largest), (third smallest, second largest), etc.
A similar approach with extreme triplets (i.e. smallest, closest to median, largest, second smallest, second closest to median, second largest, etc.) may also be utilized in some scenarios. (See rearrr::triplet_extremes())
Example: We order the data.frame by smallest "Num" value, largest "Num" value, second smallest, and so on.
We could further (when `extreme_pairing_levels` > 1) find the sum of "Num" for each pair and perform extreme pairing on the pairs.
Finally, we group the data.frame:
Group | Num | -> | Group | Num | Pair | -> | New group |
1 | -0.395 | -> | 5 | -1.23 | 1 | -> | 3 |
2 | 0.065 | -> | 3 | 1.05 | 1 | -> | 3 |
3 | 1.05 | -> | 4 | -0.723 | 2 | -> | 1 |
4 | -0.723 | -> | 2 | 0.065 | 2 | -> | 1 |
5 | -1.23 | -> | 1 | -0.395 | 3 | -> | 2 |
6 | -0.15 | -> | 6 | -0.15 | 3 | -> | 2 |
... | ... | -> | ... | ... | ... | -> | ... |
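A minimal base R sketch of one level of extreme pairing, applied to the six visible "Num" values from the table above. The function `pair_extremes_sketch()` is a hypothetical helper written for this illustration. It is not rearrr::pair_extremes(), and it skips the initial shuffling step.

```r
# Assign a pair identifier to each element: smallest with largest,
# second smallest with second largest, and so on.
# With an odd number of elements, the smallest gets a group by itself.
pair_extremes_sketch <- function(x) {
  n <- length(x)
  ord <- order(x)            # positions sorted from smallest to largest
  pair_of_rank <- numeric(n)
  lo <- 1
  hi <- n
  pair <- 1
  if (n %% 2 == 1) {
    pair_of_rank[lo] <- pair
    lo <- lo + 1
    pair <- pair + 1
  }
  while (lo < hi) {
    pair_of_rank[c(lo, hi)] <- pair
    lo <- lo + 1
    hi <- hi - 1
    pair <- pair + 1
  }
  pairs <- numeric(n)
  pairs[ord] <- pair_of_rank  # map back to the original order
  pairs
}

# Combined balancing column for the six existing groups in the table
combined <- c(-0.395, 0.065, 1.05, -0.723, -1.23, -0.15)
pairs <- pair_extremes_sketch(combined)

# Sum per pair; a second pairing level would pair these sums again
pair_sums <- tapply(combined, pairs, sum)

# Randomly divide the pairs into final groups (here: one pair per group)
set.seed(1)
group_of_pair <- sample(seq_along(pair_sums))
new_group <- group_of_pair[pairs]

data.frame(group = 1:6, num = combined, pair = pairs, new_group = new_group)
```

The pair identifiers match the "Pair" column in the table. The final group numbers depend on the random assignment.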
method is "ascending" or "descending"
These methods order the data by the combined balancing column and create groups such that the sums get increasingly larger (`ascending`) or smaller (`descending`). This will in turn lead to a pattern of increasing/decreasing sums in the balancing columns (e.g. increasing/decreasing counts of the categorical levels, counts of IDs, number of rows and sums of numeric columns).
data.frame with one or more new grouping factors.
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
fold() for creating balanced folds/groups.
partition() for creating balanced partitions.
Other grouping functions: all_groups_identical(), collapse_groups_by, fold(), group_factor(), group(), partition(), splt()
# Attach packages
library(groupdata2)
library(dplyr)
# Set seed
if (requireNamespace("xpectr", quietly = TRUE)){
  xpectr::set_test_seed(42)
}
# Create data frame
df <- data.frame(
  "participant" = factor(rep(1:20, 3)),
  "age" = rep(sample(c(1:100), 20), 3),
  "answer" = factor(sample(c("a", "b", "c", "d"), 60, replace = TRUE)),
  "score" = sample(c(1:100), 20 * 3)
)
df <- df %>% dplyr::arrange(participant)
df$session <- rep(c("1", "2", "3"), 20)
# Sample rows to get unequal sizes per participant
df <- dplyr::sample_n(df, size = 53)
# Create the initial groups (to be collapsed)
df <- fold(
  data = df,
  k = 8,
  method = "n_dist",
  id_col = "participant"
)
# Ungroup the data frame
# Otherwise `collapse_groups()` would be
# applied to each fold separately!
df <- dplyr::ungroup(df)
# NOTE: Make sure to check the examples with `auto_tune`
# in the end, as this is where the magic lies
# Collapse to 3 groups with size balancing
# Creates new `.coll_groups` column
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  balance_size = TRUE # enabled by default
)
# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = ".coll_groups",
  cat_cols = 'answer',
  num_cols = c('score', 'age'),
  id_cols = 'participant'
))
# Get ranked balances
# NOTE: When we only have a single new group column
# we don't get ranks - but this is good to use
# when comparing multiple group columns!
# The scores are standard deviations across groups
ranked_balances(coll_summary)
# Collapse to 3 groups with size + *categorical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  balance_size = TRUE,
  num_new_group_cols = 2
)
# Check balances
# To simplify the output, we only find the
# balance of the `answer` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer'
))
# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
# Rows are ranked by most to least balanced
# (i.e. lowest average SD rank)
ranked_balances(coll_summary)
# Collapse to 3 groups with size + categorical + *numerical* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  balance_size = TRUE,
  num_new_group_cols = 2
)
# Check balances
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  cat_cols = 'answer',
  num_cols = 'score'
))
# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)
# Collapse to 3 groups with size and *ID* balancing
# We create 2 new `.coll_groups_1/2` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 2
)
# Check balances
# To simplify the output, we only find the
# balance of the `participant` column
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:2),
  id_cols = 'participant'
))
# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)
###################
#### Auto-tune ####
# As you might have seen, the balancing does not always
# perform as well as we might want or need
# To get a better balance, we can enable `auto_tune`
# which will create a larger set of collapsings
# and select the most balanced new group columns
# While it is not required, we recommend
# enabling parallelization
## Not run:
# Uncomment for parallelization
# library(doParallel)
# doParallel::registerDoParallel(7) # use 7 cores
# Collapse to 3 groups with lots of balancing
# We enable `auto_tune` to get a more balanced set of columns
# We create 10 new `.coll_groups_1/2/...` columns
df_coll <- collapse_groups(
  data = df,
  n = 3,
  group_cols = ".folds",
  cat_cols = "answer",
  num_cols = "score",
  id_cols = "participant",
  balance_size = TRUE,
  num_new_group_cols = 10,
  auto_tune = TRUE,
  parallel = FALSE # Set to TRUE for parallelization!
)
# Check balances
# Here we check all the balanced columns:
# `answer`, `score` and `participant`
(coll_summary <- summarize_balances(
  data = df_coll,
  group_cols = paste0(".coll_groups_", 1:10),
  cat_cols = "answer",
  num_cols = "score",
  id_cols = 'participant'
))
# Get ranked balances
# All scores are standard deviations across groups or (average) ranks
ranked_balances(coll_summary)
# Now we can choose the .coll_groups_* column(s)
# that we favor the balance of
# and move on with our lives!
## End(Not run)