collapse_groups_by | R Documentation |
Collapses a set of groups into a smaller set of groups.
Balance the new groups by:
The number of rows with collapse_groups_by_size()
Numerical columns with collapse_groups_by_numeric()
One or more levels of categorical columns with collapse_groups_by_levels()
Level counts in ID columns with collapse_groups_by_ids()
Any combination of these with collapse_groups()
These functions wrap collapse_groups() to provide a simpler interface. To balance more than one of the attributes at a time and/or create multiple new unique grouping columns at once, use collapse_groups() directly.
While, on average, the balancing works better than no balancing, this is not guaranteed on every run. `auto_tune` (enabled by default) can yield a much better overall balance in most contexts. It generates a larger set of group columns using all combinations of the balancing columns and selects the most balanced group column(s). This is slower and can be sped up by enabling parallelization (see `parallel`).
Tip: When speed is more important than balancing, disable `auto_tune`.
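As a minimal sketch of the tip above, the following collapses eight initial groups into three with `auto_tune` disabled. The toy data frame and its `initial_group` and `score` columns are hypothetical and only serve the illustration; the resulting balance may be worse than with auto-tuning enabled.

```r
library(groupdata2)

# Hypothetical toy data: 40 rows spread unevenly over 8 initial groups
df_toy <- data.frame(
  initial_group = factor(rep(1:8, times = c(2, 3, 4, 5, 5, 6, 7, 8))),
  score = seq_len(40)
)

# Collapse the 8 initial groups into 3 size-balanced groups,
# skipping the (slower) auto-tuning search
df_fast <- collapse_groups_by_size(
  data = df_toy,
  n = 3,
  group_cols = "initial_group",
  auto_tune = FALSE
)

# Number of rows in each collapsed group
table(df_fast$.coll_groups)
```

Every initial group ends up in exactly one collapsed group; only the choice of which groups to merge differs from the auto-tuned result.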
Tip: Check the balances of the new groups with summarize_balances() and ranked_balances().
Note: The categorical and ID balancing algorithms are different from those in fold() and partition().
collapse_groups_by_size(
data,
n,
group_cols,
auto_tune = TRUE,
method = "balance",
col_name = ".coll_groups",
parallel = FALSE,
verbose = FALSE
)
collapse_groups_by_numeric(
data,
n,
group_cols,
num_cols,
balance_size = FALSE,
auto_tune = TRUE,
method = "balance",
group_aggregation_fn = mean,
col_name = ".coll_groups",
parallel = FALSE,
verbose = FALSE
)
collapse_groups_by_levels(
data,
n,
group_cols,
cat_cols,
cat_levels = NULL,
balance_size = FALSE,
auto_tune = TRUE,
method = "balance",
col_name = ".coll_groups",
parallel = FALSE,
verbose = FALSE
)
collapse_groups_by_ids(
data,
n,
group_cols,
id_cols,
balance_size = FALSE,
auto_tune = TRUE,
method = "balance",
col_name = ".coll_groups",
parallel = FALSE,
verbose = FALSE
)
data |
|
n |
Number of new groups. |
group_cols |
Names of factors in Multiple names are treated as in Note: Do not confuse these group columns with potential columns that |
auto_tune |
Whether to create a larger set of collapsed group columns from all combinations of the balancing dimensions and select the overall most balanced group column(s). This tends to create much more balanced collapsed group columns. Can be slow, so we recommend enabling parallelization (see |
method |
|
col_name |
Name of the new group column. When creating multiple new group columns
( |
parallel |
Whether to parallelize the group column comparisons
when Requires a registered parallel backend.
Like |
verbose |
Whether to print information about the process. May make the function slightly slower. N.B. Currently only used during auto-tuning. |
num_cols |
Names of numerical columns to balance between groups. |
balance_size |
Whether to balance the size of the collapsed groups. (logical) |
group_aggregation_fn |
Function for aggregating values in the Default is When using N.B. Only used when |
cat_cols |
Names of categorical columns to balance the average frequency of one or more levels of. |
cat_levels |
Names of the levels in the The weights are automatically scaled to sum to Can be When
|
id_cols |
Names of factor columns with IDs to balance the counts of between groups. E.g. useful to get a similar number of participants in each group. |
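To illustrate the `group_aggregation_fn` argument described above, here is a hedged sketch that passes `sum` instead of the default `mean` shown in the usage section. It assumes that any function reducing a numeric vector to a single value is accepted; the toy data and column names are hypothetical, and `auto_tune` is disabled only to keep the example quick.

```r
library(groupdata2)

set.seed(1)

# Hypothetical toy data: 8 initial groups of unequal size
df_num <- data.frame(
  initial_group = factor(rep(1:8, times = c(2, 3, 4, 5, 5, 6, 7, 8))),
  score = round(runif(40, min = 0, max = 100))
)

# Aggregate `score` per initial group with sum() before balancing
# (assumption: sum is a valid aggregation function here)
df_sums <- collapse_groups_by_numeric(
  data = df_num,
  n = 3,
  group_cols = "initial_group",
  num_cols = "score",
  group_aggregation_fn = sum,
  auto_tune = FALSE
)

# Total score per collapsed group
tapply(df_sums$score, df_sums$.coll_groups, sum)
```

Compare the totals with those from a run using the default aggregation to see which suits your data better.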
See details in collapse_groups().
`data` with a new grouping factor column.
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
Other grouping functions: all_groups_identical(), collapse_groups(), fold(), group_factor(), group(), partition(), splt()
# Attach packages
library(groupdata2)
library(dplyr)
# Set seed
if (requireNamespace("xpectr", quietly = TRUE)){
xpectr::set_test_seed(42)
}
# Create data frame
df <- data.frame(
"participant" = factor(rep(1:20, 3)),
"age" = rep(sample(c(1:100), 20), 3),
"answer" = factor(sample(c("a", "b", "c", "d"), 60, replace = TRUE)),
"score" = sample(c(1:100), 20 * 3)
)
df <- df %>% dplyr::arrange(participant)
df$session <- rep(c("1", "2", "3"), 20)
# Sample rows to get unequal sizes per participant
df <- dplyr::sample_n(df, size = 53)
# Create the initial groups (to be collapsed)
df <- fold(
data = df,
k = 8,
method = "n_dist",
id_col = "participant"
)
# Ungroup the data frame
# Otherwise `collapse_groups*()` would be
# applied to each fold separately!
df <- dplyr::ungroup(df)
# When `auto_tune` is enabled for larger datasets
# we recommend enabling parallelization
# This can be done with:
# library(doParallel)
# doParallel::registerDoParallel(7) # use 7 cores
## Not run:
# Collapse to 3 groups with size balancing
# Creates new `.coll_groups` column
df_coll <- collapse_groups_by_size(
data = df,
n = 3,
group_cols = ".folds"
)
# Check balances
(coll_summary <- summarize_balances(
data = df_coll,
group_cols = ".coll_groups"
))
# Get ranked balances
# This is most useful when having created multiple
# new group columns with `collapse_groups()`
# The scores are standard deviations across groups
ranked_balances(coll_summary)
# Collapse to 3 groups with *categorical* balancing
df_coll <- collapse_groups_by_levels(
data = df,
n = 3,
group_cols = ".folds",
cat_cols = "answer"
)
# Check balances
(coll_summary <- summarize_balances(
data = df_coll,
group_cols = ".coll_groups",
cat_cols = 'answer'
))
# Collapse to 3 groups with *numerical* balancing
# Also balance size to get similar sums
# as well as means
df_coll <- collapse_groups_by_numeric(
data = df,
n = 3,
group_cols = ".folds",
num_cols = "score",
balance_size = TRUE
)
# Check balances
(coll_summary <- summarize_balances(
data = df_coll,
group_cols = ".coll_groups",
num_cols = 'score'
))
# Collapse to 3 groups with *ID* balancing
# This should give us a similar number of IDs per group
df_coll <- collapse_groups_by_ids(
data = df,
n = 3,
group_cols = ".folds",
id_cols = "participant"
)
# Check balances
(coll_summary <- summarize_balances(
data = df_coll,
group_cols = ".coll_groups",
id_cols = 'participant'
))
# Collapse to 3 groups with balancing of ALL attributes
# We create 5 new grouping factors and compare them
# The latter is in general a good strategy even if you
# only need a single collapsed grouping factor
# as you can choose your preferred balances
# based on the summary
# NOTE: This is slow (up to a few minutes)
# consider enabling parallelization
df_coll <- collapse_groups(
data = df,
n = 3,
num_new_group_cols = 5,
group_cols = ".folds",
cat_cols = "answer",
num_cols = 'score',
id_cols = "participant",
auto_tune = TRUE # Disabled by default in `collapse_groups()`
# parallel = TRUE # Add comma above and uncomment
)
# Check balances
(coll_summary <- summarize_balances(
data = df_coll,
group_cols = paste0(".coll_groups_", 1:5),
cat_cols = "answer",
num_cols = 'score',
id_cols = 'participant'
))
# Compare the new grouping columns
# The lowest across-group standard deviation
# is the most balanced
ranked_balances(coll_summary)
## End(Not run)