balance: Balance groups by up- and downsampling
In LudvigOlsen/groupdata2: Creating Groups from Data

balance

R Documentation

Balance groups by up- and downsampling

Description

Lifecycle: maturing

Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Has a range of methods for balancing on ID level.

Usage

balance(
  data,
  size,
  cat_col,
  id_col = NULL,
  id_method = "n_ids",
  mark_new_rows = FALSE,
  new_rows_col_name = ".new_row"
)

Arguments

`data`	`data.frame`. Can be grouped, in which case the function is applied group-wise.
`size`	Size to fix group sizes to. Can be a specific whole number or one of the following strings: `"min"`, `"max"`, `"mean"`, `"median"`.

number

Fix each group to the specified number of rows. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.

min

Fix each group to the size of the smallest group in the dataset. Uses downsampling on all groups that have too many rows.

max

Fix each group to the size of the largest group in the dataset. Uses upsampling on all groups that have too few rows.

mean

Fix each group to the mean group size in the dataset. The mean is rounded. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.

median

Fix each group to the median group size in the dataset. The median is rounded. Uses downsampling for groups with too many rows and upsampling for groups with too few rows.

`cat_col`	Name of categorical variable to balance by. (Character)
`id_col`	Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used depends on the `id_method`. E.g. if we have measured a participant multiple times and want to make sure that we keep all of these measurements, we would either remove/add all measurements for the participant or leave in all measurements for the participant. N.B. When `data` is a grouped `data.frame` (see `dplyr::group_by()`), IDs that appear in multiple groupings are considered separate entities within those groupings.
`id_method`	Method for balancing the IDs. (Character) `"n_ids"`, `"n_rows_c"`, `"distributed"`, or `"nested"`.

n_ids (default)

Balances on ID level only. It makes sure there are the same number of IDs for each category. This might lead to a different number of rows between categories.

n_rows_c

Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps:

1. If a category needs to add all its rows one or more times, the data is repeated.
2. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled.

distributed

Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.

nested

Calls `balance()` on each category with IDs as `cat_col`. I.e. if `size` is `"min"`, IDs will have the size of the smallest ID in their category.

`mark_new_rows`	Add column with `1`s for added rows, and `0`s for original rows. (Logical)
`new_rows_col_name`	Name of column marking new rows. Defaults to `".new_row"`.
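
As a rough illustration of how the `size` options relate to the observed group sizes, the base R sketch below derives a target size from the row counts per category. The helper `target_size()` is hypothetical and not part of groupdata2. Using `round()` for the mean and median is an assumption, since the documentation only states that these values are rounded.

```r
# Hypothetical helper (not part of groupdata2): derive the target
# group size from the observed group sizes, following the `size` options.
target_size <- function(group_sizes, size) {
  if (is.numeric(size)) {
    return(as.integer(size)) # a specific number of rows
  }
  switch(size,
    min = min(group_sizes),
    max = max(group_sizes),
    # Assumption: base R round(); the package's rounding rule may differ
    mean = round(mean(group_sizes)),
    median = round(median(group_sizes)),
    stop("unknown size: ", size)
  )
}

# Illustrative row counts for three categories
group_sizes <- c(a = 2, b = 5, c = 11)

target_size(group_sizes, 3)        # fixed number of rows
target_size(group_sizes, "min")    # smallest group
target_size(group_sizes, "max")    # largest group
target_size(group_sizes, "mean")   # rounded mean
target_size(group_sizes, "median") # rounded median
```

Groups larger than the resulting target would be downsampled and groups smaller than it upsampled, as described for each option above.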

Details

Without `id_col`

Upsampling is done with replacement for added rows, while the original data remains intact. Downsampling is done without replacement, meaning that rows are not duplicated but only removed.
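
The behaviour described above can be sketched in base R for a single group. This is a minimal illustration under stated assumptions, not the package's implementation: the function `resize_group()` is hypothetical, and the `.new_row` column simply mirrors the default `new_rows_col_name`.

```r
# Hypothetical sketch (not the package's implementation): resize one
# group of rows to `size` rows. Assumes the group has at least one row.
resize_group <- function(group, size) {
  n <- nrow(group)
  group$.new_row <- 0L # original rows are marked with 0
  if (n > size) {
    # Downsampling: keep `size` distinct rows, sampled without replacement
    keep <- sort(sample.int(n, size))
    return(group[keep, , drop = FALSE])
  }
  if (n < size) {
    # Upsampling: keep every original row and append rows drawn
    # with replacement from the original rows
    extra <- group[sample.int(n, size - n, replace = TRUE), , drop = FALSE]
    extra$.new_row <- 1L # added rows are marked with 1
    group <- rbind(group, extra)
  }
  group
}

set.seed(1)
small <- data.frame(id = 1:3, score = c(10, 20, 30))
up <- resize_group(small, 5)   # 3 original rows plus 2 sampled rows
down <- resize_group(small, 2) # 2 of the 3 original rows, no duplicates
```

Because added rows are drawn with replacement, the same original row can be duplicated more than once, whereas a downsampled group never contains duplicates it did not already have.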

With `id_col`

See `id_method` description.
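
To make the `"distributed"` method more concrete, the base R sketch below splits a number of lacking (or excess) rows as evenly as possible between IDs. The helper `distribute_rows()` is hypothetical and not part of groupdata2, and choosing at random which IDs receive the one extra row is an assumption; the documentation only says that some IDs will have 1 row more/less than the others.

```r
# Hypothetical helper (not part of groupdata2): split `n_rows` lacking
# (or excess) rows as evenly as possible across `ids`.
distribute_rows <- function(ids, n_rows) {
  base <- n_rows %/% length(ids)      # rows every ID receives
  remainder <- n_rows %% length(ids)  # rows left over after the even split
  per_id <- rep(base, length(ids))
  names(per_id) <- ids
  # Assumption: the IDs receiving one extra row are chosen at random
  extra <- sample.int(length(ids), remainder)
  per_id[extra] <- per_id[extra] + 1
  per_id
}

set.seed(1)
# Distribute 7 rows between three participants
per_id <- distribute_rows(c("1", "3", "5"), 7)
per_id
```

With 7 rows and 3 IDs, two IDs receive 2 rows and one ID receives 3, so no two IDs differ by more than 1 row.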

Value

data.frame with added and/or deleted rows. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Examples

# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using balance() with specific number of rows
balance(df, 3, cat_col = "diagnosis")

# Using balance() with min
balance(df, "min", cat_col = "diagnosis")

# Using balance() with max
balance(df, "max", cat_col = "diagnosis")

# Using balance() with id_method "n_ids"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids",
  mark_new_rows = TRUE
)

# Using balance() with id_method "n_rows_c"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c",
  mark_new_rows = TRUE
)

# Using balance() with id_method "distributed"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed",
  mark_new_rows = TRUE
)

# Using balance() with id_method "nested"
# With column specifying added rows
balance(df, "max",
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested",
  mark_new_rows = TRUE
)

LudvigOlsen/groupdata2 documentation built on Dec. 20, 2024, 7:12 p.m.
