knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.path = "man/figures/README-", out.width = "100%", dpi = 92, fig.retina = 2 ) # Digits to print options("digits"=3) set.seed(1) # Get minimum R requirement dep <- as.vector(read.dcf('DESCRIPTION')[, 'Depends']) rvers <- substring(dep, 7, nchar(dep)-1) # m <- regexpr('R *\\\\(>= \\\\d+.\\\\d+.\\\\d+\\\\)', dep) # rm <- regmatches(dep, m) # rvers <- gsub('.*(\\\\d+.\\\\d+.\\\\d+).*', '\\\\1', dep) # Function for TOC # https://gist.github.com/gadenbuie/c83e078bf8c81b035e32c3fc0cf04ee8
Author: Ludvig R. Olsen ( r-pkgs@ludvigolsen.dk )
License: MIT
Started: October 2016
R package for dividing data into groups.
|Function |Description |
|:---------------------|:------------------------------------------------------------------|
|group_factor()
|Divides data into groups by a wide range of methods. |
|group()
|Creates grouping factor and adds to the given data frame. |
|splt()
|Creates grouping factor and splits the data by these groups. |
|partition()
|Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps all data points with a shared ID in the same partition. |
|fold()
|Creates folds for (repeated) cross-validation. Balances a given categorical variable and/or numerical variable between folds and keeps all data points with a shared ID in the same fold. |
|collapse_groups()
|Collapses existing groups into a smaller set of groups with categorical, numerical, ID, and size balancing. |
|balance()
|Uses up- and/or downsampling to equalize group sizes. Can balance on ID level. See wrappers: downsample()
, upsample()
.|
|Function |Description |
|:-------------------------|:------------------------------------------------------------------|
|all_groups_identical()
|Checks whether two grouping factors contain the same groups, memberwise.|
|differs_from_previous()
|Finds values, or indices of values, that differ from the previous value by some threshold(s).|
|find_starts()
|Finds values or indices of values that are not the same as the previous value.|
|find_missing_starts()
|Finds missing starts for the l_starts
method.|
|summarize_group_cols()
|Calculates summary statistics about group columns (i.e. factor
s).|
|summarize_balances()
|Summarizes the balances of numeric, categorical, and ID columns in and between groups in one or more group columns. |
|ranked_balances()
|Extracts the standard deviations from the Summary
data frame from the output of summarize_balances()
|
|%primes%
|Finds remainder for the primes
method. |
|%staircase%
|Finds remainder for the staircase
method.|
groupdata2:::render_toc("README.Rmd")
CRAN version:
install.packages("groupdata2")
Development version:
install.packages("devtools")
devtools::install_github("LudvigOlsen/groupdata2")
groupdata2
contains a number of vignettes with relevant use cases and descriptions:
vignette(package = "groupdata2")
# for an overview
vignette("introduction_to_groupdata2")
# begin here
# Attach packages library(groupdata2) library(dplyr) # %>% filter() arrange() summarize() library(knitr) # kable()
# Create small data frame df_small <- data.frame( "x" = c(1:12), "species" = rep(c('cat', 'pig', 'human'), 4), "age" = sample(c(1:100), 12), stringsAsFactors = FALSE )
# Create medium data frame df_medium <- data.frame( "participant" = factor(rep(c('1', '2', '3', '4', '5', '6'), 3)), "age" = rep(c(20, 33, 27, 21, 32, 25), 3), "diagnosis" = factor(rep(c('a', 'b', 'a', 'b', 'b', 'a'), 3)), "diagnosis2" = factor(sample(c('x','z','y'), 18, replace = TRUE)), "score" = c(10, 24, 15, 35, 24, 14, 24, 40, 30, 50, 54, 25, 45, 67, 40, 78, 62, 30)) df_medium <- df_medium %>% arrange(participant) df_medium$session <- rep(c('1','2', '3'), 6)
Returns a factor with group numbers, e.g. factor(c(1,1,1,2,2,2,3,3,3))
.
This can be used to subset, aggregate, group_by, etc.
Create equally sized groups by setting force_equal = TRUE
Randomize grouping factor by setting randomize = TRUE
# Create grouping factor group_factor( data = df_small, n = 5, method = "n_dist" )
Creates a grouping factor and adds it to the given data frame. The data frame is grouped by the grouping factor for easy use in magrittr
(%>%
) pipelines.
# Use group() group(data = df_small, n = 5, method = 'n_dist') %>% kable()
# Use group() in a pipeline # Get average age per group df_small %>% group(n = 5, method = 'n_dist') %>% dplyr::summarise(mean_age = mean(age)) %>% kable()
# Using group() with 'l_starts' method # Starts group at the first 'cat', # then skips to the second appearance of "pig" after "cat", # then starts at the following "cat". df_small %>% group(n = list("cat", c("pig", 2), "cat"), method = 'l_starts', starts_col = "species") %>% kable()
Creates the specified groups with group_factor()
and splits the given data by the grouping factor with base::split
. Returns the splits in a list.
splt(data = df_small, n = 3, method = 'n_dist') %>% kable()
Creates (optionally) balanced partitions (e.g. training/test sets). Balance partitions on categorical variable(s) and/or a numerical variable. Make sure that all datapoints sharing an ID is in the same partition.
# First set seed to ensure reproducibility set.seed(1) # Use partition() with categorical and numerical balancing, # while ensuring all rows per ID are in the same partition df_partitioned <- partition( data = df_medium, p = 0.7, cat_col = 'diagnosis', num_col = "age", id_col = 'participant' ) df_partitioned %>% kable()
Creates (optionally) balanced folds for use in cross-validation. Balance folds on categorical variable(s) and/or a numerical variable. Ensure that all datapoints sharing an ID is in the same fold. Create multiple unique fold columns at once, e.g. for repeated cross-validation.
# First set seed to ensure reproducibility set.seed(1) # Use fold() with categorical and numerical balancing, # while ensuring all rows per ID are in the same fold df_folded <- fold( data = df_medium, k = 3, cat_col = 'diagnosis', num_col = "age", id_col = 'participant' ) # Show df_folded ordered by folds df_folded %>% arrange(.folds) %>% kable()
# Show distribution of diagnoses and participants df_folded %>% group_by(.folds) %>% count(diagnosis, participant) %>% kable()
# Show age representation in folds # Notice that we would get a more even distribution if we had more data. # As age is fixed per ID, we only have 3 ages per category to balance with. df_folded %>% group_by(.folds) %>% summarize(mean_age = mean(age), sd_age = sd(age)) %>% kable()
Notice, that the we now have the opportunity to include the session variable and/or use participant as a random effect in our model when doing cross-validation, as any participant will only appear in one fold.
We also have a balance in the representation of each diagnosis, which could give us better, more consistent results.
Collapses a set of groups into a smaller set of groups while attempting to balance the new groups by specified numerical columns, categorical columns, level counts in ID columns, and/or the number of rows.
# We consider each participant a group # and collapse them into 3 new groups # We balance the number of levels in diagnosis2 column, # as this diagnosis is not constant within the participants df_collapsed <- collapse_groups( data = df_medium, n = 3, group_cols = 'participant', cat_cols = 'diagnosis2', num_cols = "score" ) # Show df_collapsed ordered by new collapsed groups df_collapsed %>% arrange(.coll_groups) %>% kable() # Summarize the balances of the new groups coll_summ <- df_collapsed %>% summarize_balances(group_cols = '.coll_groups', cat_cols = "diagnosis2", num_cols = "score") coll_summ$Groups %>% kable() coll_summ$Summary %>% kable() # Check the across-groups standard deviations # This is a measure of how balanced the groups are (lower == more balanced) # and is especially useful when comparing multiple group columns coll_summ %>% ranked_balances() %>% kable()
Recommended: By enabling the auto_tune
setting, we often get a much better balance.
Uses up- and/or downsampling to fix the group sizes to the min, max, mean, or median group size or to a specific number of rows. Balancing can also happen on the ID level, e.g. to ensure the same number of IDs in each category.
# Lets first unbalance the dataset by removing some rows df_b <- df_medium %>% arrange(diagnosis) %>% filter(!row_number() %in% c(5,7,8,13,14,16,17,18)) # Show distribution of diagnoses and participants df_b %>% count(diagnosis, participant) %>% kable()
# First set seed to ensure reproducibility set.seed(1) # Downsampling by diagnosis balance( data = df_b, size = "min", cat_col = "diagnosis" ) %>% count(diagnosis, participant) %>% kable()
# Downsampling the IDs balance( data = df_b, size = "min", cat_col = "diagnosis", id_col = "participant", id_method = "n_ids" ) %>% count(diagnosis, participant) %>% kable()
There are currently 10 methods available. They can be divided into 6 categories.
Examples of group sizes are based on a vector with 57 elements.
Divides up the data greedily given a specified group size.
E.g. group sizes: 10, 10, 10, 10, 10, 7
Divides the data into a specified number of groups and distributes excess data points across groups.
E.g. group sizes: 11, 11, 12, 11, 12
Divides the data into a specified number of groups and fills up groups with excess data points from the beginning.
E.g. group sizes: 12, 12, 11, 11, 11
Divides the data into a specified number of groups. The algorithm finds the most equal group sizes possible, using all data points. Only the last group is able to differ in size.
E.g. group sizes: 11, 11, 11, 11, 13
Divides the data into a specified number of groups. Excess data points are placed randomly in groups (only 1 per group).
E.g. group sizes: 12, 11, 11, 11, 12
Uses a list / vector of group sizes to divide up the data.
Excess data points are placed in an extra group.
E.g. n = c(11, 11)
returns group sizes: 11, 11, 35
Uses a list of starting positions to divide up the data.
Starting positions are values in a vector (e.g. column in data frame).
Skip to a specific nth appearance of a value by using c(value, skip_to)
.
E.g. n = c(11, 15, 27, 43)
returns group sizes: 10, 4, 12, 16, 15
Identical to n = list(11, 15, c(27, 1), 43
where 1
specifies that we
want the first appearance of 27 after the previous value 15.
If passing n = "auto"
starting positions are automatically found with find_starts()
.
Every n
th data point is combined to a group.
E.g. group sizes: 12, 12, 11, 11, 11
Uses step_size to divide up the data. Group size increases with 1 step for every group, until there is no more data.
E.g. group sizes: 5, 10, 15, 20, 7
Creates groups with sizes corresponding to prime numbers.
Starts at n
(prime number). Increases to the the next prime number until there is no more data.
E.g. group sizes: 5, 7, 11, 13, 17, 4
There are currently 4 methods for balancing (up-/downsampling) on ID level in balance()
.
Balances on ID level only. It makes sure there are the same number of IDs in each category. This might lead to a different number of rows between categories.
Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done with repetition and by iteratively picking the ID with the number of rows closest to the lacking/excessive number of rows in the category.
Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.
Balances the IDs within their categories, meaning that all IDs in a category will have the same number of rows.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.