split_strata: Split Strata

View source: R/split_strata.R

split_strata    R Documentation

Split Strata

Description

Splits pre-defined sampling strata based on values of a continuous or categorical variable.

Usage

split_strata(
  data,
  strata,
  split = NULL,
  split_var,
  type = "global quantile",
  split_at = 0.5,
  trunc = NULL
)

Arguments

data

a dataframe or matrix with one row for each sampling unit, one column specifying each unit's current stratum, one column containing the continuous or categorical values that will define the split, and any other relevant columns.

strata

a character string specifying the name of the column that defines each unit's current stratum.

split

the name of the stratum or strata to be split, exactly as they appear in strata. Defaults to NULL, which indicates that all strata in strata will be split.

split_var

a character string specifying the name of the column that should be used to define the strata splits.

type

a character string specifying how the function should interpret the split_at argument. Must be one of:

  • "global quantile", the default, splits the strata at the quantiles specified in split_at defined along the entire, unfiltered split_var column.

  • "local quantile" splits the strata at the quantiles specified in split_at defined along the filtered split_var column which only includes units in the stratum being split.

  • "value" splits the strata at the values specified in split_at along the split_var column.

  • "categorical" splits the strata into two new strata, one that contains each unit where split_var matches an input of split_at, and a second that contains every other unit.
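The difference between the quantile-based types comes down to which values the cut point is computed from. The following is a minimal base R sketch of that idea, using the iris data from the examples; it is an illustration under that assumption and not the package's internal code. The variable names (global_cut, local_cut, value_cut) are hypothetical.

```r
# Sketch only: base R illustration, not optimall internals.
x <- iris$Sepal.Width                    # plays the role of split_var
in_setosa <- iris$Species == "setosa"    # units in the stratum being split

# "global quantile": cut point taken from the entire, unfiltered column
global_cut <- quantile(x, probs = 0.5, names = FALSE)

# "local quantile": cut point taken only from units in the stratum being split
local_cut <- quantile(x[in_setosa], probs = 0.5, names = FALSE)

# "value": split_at is used directly as the cut point
value_cut <- 3.1

print(c(global = global_cut, local = local_cut, value = value_cut))
```

Because the two quantile types draw on different sets of values, the same split_at can place the cut at different points in the same stratum.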

split_at

the percentile, value, or name(s) which split_var should be split at. The interpretation of this input depends on type. For "quantile" types, input must be between 0 and 1. Defaults to 0.5 (median). For "categorical" type, the input should be a vector of values or names in split_var that define the new stratum.
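For the "categorical" type, membership in split_at can be pictured as a simple 1/0 flag per unit. The sketch below uses base R's %in% operator to build such a flag; it is an assumed illustration of the idea rather than the function's actual implementation, and the name in_split is hypothetical.

```r
# Sketch only: base R illustration, not optimall internals.
split_at <- c("virginica", "versicolor")

# 1 if the unit's split_var value is listed in split_at, 0 otherwise
in_split <- as.integer(iris$Species %in% split_at)

# Count how many units fall on each side of the categorical split
print(table(in_split))
```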

trunc

A numeric or character value specifying how the name of the split_var should be truncated when naming the new strata. If numeric, the new strata name will only include the first 'n' characters of the split_var name. If character, the specified string will be used to name the new strata instead of the split_var name. Defaults to NULL, which creates the new strata name using the entire name of the split_var column.

Details

For splits on continuous variables, the new strata are defined on left-open intervals. The only exception is the first interval, which must include the overall minimum value. The names of the newly created strata for a split generated from a continuous value are the split_var column name with the range of values defining that stratum appended to the old strata name. For a categorical split, the new strata names are the split_var column name appended to the 1/0 logical flag specifying whether the unit is in split_at, all appended to the old strata name. If the split_var column name is long, the user can specify a value for trunc to prevent the new strata names from being inconveniently long.
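The interval convention described above (left-open intervals, with the first interval closed so that it includes the minimum) matches what base R's cut() produces with right = TRUE and include.lowest = TRUE. The following sketch shows that convention on the "setosa" stratum with cut points 3.1 and 3.8; it is an illustration only, and the interval labels that cut() generates are not necessarily the strata names that split_strata() creates.

```r
# Sketch only: base R illustration of the interval convention.
x <- iris$Sepal.Width[iris$Species == "setosa"]

# Cut points bracketed by the observed minimum and maximum
breaks <- c(min(x), 3.1, 3.8, max(x))

# right = TRUE gives left-open intervals (a, b];
# include.lowest = TRUE closes the first interval so the minimum is included
bins <- cut(x, breaks = breaks, right = TRUE, include.lowest = TRUE)

print(table(bins))
```

A unit whose value equals a cut point exactly (for example 3.1) falls in the lower of the two adjacent intervals under this convention.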

Value

Returns the input dataframe with a new column named 'new_strata' that holds the name of the stratum that each sample belongs to after the split. The column containing the previous strata names is retained and given the name "old_strata".

Examples

x <- split_strata(iris, "Sepal.Length",
  strata = c("Species"),
  split = "setosa", split_var = "Sepal.Width",
  split_at = c(0.5), type = "global quantile"
)

# You can split at more than one quantile in one call.
# The call below splits the "setosa" stratum into three strata of roughly equal size
x <- split_strata(iris, "Sepal.Length",
  strata = c("Species"),
  split = "setosa", split_var = "Sepal.Width", split_at = c(0.33, 0.66),
  type = "local quantile"
)

# Manually select split values with type = "value"
x <- split_strata(iris, "Sepal.Length",
  strata = "Species",
  split = "setosa", split_var = "Sepal.Width",
  split_at = c(3.1, 3.8), type = "value"
)

# Perform a categorical split.
iris$strata <- rep(c(rep(1, times = 25), rep(0, times = 25)), times = 3)
x <- split_strata(iris, "Sepal.Length",
  strata = "strata",
  split = NULL, split_var = "Species",
  split_at = c("virginica", "versicolor"), type = "categorical"
)
# Splits each of the initial strata 1 and 0 into one stratum with the "virginica"
# and "versicolor" species and one stratum with all of the other species
# not specified in the split_at argument.

optimall documentation built on Sept. 8, 2023, 6:07 p.m.