stratify: Extract A Proportional Stratified Sample From A Data Set

View source: R/stratified_sample.R

stratify R Documentation


Description

Obtains a proportional stratified sample from any data convertible to "data.table" class containing categorical variables.

Usage

stratify(
  X,
  target,
  stratum = NULL,
  size,
  thresh,
  seed = NULL,
  indx = TRUE,
  dis = NULL,
  args = list(),
  ext = FALSE,
  replace = FALSE,
  verbose = TRUE
)

Arguments

X

any data array convertible to "data.table" class

target

character length 1. The name of the column considered to be the root stratum, for example the 'target' categorical column in a classification training set. This argument must always have a value

stratum

character of length <= ncol(X) - 1. Default, NULL. Names of additional categorical data columns which deepen the stratification

size

integer length 1. Default, none; the value is set by the user. When set, it is upper-bounded by the thickness of the thinnest stratum having more than one row. Setting size above this bound requires sampling with replacement (see replace)

thresh

integer, length 1. Default, none. An automatic switch between sample size calculation formulae. Can be set when size is missing from call. It can take as value any of the stratum thicknesses shown in the output message

NOTE: it is recommended that both size and thresh values are missing from call until information on stratification becomes available after first run

seed

integer length 1. Seed value for output reproducibility

indx

logical. Default TRUE, returns the sample row index only. FALSE, returns the sampled data

dis

symbol. Default NULL. One of the density or distribution functions (see stats::distributions) used for generating probability vectors for probabilistic sampling

args

list of arguments required by the distribution given in dis, as described in the stats::distributions documentation. Default, an empty list. NB: the list should never include the first argument (x or n) given in that documentation, as it is collected internally from each stratum

NOTE: Even if seed is set, the sample row index changes if either the distribution in dis or the values in args is changed

ext

logical, default FALSE. When TRUE, expands the output sampled data with the following extra columns: row - sample rows, strat - stratum, n - stratum total rows (i.e. thickness) and size - the sample size extracted from each stratum. Requires indx = FALSE

replace

logical, default FALSE. When TRUE, samples with replacement if size is present in call and exceeds the thickness of the thinnest stratum having more than one row

verbose

logical, default TRUE, display messages

Details

This utility is designed to find a true sample representation of the data under current stratification by matching closely the proportionality of strata as long as argument size is missing from call. Each distinct combination of target and stratum levels defines a stratum. For minimal stratification, argument target must always have a value present in call. All one-row strata, when formed, are simply appended to the compounded output.
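The notion of strata and stratum thickness can be inspected in base R before calling stratify. The following is a minimal sketch, not the package's internal code: it assumes mtcars with target 'cyl' and stratum c('vs', 'am'), and the variable names are illustrative only.

```r
# Illustrative base-R sketch (not the internals of stratify):
# enumerate the strata formed by 'cyl', 'vs' and 'am' in mtcars.
data(mtcars)

# One factor level per observed combination of the three columns
strata <- interaction(mtcars[c("cyl", "vs", "am")], drop = TRUE)

# Thickness = number of rows in each stratum
thickness <- table(strata)
print(sort(thickness))

# One-row strata, which the Details say are appended without sampling
one_row <- names(thickness)[thickness == 1]

# Thinnest stratum having more than one row
thinnest <- min(thickness[thickness > 1])
```

Inspecting thickness this way gives a preview of the stratum thicknesses that the function reports in its messages.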

size. As column in the extended output, it represents the size of the sample extracted from each stratum, internally derived to be proportional to stratum thickness, unbounded by the thinnest stratum with more than one row. Deep stratification along with high cardinality and imbalance may severely restrict the size of the compounded output which is the sum of all stratum sizes plus the number of one-row strata. The sampling occurs at stratum level except for one-row strata for which size = 0 is interpreted as "no sampling".

As function argument, size is interpreted as the largest sample size without replacement that can be requested, being bounded by the thinnest stratum with more than one row. The presence of size in call alters the proportionality since each stratum - except one-row strata - contributes equally to the output size which is the number of strata times the size value plus the number of one-row strata.
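The bound on size and the resulting output size can be checked by hand. The sketch below is an assumption-laden illustration in base R, not the code used by stratify: it draws the same number of rows from every stratum with more than one row and appends the one-row strata, using mtcars and illustrative variable names.

```r
# Illustrative base-R sketch of equal allocation per stratum
# (an assumption for demonstration, not the internals of stratify).
data(mtcars)
set.seed(314)

strata <- interaction(mtcars[c("cyl", "vs", "am")], drop = TRUE)
rows   <- split(seq_len(nrow(mtcars)), strata)   # row numbers per stratum
thick  <- lengths(rows)                          # stratum thicknesses

multi    <- rows[thick > 1]                      # strata that are sampled
single   <- rows[thick == 1]                     # one-row strata, appended as-is
max_size <- min(thick[thick > 1])                # bound on size without replacement

size <- max_size                                 # hypothetical request at the bound

# Sample 'size' rows without replacement from each multi-row stratum
sampled <- lapply(multi, function(r) r[sample.int(length(r), size)])
idx <- c(unlist(sampled, use.names = FALSE),
         unlist(single, use.names = FALSE))

# Output size: number of strata times size, plus the number of one-row strata
expected <- length(multi) * size + length(single)
```

Requesting a size larger than max_size would make sample.int fail for the thinnest stratum unless replace = TRUE were passed, which mirrors the constraint described for the size argument.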

thresh. Automatic switch that modifies the stratum sample size calculation method based on the extreme stratum thickness values, stratification depth and total data rows. Internally, it searches for the formula that finds at least one sample size accommodating the thinnest stratum with more than one row. Messages are displayed at runtime, although in most cases the condition is satisfied at the first iteration. When thresh >= nrow(X), each stratum is sampled in proportion to the ratio between the thinnest and thickest strata, which may lead to a relatively small output. All other thresh values compromise slightly between output size and proportionality (see Example 3).

Probabilistic Sampling

dis. The prob argument of base::sample cannot be passed through directly, since the length of the probability vector varies with stratum thickness. Instead, stratum probability vectors are determined by the distribution specified in argument dis, which associates each stratum with a probability vector whose length equals the stratum thickness. When args is missing from call, dis uses the default argument values of the respective distribution. An error is thrown when the probability vector has an insufficient number of non-zero values. See package stats, "Distributions" documentation.

NOTE: The random variate generators i.e. the r* version of distributions, generate vectors of absolute random deviate values which play the role of pseudo-probabilities conformant with the requirements listed in base::sample documentation.
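How a vector of absolute random deviates can act as sampling weights is shown below for a single stratum. This is a hedged base-R sketch, not the package's implementation: the stratum choice, the sample size of 4 and the variable names are assumptions made for illustration.

```r
# Illustrative base-R sketch: absolute random deviates used as
# pseudo-probabilities for one stratum (not the internals of stratify).
data(mtcars)
set.seed(314)

# Rows of one stratum: cyl = 8, vs = 0, am = 0
stratum_rows <- which(mtcars$cyl == 8 & mtcars$vs == 0 & mtcars$am == 0)
n <- length(stratum_rows)

# One weight per row of the stratum; base::sample accepts non-negative
# weights that need not sum to one
w <- abs(rnorm(n, mean = 1, sd = 3))

# Weighted draw without replacement (hypothetical sample size of 4)
picked <- stratum_rows[sample.int(n, size = 4, prob = w)]
```

Because the weight vector is regenerated for each stratum, its length always matches the stratum thickness, which is the requirement that a single fixed prob vector could not meet.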

Value

A proportional or non-proportional stratified sample (depending on whether size is absent from or present in call), returned either as a row index or as sampled data, compounded from random or probability samples taken from each stratum. Informative messages are displayed. Existing data row names are preserved in the output, in which case the sampled data gains a column named "rn".

See Also

sample, distributions

Examples


if (interactive()) {

# 1. Row index for sampling

data(mtcars)
rowID = stratify(mtcars
               , target = 'cyl'
               , stratum = c('vs', 'am')
               , seed = 314)                                  # display information
print(rowID)                                                  # integer

# 2. Sampled data with extra-columns

smp = stratify(mtcars
            , 'cyl'
            , c('vs', 'am')
            , seed = 314
            , indx = FALSE
            , ext = TRUE)                                     # extra columns
print(smp)
identical(rowID, smp$row)                                     # TRUE

# 3. Impact of "thresh" value on output size

sl = list()
thresholds = c(2, 4, 12, 32)                                  # stratum thicknesses

for (t in seq_along(thresholds)) {
  sl[[t]] = stratify(mtcars
                  , 'cyl'
                  , c('am', 'vs')
                  , thresh = thresholds[t]
                  , seed = 314
                  , indx = FALSE, ext = TRUE)
}
names(sl) = thresholds                                        # label by thresh value
print(sl)                                                     # stratified samples
                                                              # of various sizes

# 4. Probabilistic sampling

rowIDn = stratify(mtcars
             , 'cyl'
             , c('vs', 'am')
             , seed = 314
             , dis = pnorm                                    # Normal distribution
             , args = list(mean = 1, sd = 3))                 # no first argument!
rowIDb = stratify(mtcars
             , 'cyl'
             , c('vs', 'am')
             , seed = 314                                     # same seed
             , dis = pbeta                                    # Beta distribution
             , args = list(shape1 = 1, shape2 = 3))           # no first argument!

# Same seed but changing the distribution changes the sample row index
identical(rowIDn, rowIDb)                                     # FALSE

}


akin documentation built on May 19, 2026, 5:07 p.m.