View source: R/stratified_sample.R
| stratify | R Documentation |
Obtains a proportional stratified sample from any data convertible to "data.table" class containing categorical variables.
stratify(
X,
target,
stratum = NULL,
size,
thresh,
seed = NULL,
indx = TRUE,
dis = NULL,
args = list(),
ext = FALSE,
replace = FALSE,
verbose = TRUE
)
X |
any data array convertible to "data.table" class |
target |
character length 1. The name of column considered to be the root stratum. For example, the name of the 'target' categorical column in a classification training set. This argument should always have a value |
stratum |
character of length <= |
size |
integer length 1. Default, none. Value set by User. In this case, it is upper-bounded by the size of the
thinnest stratum having more than one row. Setting |
thresh |
integer, length 1. Default, none. An automatic switch between sample size calculation formulae.
Can be set when NOTE: it is recommended that both |
seed |
integer length 1. Seed value for output reproducibility |
indx |
logical. Default TRUE, returns the sample row index only. FALSE, returns the sampled data |
dis |
symbol. Default NULL. One of the density or distribution functions used for generating probability vectors for probabilistic sampling |
args |
list of arguments required by distributions as described in stats::distributions documentation. Default, none. NB The list should never include the first argument (x or n) required in documentation, as it is collected internally from each stratum NOTE: Even if |
ext |
logical, default FALSE. When TRUE, expands the output sampled data with the following extra columns:
row - sample rows, strat - stratum, n - stratum total rows (i.e. thickness)
and size - the sample size extracted from each stratum. Requires |
replace |
logical, default FALSE. When TRUE, sampling with replacement if |
verbose |
logical, default TRUE, display messages |
This utility is designed to find a true sample representation of the data under current stratification
by matching closely the proportionality of strata as long as argument size is missing from call.
Each distinct combination of target and stratum levels defines a stratum. For minimal
stratification, argument target must always have a value present in call. All one-row strata, when
formed, are simply appended to the compounded output.
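How strata are formed can be illustrated with base R alone. The following is a minimal sketch, not the internal code of stratify(): it assumes mtcars as data, 'cyl' as target and c('vs', 'am') as stratum, and the variable names (key, thickness, one_row) are chosen only for this illustration.

```r
# Illustration only: base R, not the internals of stratify().
data(mtcars)

# Each distinct combination of target ('cyl') and stratum ('vs', 'am')
# levels defines one stratum; drop = TRUE keeps only observed combinations.
key <- interaction(mtcars$cyl, mtcars$vs, mtcars$am, drop = TRUE)

# Stratum thickness = number of rows falling in each stratum.
thickness <- table(key)
print(thickness)

# One-row strata cannot be sampled; according to the text above they are
# appended to the output as they are.
one_row <- names(thickness)[as.vector(thickness) == 1]
print(one_row)
```

Tabulating thickness this way before calling stratify() gives a quick view of how deep the stratification is and how many one-row strata to expect in the output.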
size. As column in the extended output, it represents the size of the sample extracted from each
stratum, internally derived to be proportional to stratum thickness, unbounded by the thinnest stratum with more
than one row. Deep stratification along with high cardinality and imbalance may severely restrict the size of the
compounded output which is the sum of all stratum sizes plus the number of one-row strata. The sampling occurs at
stratum level except for one-row strata for which size = 0 is interpreted as "no sampling".
As function argument, size is interpreted as the largest sample size without replacement that can be requested,
being bounded by the thinnest stratum with more than one row. The presence of size in call alters
the proportionality since each stratum - except one-row strata - contributes equally to the output size which is
the number of strata times the size value plus the number of one-row strata.
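The output size implied by argument size can be checked with a small base R sketch. It is an illustration under assumptions, not the package implementation: the grouping is rebuilt with interaction() and split(), the equal-size draw uses sample.int(), and all variable names are hypothetical.

```r
# Illustration only: equal-size sampling per stratum in base R.
data(mtcars)
key <- interaction(mtcars$cyl, mtcars$vs, mtcars$am, drop = TRUE)

# Row indices grouped by stratum, and the thickness of each stratum.
rows_by_stratum <- split(seq_len(nrow(mtcars)), key)
thickness <- lengths(rows_by_stratum)

# 'size' is bounded by the thinnest stratum having more than one row.
multi <- thickness > 1
max_size <- min(thickness[multi])
size <- max_size

set.seed(314)
# Draw 'size' rows without replacement from every multi-row stratum.
sampled <- lapply(rows_by_stratum[multi],
                  function(idx) idx[sample.int(length(idx), size)])

# One-row strata are appended without sampling.
out <- sort(c(unlist(sampled, use.names = FALSE),
              unlist(rows_by_stratum[!multi], use.names = FALSE)))

# Output size = (number of multi-row strata) * size + (number of one-row strata).
expected <- sum(multi) * size + sum(!multi)
stopifnot(length(out) == expected)
```

Indexing with sample.int() instead of calling sample() on the row vector avoids the known behaviour of base::sample when given a single number.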
thresh. Automatic switch that modifies stratum sample size calculation method based on the extreme stratum
thickness values, stratification depth and total data rows. Internally, it searches for the formula that finds
at least one sample size accommodating the thinnest stratum with more than one row. Messages are displayed at runtime,
although in most cases the condition is satisfied at the first iteration. When thresh >= nrow(data), each stratum
is sampled in proportion to the ratio between the thinnest and thickest strata, which may lead to a relatively small
output. All other thresh values compromise slightly between output size and proportionality (see Example 3).
dis. The prob argument in base::sample cannot be used as required since the length of probability vector
varies with stratum thickness. Herein, stratum probability vectors are determined by the distribution specified in
argument dis which associates each stratum with a probability vector of thickness length. When args is
missing from call, dis uses the default argument values for the respective distribution. An error is thrown when the
probability vector has an insufficient number of non-zero values. See package stats, "Distributions" documentation.
NOTE: The random variate generators i.e. the r* version of distributions, generate vectors of absolute random deviate values which play the role of pseudo-probabilities conformant with the requirements listed in base::sample documentation.
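The idea of a per-stratum probability vector can be sketched in base R as follows. This is not the package code: the helper stratum_prob() is hypothetical, and it ASSUMES that the first argument passed to the distribution is the vector of row positions 1..n within the stratum. What stratify() actually collects from each stratum may differ.

```r
# Hypothetical helper, not part of the package.
# ASSUMPTION: the distribution is evaluated on positions 1..n of the stratum.
stratum_prob <- function(n, dis, args = list()) {
  # The first argument is supplied here, so 'args' must not contain it.
  p <- do.call(dis, c(list(seq_len(n)), args))
  # Absolute values, so that r* generators also yield usable weights.
  abs(p)
}

# Row indices of a hypothetical four-row stratum.
idx <- c(5L, 7L, 12L, 13L)

# Probability vector whose length equals the stratum thickness.
p <- stratum_prob(length(idx), dnorm, list(mean = 1, sd = 3))

set.seed(314)
# Weighted draw without replacement; base::sample.int fails when fewer
# than 'size' weights are non-zero.
picked <- idx[sample.int(length(idx), size = 2, prob = p)]
print(picked)
```

Because the vector is rebuilt for every stratum, its length always matches the stratum thickness, which is what a single fixed prob argument cannot provide.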
A proportional or non-proportional stratified sample (depending on whether size is absent or present
in call), either as row index or as sampled data, compounded from random or probability samples taken from each
stratum. Informative messages are displayed. When the data has row names, they are preserved in the sampled
data output as a column named "rn".
sample, distributions
if (interactive()) {
# 1. Row index for sampling
data(mtcars)
rowID = stratify(mtcars
, target = 'cyl'
, stratum = c('vs', 'am')
, seed = 314) # display information
print(rowID) # integer
# 2. Sampled data with extra-columns
smp = stratify(mtcars
, 'cyl'
, c('vs', 'am')
, seed = 314
, indx = FALSE
, ext = TRUE) # extra columns
print(smp)
identical(rowID, smp$row) # TRUE
# 3. Impact of "thresh" value on output size
sl = list()
thresholds = c(2, 4, 12, 32) # stratum thicknesses
for (t in seq_along(thresholds)) {
sl[[t]] = stratify(mtcars
, 'cyl'
, c('am', 'vs')
, thresh = thresholds[t]
, seed = 314
, indx = FALSE, ext = TRUE)
}
names(sl) = thresholds # label each sample by its thresh value
print(sl) # stratified samples
# of various sizes
# 4. Probabilistic sampling
rowIDn = stratify(mtcars
, 'cyl'
, c('vs', 'am')
, seed = 314
, dis = pnorm # Normal distribution
, args = list(mean = 1, sd = 3)) # a list, no first argument!
rowIDb = stratify(mtcars
, 'cyl'
, c('vs', 'am')
, seed = 314 # same seed
, dis = pbeta # Beta distribution
, args = list(shape1 = 1, shape2 = 3)) # a list, no first argument!
# Same seed but changing the distribution changes the sample row index
identical(rowIDn, rowIDb) # FALSE
}