knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(optimall) library(DiagrammeR) library(survey) library(datasets) library(dplyr) data(MatWgt_Sim, package = "optimall")
When a study population is composed of heterogeneous subpopulations, stratified random sampling techniques are often employed to obtain more precise estimates of population characteristics. Efficiently allocating samples to strata under this method is a crucial step in the study design, especially when sampling is expensive.
optimall
offers a collection of functions that are designed to streamline the process of optimum sample allocation, for a single wave or an adaptive, multi-wave approach. Its main functions allow users to:
Define, split, and merge strata based on values or percentiles of other variables.
Calculate the optimum number of samples to allocate to each stratum in a given study in order to minimize the variance of the target sample mean.
Select specific IDs to sample based on a stratified sampling design.
Optimally allocate a fixed number of samples to a sampling wave based on results from a prior wave.
When used together, these functions can automate most of the sampling workflow. This vignette will outline these package features one by one (moving from simple to more complex), introduce the theoretical framework behind the functions, and finally demonstrate how they can be used collectively to perform multi-wave sampling in R. The final section then details how to use optimall_shiny()
to efficiently make decisions about splitting strata.
In stratified sampling, strata are typically defined on values or quantiles of inexpensive variables available for the entire population. Given a dataset with one row per sampling unit and at least one column containing a variable that can be used to construct strata, optimall
allows users to easily define, split, and merge strata.
To demonstrate the simple process of refining strata in optimall
, we use the iris
dataset from the R package datasets
. Suppose that we have defined three strata based on "Species"
, but we want to split two out of the three, setosa and virginica, in half based on the within-stratum median of "Sepal.Width"
. In optimall
, we can do this by calling the split_strata()
function:
library(optimall) library(datasets) iris <- datasets::iris table(iris$Species) iris2 <- split_strata(data = iris, strata = "Species", split = c("setosa", "virginica"), split_var = "Sepal.Width", split_at = c(0.5), type = "local quantile") table(iris2$new_strata)
Similarly, we can merge strata using the function merge_strata()
:
iris3 <- merge_strata(data = iris, strata = "Species", merge = c("setosa", "versicolor"), name = "set_or_vers") table(iris3$new_strata)
The split_strata
and merge_strata()
functions are quite versatile. See function documentation for more information.
Assuming that the per-unit sampling cost is the same in each stratum and that $S_h$, the standard deviation of the variable of interest within each stratum, can be estimated, Neyman (1934) presented the following solution to optimally allocate $n$ samples among $H$ strata in order to minimize the variance of the sample mean:
$$n_h = n \frac{N_hS_h}{\sum_{i=1}^H N_iS_i}.$$
This formula is known as Neyman allocation and is one of the functions available in optimall
. Neyman allocation offers the advantage of outputting sampling fractions that can later be multiplied by $n$ or taken on their own if $n$ is not known.
While Neyman allocation has a strong theoretical backing, Wright (2014) points out some limitations that make it sub-optimal in practice.
Wright offers alternative algorithms that solve the optimal problem for a discrete allocation that sums exactly to $n$, which can also be implemented in optimall
. Essentially, his approaches use linear constraints to optimize the allocation of samples over a space of integer values. He makes use of within-stratum variance and population stratum size to generate priority values, which in turn dictate how many samples should be taken from each stratum. To learn more about the specifics of these algorithms, see Wright (2014).
The optimall
package allows users to select between Neyman allocation, Wright Algorithm I, and Wright Algorithm II in the method
argument of the optimum_allocation()
function. The optimum_allocation()
function defaults to using Wright Algorithm II. This algorithm requires that at least 2 samples are taken from each stratum. In optimall
, stratum sampling sizes for both Wright algorithms are also constrained from above at $N_h$, the population stratum size, using the methods for constraints that Wright details in Algorithm III.
Below is an example of how optimum_allocation()
could be called to optimally allocate 40 samples among "Species"
in the iris
dataset, minimizing the variance of the "Sepal.Width"
sample mean.
sampling_design <- optimum_allocation(data = iris, strata = "Species", y = "Sepal.Width", nsample = 40, method = "WrightII") sampling_design
If we have a dataframe that holds the $N_h$ and $sd_h$ for each stratum instead of data for each individual unit, we can still use optimum_allocation()
:
iris_summary <- data.frame(strata = unique(iris$Species), size = c(50, 50, 50), sd = c(0.3791, 0.3138, 0.3225)) optimum_allocation(data = iris_summary, strata = "strata", sd_h = "sd", N_h = "size", nsample = 40, method = "WrightII")
With the number of units to sample per stratum specified in a dataframe, optimall
can select the IDs of the units to be sampled using simple random sampling within strata with the function sample_strata()
:
iris$id <- 1:150 set.seed(743) iris <- sample_strata(data = iris, strata = "Species", id = "id", design_data = sampling_design, design_strata = "strata", n_allocated = "stratum_size")
The output of sample_strata()
is the same input dataframe with a new column called "sample_indicator"
that holds a binary indicator for whether each unit should be sampled in the specified wave:
head(iris)
From this dataframe, we can easily extract a vector of the ids to sample:
ids_to_sample <- iris$id[iris$sample_indicator == 1] head(ids_to_sample) length(ids_to_sample)
Note that design_data
is the output of optimum_allocation()
in this example, but in practice it can be any dataframe that has one row per stratum and one column specifying each stratum's desired sample size. As such, any method of allocating samples to strata can be implemented by sample_strata()
as long as the design dataframe is specified and simple random sampling within strata is used.
When measuring variables of interest is expensive or difficult, it can be advantageous to employ a multi-phase design, where cheaper variables are collected from the entire population and expensive variables are collected from subsamples selected through adaptive, multi-wave sampling. This approach, which is documented in McIsaac and Cook (2015), involves multiple waves of sampling where information from prior waves is used to inform the optimum sampling design of subsequent ones. With the understanding that both Neyman and Wright's optimum allocation methods depend heavily on standard deviation estimates for the variable of interest, the benefit of multi-wave sampling is clear to see. Sample allocations based on prior waves will use updated estimates of nuisance parameters that incorporate data accumulated in the prior sampling waves. The optimal sampling proportion is updated at the end of each wave, and this update guides the sampling of the next wave. As the phase II data accumulate, the necessary within-strata SD estimates, and thus the estimated optimal sampling proportions, are expected to be closer to their true value.
In the design described by McIsaac and Cook, a large phase-I sample is first taken to measure the inexpensive covariates and/or outcome. The results of this phase will define strata which are then sampled non-optimally (through proportional or balanced sampling) for measurement of the expensive variable in phase-IIa. The phase-IIa results are used to estimate the standard deviation required to optimally allocate the next wave of samples. This process is iterated until the desired sample size for the variable of interest is achieved. Below is an outline of the workflow, which is facilitated by optimall
:
{width=55%}
DiagrammeR::grViz("digraph { graph [layout = dot, rankdir = TB] node [shape = rectangle, fixedsize = true, width = 8.5, height = 1.2, fontname = Helvetica, fontsize = 20] rec1 [label = <<b>Phase-I:</b><br/>A large sample of inexpensive variables, <br/> define strata based on results.>] rec2 [label = <<b>Phase-IIa:</b><br/>Initial stratified sample of expensive variable taken without <br/>optimum allocation>] rec3 [label = <<b>Phase-IIb:</b><br/>Second sample of expensive variable with optimum allocation <br/> based on Phase-IIa results>] rec4 [label = <<b>Phase-IIc - Phase-IIfinal:</b><br/>Combine previous waves of Phase-II together and repeat <br/> sampling with re-calculated optimum allocation until desired total <br/> sample size for expensive variable is reached at Phase-IIfinal>] # edge definitions with the node IDs rec1 -> rec2 -> rec3 -> rec4 }", height = 250, width = 700)
optimall
allows users to input phase-I data and iteratively allocate samples for subsequent waves with the function allocate_wave()
. This function runs integer-valued optimum_allocation()
on a dataset according to Wright's Algorithm II, but it takes into account previous sampling waves to determine how the current wave of samples should be allocated.
For a simple example, suppose that we again want to allocate 40 samples to minimize the variance of "Sepal.Width"
in the iris
dataset, but 30 out of the 40 have already been sampled. We assume that the strata are still defined only by "Species"
, and that 16 of the first 30 samples were taken from the virginica species, 7 from setosa, and 7 from versicolor. Assuming we only have those 30 samples to base our within-stratum standard deviation estimates on, we can allocate the next 10 samples using allocate_wave()
:
# Set up Wave 1 wave1_design <- data.frame(strata = c("setosa", "virginica", "versicolor"), stratum_size = c(7, 16, 7)) # Collect Sepal.Width from only the 30 samples phase1_data <- subset(datasets::iris, select = -Sepal.Width) phase1_data$id <- 1:nrow(phase1_data) #Add id column set.seed(234) phase1_data <- sample_strata(data = phase1_data, strata = "Species", id = "id", design_data = wave1_design, design_strata = "strata", n_allocated = "stratum_size") wave1_ids <- iris$id[phase1_data$sample_indicator == 1] wave1_sampled_data <- iris[iris$id %in% wave1_ids, c("id","Sepal.Width")] wave1_data <- merge(phase1_data, wave1_sampled_data, by = "id", no.dups = TRUE, all.x = TRUE) # We have our 30 samples table(is.na(wave1_data$Sepal.Width), wave1_data$Species) # Now, allocate the next 10: wave2_design <- allocate_wave(data = wave1_data, strata = "Species", y = "Sepal.Width", already_sampled = "sample_indicator", nsample = 10, detailed = TRUE) wave2_design
Notice that in this case, "nsample_optimal"
does not match "nsample_actual"
for every stratum in the outputted design dataframe. This occurred because we oversampled from the virginica stratum in wave 1, meaning that the optimum stratum sample size among 40 total samples was smaller than the amount of samples already taken in that group. When oversampling occurs, allocate_wave()
recalculates the optimum allocation among the non-oversampled strata and uses that result to allocate the specified number of samples as optimally as possible. When no oversampling occurs, "nsample_optimal"
will match "nsample_actual"
for every stratum.
Now we can easily get the 10 new ids to sample using the wave2_design
as the design_data
in sample_strata()
. Notice that we do not have to specify design_strata
or n_allocated
because the default arguments are the column names of the allocate_wave()
output:
# Run sample_strata wave2_data <- sample_strata(data = wave1_data, strata = "Species", id = "id", already_sampled = "sample_indicator", design_data = wave2_design) # Extract the 10 ids to sample wave2_ids <- iris$id[wave2_data$sample_indicator == 1] wave2_ids
optimall
to Determine Study Design and Sample DataWe use the simulated dataset MatWgt_Sim
to demonstrate how the functions in optimall
can be used to carry out study design and sampling tasks in R. This section includes three examples, which build upon eachother during the design and execution of a multi-wave survey:
Example 1: Uses the function split_strata()
to define and refine strata.
Example 2: Uses the function sample_strata()
to randomly select units to be sampled based on a specified sampling design.
Example 3: Demonstrates how to conduct an adaptive, multi-wave sampling design that uses an optimal Neyman allocation scheme at each step to allocate the next wave across strata.
The dataset contains simulated data based on a real study of the association between maternal weight gain during pregnancy and the risk of childhood obesity after controlling for a number of clinical and demographic covariates. In this hypothetical example, the study data are obtained from electronic health records, which are known to be error-prone and do not have all the variables of interested available without manual chart review. In a perfect world, chart review would be used to validate and obtain the necessary variables on every observation in our sample, but chart review is an expensive and difficult task. Researchers determined that they could only reasonably afford to validate 750 out of the 10,335 child-mother pairs. We refer to the 10,335 as the phase 1 sample, for which we have error-prone observations on all, and we refer to the 750 as the phase II sample, which will be designed using tools in the optimall
package. For simplicity, suppose that we want to use these 750 samples to estimate the true population mean of maternal weight change during pregnancy with minimal variance. This goal will be accomplished in our last example through an adaptive, multi-wave sampling design.
The simulated data used in this example is included with optimall
. The dataset MatWgt_Sim
contains 10335 rows, one for each mother-child pair, and 6 columns containing ID numbers and covariates which we see below:
data(MatWgt_Sim, package = "optimall") head(MatWgt_Sim)
Three of the covariates, child race, diabetes, and estimated maternal weight, are inexpensive to collect for all subjects, but determining the true maternal weight requires a tedious validation process. This full version of the simulated dataset contains the true weight change for all 10335 mothers, which may or may not have a relationship with the other covariates.
For the purpose of our examples, we suppose that we do not have access to this full dataset. Instead, we must sample from it to optimally estimate the mean of the true maternal weight changes.
During phase-I, we assume that all 10335 mother-child pairs have been sampled for the inexpensive covariates. Accordingly, we define our phase 1 dataset to be 10335 x 5 with every column of the full MatWgt_Sim
data excluding the expensive true weight change. In the examples that follow, we will assume the validated maternal weight variable mat_weight_true
will only be available through phase 2 sampling.
# Get phase 1 data, remove mat_weight_true which we assume is unknown phase1 <- subset(MatWgt_Sim, select = -mat_weight_true) dim(phase1)
We suspect that the true mean maternal weight change may vary with some of the inexpensive covariates in the phase 1 dataset, so we decide to split our population into 9 non-overlapping strata. Each stratum will be defined by a unique combination of race and global percentile of maternal weight gain (≤25th, 25th - 75th, >75th). We can accomplish this quickly in optimall
using the split_strata()
function.
phase1$strata <- phase1$race #initialize a strata column first # Merge the smallest race categories phase1 <- merge_strata(data = phase1, strata = "race", merge = c("Other","Asian"), name = "Other") phase1 <- split_strata(data = phase1, strata = "new_strata", split = NULL, split_var = "mat_weight_est", type = "global quantile", split_at = c(0.25,0.75), trunc = "MWC_est") # Trunc argument specifies how to refer to mat_weight_est in new strata names
We now have the same phase1
data with a new column specifying the strata we have defined.
head(phase1) table(phase1$new_strata) # 9 strata
With the strata now defined by the inexpensive variables sampled in phase-I, we are ready to begin auditing patient records for validation of the true maternal weight changes. Without any validated data to define the optimum phase-IIa sample allocation, we decide to use proportional stratified sampling for the first 250 out of our 750 audits. Conveniently, optimall
can select this random sample for us with sample_strata()
.
The sample_strata()
function requires as input two dataframes, one containing the stratum membership of each individual unit (data
) and a second specifying the sampling design (design_data
). The design_data
dataframe must contain at least two columns:
design_strata
: A column holding strata namesn_allocated
: The total $n$ allocated to each stratum.In our example, we want the proportion of samples randomly taken from each stratum to reflect its population proportion and a total of 250 samples.
phase2a_design <- data.frame( strata_name = names(table(phase1$new_strata)), strata_prop = as.vector(table(phase1$new_strata))/10335, strata_n = round(250.3*as.vector(table(phase1$new_strata))/10335) ) # 250.3 to make sure 250 samples after rounding sum(phase2a_design$strata_n) phase2a_design
We can now call sample_strata()
to randomly draw the specified number samples from each stratum. It will output the same phase1
dataframe with an extra column indicating which units should be sampled. We can then extract the ids to sample based on this indicator.
phase1 <- sample_strata(data = phase1, strata = "new_strata", id = "id", design_data = phase2a_design, design_strata = "strata_name", n_allocated = "strata_n") ids_to_sample2a <- phase1[phase1$sample_indicator == 1,"id"] length(ids_to_sample2a) # 250 ids to sample
We submit these 250 ids for validation. Hypothetically, this involves the hard work of trained nurses, but for our example it involves only a few lines of code.
phase2a_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2a, c("id","mat_weight_true")] phase2a <- merge(phase1, phase2a_samples, by = "id", no.dups = TRUE, all.x = TRUE) names(phase2a) table(is.na(phase2a$mat_weight_true))
We notice that all of the units that were not validated have NA
values in the new columns. We will use the non-missing validated data to inform optimum allocation in our future sampling waves.
In phase-IIa, we allocated samples to strata using proportional sampling rather than optimum allocation because we had no prior samples of the true maternal weight change from which to estimate within-stratum variances. Now that we have completed the first wave of validation, we can estimate these variances and will thus use optimum allocation for each of the subsequent sampling waves.
In phase-IIb (wave 2), we will optimally allocate 250 more samples, raising our total number of validated samples to 500. The allocate_wave()
function makes this step simple by calculating the optimum allocation for 500 samples, determining how many units have already been sampled in previous waves (only Phase-IIa in this case), and allocating the 250 samples of the current wave to make up the difference. The output is a design dataframe summarizing the results for each stratum.
phase2b <- phase2a # Add indicator for units that were already sampled phase2b$already_sampled <- ifelse(phase2b$id %in% ids_to_sample2a, 1, 0) # Make design phase2b_design <- allocate_wave(data = phase2b, strata = "new_strata", y = "mat_weight_true", already_sampled = "already_sampled", nsample = 250, detailed = TRUE) phase2b_design
Looking at the output of allocate_wave()
, we notice that the optimum sample sizes for the next wave vary greatly between strata, but all are greater than one. Because we set detailed = TRUE
, we can also see what the optimum allocation for 500 samples would have been ignoring prior wave sizes in the column "nsample_optimal"
. This column matches "nsample_actual"
because we did not oversample any strata in phase-IIa.
In other cases when "n_to_sample"
= 0 for some strata, we may not be so lucky. If the optimum sample size in a stratum is smaller than the amount it was allocated in previous waves ("nsample_prior"
), we say that that stratum has been oversampled. When oversampling occurs, allocate_wave()
"closes" the oversampled strata and re-allocates the remaining samples optimally among the open strata. Under these circumstances, the total sampling allocation is no longer optimal, but optimall
will output the most optimal allocation from the non-oversampled strata for the next wave.
Although we don't see evidence of oversampling, these results suggest that our strata are relatively imbalanced. We decide to split the largest stratum into three similarly-sized strata along mat_weight_est using the local quantile
option of split_strata()
.
phase2b <- split_strata(data = phase2b, split = "White.MWC_est_(9.75,15.06]", strata = "new_strata", split_var = "mat_weight_est", type = "local quantile", split_at = c(1/3,2/3))
With our strata now more balanced in size, we can re-run allocate_wave()
to make a new design dataframe for sample_strata()
to use to determine which ids to sample in phase-IIb.
phase2b_design <- allocate_wave(data = phase2b, strata = "new_strata", y = "mat_weight_true", already_sampled = "already_sampled", nsample = 250) phase2b_design # Extract IDs to sample ids_to_sample2b <- sample_strata(data = phase2b, strata = "new_strata", id = "id", already_sampled = "already_sampled", design_data = phase2b_design, design_strata = "strata", n_allocated = "n_to_sample") ids_to_sample2b <- ids_to_sample2b[ids_to_sample2b$sample_indicator == 1,"id"] length(ids_to_sample2b)
We can now get the validated maternal weight changes for these 250 subjects.
# Take samples from true data phase2b_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2b, c("id","mat_weight_true")] # Merge extracted samples with our data. # Note we use dplyr's coalesce here to merge sampled mat_weight_true with # the mat_weight_true solumn already present in phase2b, but manner of # merging of samples may vary from project to project. # For simpler merges, see 'Multiwave Object' vignette. phase2b <- merge(phase2b, phase2b_samples, by = "id", no.dups = TRUE, all.x = TRUE) library(dplyr) phase2b$mat_weight_true <- dplyr::coalesce(phase2b$mat_weight_true.x, phase2b$mat_weight_true.y) phase2b <- subset(phase2b, select = -c(mat_weight_true.x, mat_weight_true.y)) table(is.na(phase2b$mat_weight_true)) # All are NA besides already sampled.
With these results in, we can move onto our final wave.
After auditing the 250 additional samples in Phase-IIb, we only have 250 left to validate until we reach our goal of 750. We will combine what we have learned in the previous waves of phase-II to optimally assign these final 250 samples in phase-IIc.
Following the same steps as in phase-IIb, we use allocate_wave()
to find the optimum sample sizes in each stratum based on estimates of the variance using our previous samples. We then determine which ids to sample using sample_strata()
.
phase2c <- phase2b # Add indicator for units that were already sampled phase2c$already_sampled <- ifelse(phase2b$id %in% c(ids_to_sample2a, ids_to_sample2b), 1, 0) # Make Design phase2c_design <- allocate_wave(data = phase2c, strata = "new_strata", y = "mat_weight_true", already_sampled = "already_sampled", nsample = 250, detailed = TRUE) phase2c_design # Find IDs to sample ids_to_sample2c <- sample_strata(data = phase2c, strata = "new_strata", id = "id", already_sampled = "already_sampled", design_data = phase2c_design, design_strata = "strata", n_allocated = "n_to_sample") ids_to_sample2c <- ids_to_sample2c[ids_to_sample2c$sample_indicator == 1,"id"] length(ids_to_sample2c) # Sample phase2c_samples <- MatWgt_Sim[MatWgt_Sim$id %in% ids_to_sample2c, c("id","mat_weight_true")] # Add samples to phase2c dataset phase2c <- merge(phase2c, phase2c_samples, by = "id", no.dups = TRUE, all.x = TRUE) phase2c$mat_weight_true <- dplyr::coalesce(phase2c$mat_weight_true.x, phase2c$mat_weight_true.y) phase2c <- subset(phase2c, select = -c(mat_weight_true.x, mat_weight_true.y))
We have now validated 750 samples!
As this example shows, there are quite a few moving parts to keep track of in the multi-wave sampling workflow. When performing a multi-wave sampling design, you may also want to consider using optimall's
multiwave object to organize and visualize the process. This feature is described in detail in the package vignette titled "The Multiwave Object".
optimall_shiny()
In many cases, deciding which strata to split and where is a difficult task. split_strata()
makes this job easier, but it is designed for situations where the strata and split values have already been decided by the user. Running it iteratively to experiment with different splits is possible yet tedious.
To help users make these difficult decisions, optimall
includes an R Shiny application that reacts in real time to user selections of splitting parameters. It can be called using optimall_shiny()
. See the screenshot below:
split_strata()
function are updated accordingly, and the resulting dataframe showing the optimum allocation of samples among the new strata is displayed. Once the user is satsfied with a set of inputs, they can confirm
the split and the underlying data will be updated. The code used to perform the split will also be displayed.
Note that the data in the Shiny App has to be loaded in from a file, and it is thus separate from data being used in the optimall
workflow. This means that the user has to return to their R session to make the changes determined in the Shiny app. The app, however, makes this easy by printing the code to perform all of the confirmed changes, so making them in R is as easy as copying and pasting. You may have to update the "data"
argument.
For more information on optimall_shiny
see the package vignette titled "Splitting Strata with Optimall Shiny".
McIsaac MA, Cook RJ. Adaptive sampling in two‐phase designs: a biomarker study for progression in arthritis. Statistics in Medicine. 2015 Sep 20;34(21):2899-912.
Wright, T. A simple method of exact optimal sample allocation under stratification with any mixed constraint patterns.2014; Statistics, 07.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.