This report describes how to use stratified_norm() to create a stratified random sample, given an arbitrary number of factors. The function is available within the mmisc package on GitHub:

library('devtools')
install_github("mattsigal/mmisc")

Once mmisc has been downloaded, load the package as you would any other:

library('mmisc')

Before using the function directly, I will create a small dataset to use as an example. This dataset has three factor variables: Gender (2 levels), Age (3 levels), and Relationship (4 levels), as well as 3 "scale" variables (X, Y, and Z).

set.seed(77)
dat <- data.frame(Gender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
                  AgeGrp=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
                  Relationship=sample(c("Direct", "Manager", "Coworker", "Friend"), 
                                      size = 1500, replace = TRUE),
                  X=rnorm(n=1500, mean=0, sd=1),
                  Y=rnorm(n=1500, mean=0, sd=1),
                  Z=rnorm(n=1500, mean=0, sd=1))
str(dat)

stratified_norm() has 6 inputs:

stratified_norm(dat, strata, observations=0, return.grid=FALSE, full.data=FALSE, full.data.id="sampled")

1) dat: a data.frame object. 2) strata: a character vector indicating the strata variables. These need to match the variable names in the dataset. 3) observations: a numeric vector indicating how many cases to sample from each strata. If the length of this vector is 1, it will be repeated for each strata group (e.g., enter 5 to sample 5 cases from each combination.) 4) return.grid: logical, if TRUE will return the strata contingeny table. 5) full.data: logical, if TRUE will return the full dataset, otherwise will only return the sampled data. 6) full.data.id: used if full.data = TRUE, indicates the name of the vector added to the data.frame to indicate the observation was sampled.

Using stratified_norm()

First, we create our strata variable. For this dataset, the relevant factors are: Gender, AgeGroup, and Relationship. Note: the input order will affect the ordering of the contingency table!

strata = c("Gender", "AgeGrp", "Relationship")

Next, let's investigate the ordering of the variables:

head(stratified_norm(dat, strata, return.grid = TRUE), n = 14)

When Relationship is entered last, it actually is ordered first (e.g., the first 6 rows of the contingency table refer to Relationship - Direct). Of course, the factors can be entered in a different order.

Now that we know the order the variables are entered in, we can define our observations vector, or how many people we want from each combination.

samples <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)

If samples is a scalar, it will be recycled for the entire vector, otherwise it should be the same length as the number of rows in the contingency table. If it is longer or shorter, stratified_norm() will return an error. I recommend running this once with return.grid = TRUE to double check that the observations were entered correctly.

head(stratified_norm(dat = dat, strata = strata,
                    observations = samples, return.grid = TRUE), n = 14)

When we actually sample the data, we can have either the subset returned or the full dataset. Some warnings will be printed if there are less or equal numbers of counts per combination than there are observations in a particular category.

subset.data <- stratified_norm(dat, strata, samples, full.data = FALSE)

full.data <- stratified_norm(dat, strata, samples, full.data = TRUE)

str(subset.data)

str(full.data)

The return with full.data has an additional logical vector called "sampled", which indicates cases that were selected. We can check the cases using contingency tables:

ftable(xtabs(~Gender + AgeGrp + Relationship, data = subset.data))

Note, if you want the sample to be reproducible, you should include a set.seed() command first! Compare:

full.data1 <- stratified_norm(dat, strata, samples, full.data = TRUE)
full.data2 <- stratified_norm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)

set.seed(77)
full.data1 <- stratified_norm(dat, strata, samples, full.data = TRUE)
set.seed(77)
full.data2 <- stratified_norm(dat, strata, samples, full.data = TRUE)
identical(full.data1, full.data2)


mattsigal/mmisc documentation built on May 21, 2019, 1:26 p.m.