In southwick-associates/sastats: Calculate Statistics

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(dplyr)
library(sastats)

Overview

Rake weighting in R is fairly straightforward using anesrake::anesrake(), for which sastats::rake_weight() is a thin wrapper. The key aspect of the process is getting the two input datasets into the right format:

The target population dataset needs to be a named list (one element per demographic variable) of target distributions. The weights::wpct() function helps with this.
The survey dataset needs to be a data frame with a unique ID variable (e.g., Vrid), and one variable for each demographic measure.

Importantly, demographic variables must be stored either as factors or numerics and the category coding much match between the two datasets. You can also see production examples for B4W-19-10 and AS/HS: rakewt-ashs.

Example Data

For demonstration, package sastats includes survey and population datasets which include several demographic variables:

library(dplyr)
library(sastats)

data(svy, pop)
svy <- select(svy$person, -weight)

# survey to be weighted
glimpse(svy)

# target population (outdoor recreationists) from a genpop survey
glimpse(pop)

Importantly, the demographic variables have been encoded in the same way between the 2 datasets (as factors in this case):

wtvars <- setdiff(names(svy), "Vrid")
sapply(wtvars, function(x) all(levels(svy[[x]]) == levels(pop[[x]])))

Population Distributions

The easiest way to see the format needed for the target population is to run weights::wpct() on a demographic variable:

weights::wpct(svy$age_weight)

For our example, we are defining the target population (outdoor recreationists) using another general population survey dataset. In this case, we need to ensure we use the stwt variable from that survey since it was itself weighted:

weights::wpct(pop$age_weight, pop$stwt)

The weights::wpct() function returns a vector, but we need a list of vectors (one element per demographic variables of interest). A straightforward method involves looping over the demographic variables with sapply():

pop_target <- sapply(wtvars, function(x) weights::wpct(pop[[x]], pop[["stwt"]]))
pop_target

If we weren't using a reference population dataset, we would need to get the target distributions into a list in some other way. For example, by hand:

partial_target <- list(
    "sex" = c("Male" = 0.517, "Female" = 0.483),
    "age_weight" = c("18-34" = 0.351, "35-54" = 0.349, "55+" = 0.300)
    # etc.
)
partial_target

Using weights::wpct() also of course provides a convenient method for comparison with the survey dataset to be weighted:

sapply(wtvars, function(x) weights::wpct(svy[[x]]))

Rake Weighting

We can now run the rake weighting procedure. By default, rake_weight() returns a list of 2 elements: (1) the survey dataset with the weight variable appended, and (2) the anesrake() return object, which includes a bunch of useful summary statistics (including "design effect", which may be useful for estimating confidence intervals).

svy_wts <- rake_weight(svy, pop_target, "Vrid")

svy <- svy_wts$svy
summary(svy$weight)

wt_summary <- summary(svy_wts$wts)
deff <- wt_summary$general.design.effect
deff

wt_summary