In southwick-associates/sastats: Calculate Statistics

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(ggplot2)
library(dplyr)
library(sastats)

Overview

Package sastats includes a few convenience functions for estimating survey sampling errors. These are simple calculations, and the functions are just thin wrappers for a bit of code. Relevant measures:

Standard error of the mean: error_se_mean()
Standard error of a proportion: error_se_prop()
Margin of error (for confidence intervals): error_me()

Notes

Package survey provides a much more comprehensive approach to survey-based calculations (errors, weighting, etc.). I've tended toward the light-weight approach outlined here, but it could be worth looking into if we are doing alot of survey analysis in R.

Caculating errors for multi-estimate metrics requires error propagation (e.g., total days which depends on participation rate & average days) . I haven't implemented these computations here, but existing packages address this need. Package propagate is one I've used, but have found tricky to implement. Package errors is newer and appears more straightforward (although I haven't tested it).

Example Data

For demonstration, package sastats includes a survey dataset with annual participation metrics for 9 outdoor recreation activities:

library(dplyr)
library(sastats)
data(svy)

activity <- left_join(svy$act, select(svy$person, Vrid, weight), by = "Vrid")
glimpse(activity)

SE Mean

Looking at days of participation:

days <- activity %>%
    group_by(act) %>% 
    summarise(
        avgdays = weighted.mean(days, weight, na.rm = TRUE),
        se = error_se_mean(days, na.rm = TRUE)
)
days

SE Proportion

Looking at participation rate:

rate <- activity %>%
    group_by(act, part) %>%
    summarise(n = n(), wtn = sum(weight)) %>%
    mutate(
        n = sum(n), 
        rate = wtn / sum(wtn),
        se = error_se_prop(rate, n)
    ) %>%
    filter(part == "Checked")
rate

Margin of Error

These are useful for reporting confidence intervals. Note that the dataset weighting produces a "design effect" which inflates the margin of error. I know the design effect for this dataset, based on the summary output produced by the sastats::rake_weight() procedure.

deff <- 1.19
rate <- mutate(rate, me = error_me(se) * deff, lower = rate - me, upper = rate + me)
days <- mutate(days, me = error_me(se) * deff, lower = avgdays - me, upper = avgdays + me)

library(ggplot2)
ggplot(rate, aes(act, rate)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper)) +
  ggtitle("Participation Rate Estimates (with 95% CIs)")

ggplot(days, aes(act, avgdays)) +
  geom_point() +
  geom_errorbar(aes(ymin = lower, ymax = upper)) +
  ggtitle("Average Days (per participant) Estimates (with 95% CIs)")