`srvyr` compared to the `survey` package

The srvyr package aims to add dplyr like syntax to the survey package. This vignette focuses on how srvyr compares to the survey package, for more information about survey design and analysis, check out the vignettes in the survey package, or Thomas Lumley's book, Complex Surveys: A Guide to Analysis Using R.

Everything that srvyr can do, can also be done in survey. In fact, behind the scenes the survey package is doing all of the hard work for srvyr. srvyr strives to make your code simpler and more easily readable to you, especially if you are already used to the dplyr package.

Motivating example

The dplyr package has made it fast and easy to write code to summarize data. For example, if we wanted to check how the year-to-year change in academic progress indicator score varied by school level and percent of parents were high school graduates, we can do this:

library(survey)
library(ggplot2)
library(dplyr)

data(api)

out <- apistrat %>%
  mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE,
                           labels = c("<20%", "20+%"))) %>%
  group_by(stype, hs_grad_pct) %>%
  summarize(api_diff = weighted.mean(api00 - api99, pw),
            n = n())

ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)

However, if we wanted to add error bars to the graph to capture the uncertainty due to sampling variation, we have to completely rewrite the dplyr code for the survey package. srvyr allows a more direct translation.

Preparing a survey dataset

as_survey_design(), as_survey_rep() and as_survey_twophase() are analagous to survey::svydesign(), survey::svrepdesign() and survey::twophase() respectively. Because they are designed to match dplyr's style of non-standard evaluation, they accept bare column names instead of formulas (~). They also move the data argument first, so that it is easier to use magrittr pipes (%>%).

library(srvyr)

# simple random sample
srs_design_srvyr <- apisrs %>% as_survey_design(ids = 1, fpc = fpc)

srs_design_survey <- svydesign(ids = ~1, fpc = ~fpc, data = apisrs)

The srvyr functions also accept dplyr::select()'s special selection functions (such as starts_with(), one_of(), etc.), so these functions are analagous:

# selecting variables to keep in the survey object (stratified example)
strat_design_srvyr <- apistrat %>%
  as_survey_design(1, strata = stype, fpc = fpc, weight = pw,
                variables = c(stype, starts_with("api")))

strat_design_survey <- svydesign(~1, strata = ~stype, fpc = ~fpc,
                                 variables = ~stype + api99 + api00 + api.stu,
                                 weight = ~pw, data = apistrat)

The function as_survey() will automatically choose between the three as_survey_* functions based on the arguments, so you can save a few keystrokes.

# simple random sample (again)
srs_design_srvyr2 <- apisrs %>% as_survey(ids = 1, fpc = fpc)

Data manipulation

Once you've set up your survey data, you can use dplyr verbs such as mutate(), select(), filter() and rename().

strat_design_srvyr <- strat_design_srvyr %>%
  mutate(api_diff = api00 - api99) %>%
  rename(api_students = api.stu)

strat_design_survey$variables$api_diff <- strat_design_survey$variables$api00 -
  strat_design_survey$variables$api99
names(strat_design_survey$variables)[names(strat_design_survey$variables) == "api.stu"] <- "api_students"

Note that arrange() is not available, because the srvyr object expects to stay in the same order. Nor are two-table verbs such as full_join(), bind_rows(), etc. available to srvyr objects either because they may have implications on the survey design. If you need to use these functions, you should use them earlier in your analysis pipeline, when the objects are still stored as data.frames.

Summary statistics

Of the entire population

srvyr also provides summarize() and several survey-specific functions that calculate summary statistics on numeric variables: survey_mean(), survey_total(), survey_quantile() and survey_ratio(). These functions differ from their counterparts in survey because they always return a data.frame in a consistent format. As such, they do not return the variance-covariance matrix, and so are not as flexible.

# Using srvyr
out <- strat_design_srvyr %>%
  summarize(api_diff = survey_mean(api_diff, vartype = "ci"))

out

# Using survey
out <- svymean(~api_diff, strat_design_survey)

out
confint(out)

By group

srvyr also allows you to calculate statistics on numeric variables by group, using group_by().

# Using srvyr
strat_design_srvyr %>%
  group_by(stype) %>%
  summarize(api_increase = survey_total(api_diff >= 0),
            api_decrease = survey_total(api_diff < 0))

# Using survey
svyby(~api_diff >= 0, ~stype, strat_design_survey, svytotal)

Proportions by group

You can also calculate the proportion or count in each group of a factor or character variable by leaving x empty in survey_mean() or survey_total().

# Using srvyr
srs_design_srvyr %>%
  group_by(awards) %>%
  summarize(proportion = survey_mean(),
            total = survey_total())

# Using survey
svymean(~awards, srs_design_survey)
svytotal(~awards, srs_design_survey)

Unweighted calculations

Finally, the unweighted() function can act as an escape hatch to calculate unweighted calculations on the dataset.

# Using srvyr
strat_design_srvyr %>%
  group_by(stype) %>%
  summarize(n = unweighted(n()))

# Using survey
svyby(~api99, ~stype, strat_design_survey, unwtd.count)

Back to the example

So now, we have all the tools needed to create the first graph and add error bounds. Notice that the data manipulation code is nearly identical to the dplyr code, with a little extra set up, and replacing weighted.mean() with survey_mean.

strat_design <- apistrat %>%
  as_survey_design(strata = stype, fpc = fpc, weight  = pw)

out <- strat_design %>%
  mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE,
                           labels = c("<20%", "20+%"))) %>%
  group_by(stype, hs_grad_pct) %>%
  summarize(api_diff = survey_mean(api00 - api99, vartype = "ci"),
            n = unweighted(n()))

ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct,
                       ymax = api_diff_upp, ymin = api_diff_low)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_errorbar(position = position_dodge(width = 0.9), width = 0.1) +
  geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)

Grab Bag

Using survey functions on srvyr objects

Because srvyr objects are just survey objects with some extra structure, all of the functions from survey will still work with them. If you need to calculate something beyond simple summary statistics, you can use survey functions.

glm <- svyglm(api00 ~ ell + meals + mobility, design = strat_design)
summary(glm)

Standard evaluation

Srvyr now supports the standard evaluation conventions introduced in dplyr version 0.7 and rlang. If you'd like to use a function programmatically, you can use the functions from rlang like rlang::quo() or rlang::sym() to capture the expression and rlang::!! to unquote it. All of these functions are re-exported by srvyr, so you don't need to load the rlang library to use them.

Here's a quick example, but please see the dplyr vignette vignette("programming", package = "dplyr") for more details.

fpc_var <- sym("fpc")
srs_design_srvyr <- apisrs %>% as_survey_design(fpc = !!fpc_var)

grouping_var <- sym("stype")
api_diff <- quo(api00 - api99)

srs_design_srvyr %>%
  group_by(!!grouping_var) %>% 
  summarize(
    api_increase = survey_total((!!api_diff) >= 0),
    api_decrease = survey_total((!!api_diff) < 0)
  )

Srvyr will also follow dplyr's lead on deprecating the old method of NSE, the so-called "underscore functions" (like summarize_). Currently, they have been soft-deprecated, but I expect them to be removed altogether in some future version of srvyr.

Scoped functions

Srvyr has also been able to take advantage of the new-ish dplyr "scoped" variants of the main manipulation functions like summarize_at(). These functions allow you to quickly calculate variables for multiple variables. For example:

# Calculate survey mean for all variables that have names starting with "api"
strat_design %>%
  summarize_at(vars(starts_with("api")), survey_mean)

The dplyr documentation dplyr::scoped provides more details.



Try the srvyr package in your browser

Any scripts or data that you put into this service are public.

srvyr documentation built on May 22, 2018, 5:06 p.m.