srvyr package aims to add
dplyr like syntax to the
This vignette focuses on how
srvyr compares to the
survey package, for more
information about survey design and analysis, check out the vignettes in the
package, or Thomas Lumley's book, Complex Surveys: A Guide to Analysis
srvyr can do, can also be done in
survey. In fact, behind
the scenes the
survey package is doing all of the hard work for
srvyr strives to make your code simpler and more easily readable to you,
especially if you are already used to the
dplyr package has made it fast and easy to write code to summarize data.
For example, if we wanted to check how the year-to-year change in academic
progress indicator score varied by school level and percent of parents were
high school graduates, we can do this:
library(survey) library(ggplot2) library(dplyr) data(api) out <- apistrat %>% mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE, labels = c("<20%", "20+%"))) %>% group_by(stype, hs_grad_pct) %>% summarize(api_diff = weighted.mean(api00 - api99, pw), n = n()) ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct)) + geom_col(stat = "identity", position = "dodge") + geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
However, if we wanted to add error bars to the graph to capture the uncertainty due to sampling
variation, we have to completely rewrite the
dplyr code for the
srvyr allows a more direct translation.
as_survey_twophase() are analogous
respectively. Because they are designed to match
dplyr's style of non-standard
evaluation, they accept bare column names instead of formulas (~). They also
move the data argument first, so that it is easier to use
library(srvyr) # simple random sample srs_design_srvyr <- apisrs %>% as_survey_design(ids = 1, fpc = fpc) srs_design_survey <- svydesign(ids = ~1, fpc = ~fpc, data = apisrs)
srvyr functions also accept
dplyr::select()'s special selection
functions (such as
one_of(), etc.), so these functions are
# selecting variables to keep in the survey object (stratified example) strat_design_srvyr <- apistrat %>% as_survey_design(1, strata = stype, fpc = fpc, weight = pw, variables = c(stype, starts_with("api"))) strat_design_survey <- svydesign(~1, strata = ~stype, fpc = ~fpc, variables = ~stype + api99 + api00 + api.stu, weight = ~pw, data = apistrat)
as_survey() will automatically choose between the three
functions based on the arguments, so you can save a few keystrokes.
# simple random sample (again) srs_design_srvyr2 <- apisrs %>% as_survey(ids = 1, fpc = fpc)
Once you've set up your survey data, you can use
dplyr verbs such as
strat_design_srvyr <- strat_design_srvyr %>% mutate(api_diff = api00 - api99) %>% rename(api_students = api.stu) strat_design_survey$variables$api_diff <- strat_design_survey$variables$api00 - strat_design_survey$variables$api99 names(strat_design_survey$variables)[names(strat_design_survey$variables) == "api.stu"] <- "api_students"
arrange() is not available, because the
srvyr object expects to
stay in the same order. Nor are two-table verbs such as
bind_rows(), etc. available to
srvyr objects either because they may have
implications on the survey design. If you need to use these functions, you
should use them earlier in your analysis pipeline, when the objects are still
srvyr also provides
summarize() and several survey-specific functions that
calculate summary statistics on numeric variables:
survey_ratio(). These functions differ from their
survey because they always return a data.frame in a consistent
format. As such, they do not return the variance-covariance matrix, and so are
not as flexible.
# Using srvyr out <- strat_design_srvyr %>% summarize(api_diff = survey_mean(api_diff, vartype = "ci")) out # Using survey out <- svymean(~api_diff, strat_design_survey) out confint(out)
srvyr also allows you to calculate statistics on numeric variables by group,
# Using srvyr strat_design_srvyr %>% group_by(stype) %>% summarize(api_increase = survey_total(api_diff >= 0), api_decrease = survey_total(api_diff < 0)) # Using survey svyby(~api_diff >= 0, ~stype, strat_design_survey, svytotal)
You can also calculate the proportion or count in each group of a factor
or character variable by leaving x empty in
# Using srvyr srs_design_srvyr %>% group_by(awards) %>% summarize(proportion = survey_mean(), total = survey_total()) # Using survey svymean(~awards, srs_design_survey) svytotal(~awards, srs_design_survey)
unweighted() function can act as an escape hatch to calculate unweighted
calculations on the dataset.
# Using srvyr strat_design_srvyr %>% group_by(stype) %>% summarize(n = unweighted(n())) # Using survey svyby(~api99, ~stype, strat_design_survey, unwtd.count)
So now, we have all the tools needed to create the first graph and add error bounds.
Notice that the data manipulation code is nearly identical to the
dplyr code, with a
little extra set up, and replacing
strat_design <- apistrat %>% as_survey_design(strata = stype, fpc = fpc, weight = pw) out <- strat_design %>% mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE, labels = c("<20%", "20+%"))) %>% group_by(stype, hs_grad_pct) %>% summarize(api_diff = survey_mean(api00 - api99, vartype = "ci"), n = unweighted(n())) ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct, ymax = api_diff_upp, ymin = api_diff_low)) + geom_col(stat = "identity", position = "dodge") + geom_errorbar(position = position_dodge(width = 0.9), width = 0.1) + geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
srvyr objects are just
survey objects with some extra structure,
all of the functions from
survey will still work with them. If you need to calculate
something beyond simple summary statistics, you can use
glm <- svyglm(api00 ~ ell + meals + mobility, design = strat_design) summary(glm)
Srvyr now supports the standard evaluation conventions introduced in dplyr version 0.7 and
rlang. If you'd like to use a function programmatically, you can use the functions from
rlang::sym() to capture the expression and
to unquote it. All of these functions are re-exported by srvyr, so you don't need
to load the rlang library to use them.
Here's a quick example, but please see the dplyr vignette
vignette("programming", package = "dplyr")
for more details.
fpc_var <- sym("fpc") srs_design_srvyr <- apisrs %>% as_survey_design(fpc = !!fpc_var) grouping_var <- sym("stype") api_diff <- quo(api00 - api99) srs_design_srvyr %>% group_by(!!grouping_var) %>% summarize( api_increase = survey_total((!!api_diff) >= 0), api_decrease = survey_total((!!api_diff) < 0) )
Srvyr will also follow dplyr's lead on deprecating the old method of NSE, the
so-called "underscore functions" (like
summarize_). Currently, they have been
soft-deprecated, but I expect them to be removed altogether in some future version
Srvyr has also been able to take advantage of the new-ish dplyr "scoped" variants of
the main manipulation functions like
summarize_at(). These functions allow you to
quickly calculate summary statistics for multiple variables. For example:
# Calculate survey mean for all variables that have names starting with "api" strat_design %>% summarize_at(vars(starts_with("api")), survey_mean)
The dplyr documentation
dplyr::scoped provides more details.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.