The srvyr
package adds dplyr
like syntax to the survey
package.
This vignette focuses on how srvyr
compares to the survey
package, for more
information about survey design and analysis, check out the vignettes in the survey
package, or Thomas Lumley's book, Complex Surveys: A Guide to Analysis
Using R. (Also see the bottom
of this document for some more resources).
Everything that srvyr
can do, can also be done in survey
. In fact, behind
the scenes the survey
package is doing all of the hard work for srvyr
.
srvyr
strives to make your code simpler and more easily readable to you,
especially if you are already used to the dplyr
package.
The dplyr
package has made it easy to write code to summarize data.
For example, if we wanted to check how the year-to-year change in academic
progress indicator score varied by school level and percent of parents were
high school graduates, we can do this:
library(survey) library(ggplot2) library(dplyr) data(api) out <- apistrat %>% mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE, labels = c("<20%", "20+%"))) %>% group_by(stype, hs_grad_pct) %>% summarize(api_diff = weighted.mean(api00 - api99, pw), n = n()) ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct)) + geom_col(stat = "identity", position = "dodge") + geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
However, if we wanted to add error bars to the graph to capture the uncertainty due to sampling
variation, we have to completely rewrite the dplyr
code for the survey
package.
srvyr
allows a more direct translation.
as_survey_design()
, as_survey_rep()
and as_survey_twophase()
are analogous
to survey::svydesign()
, survey::svrepdesign()
and survey::twophase()
respectively. Because they are designed to match dplyr
's style of non-standard
evaluation, they accept bare column names instead of formulas (~). They also
move the data argument first, so that it is easier to use magrittr
pipes
(%>%
).
library(srvyr) # simple random sample srs_design_srvyr <- apisrs %>% as_survey_design(ids = 1, fpc = fpc) srs_design_survey <- svydesign(ids = ~1, fpc = ~fpc, data = apisrs)
The srvyr
functions also accept dplyr::select()
's special selection
functions (such as starts_with()
, one_of()
, etc.), so these functions are
analogous:
# selecting variables to keep in the survey object (stratified example) strat_design_srvyr <- apistrat %>% as_survey_design(1, strata = stype, fpc = fpc, weight = pw, variables = c(stype, starts_with("api"))) strat_design_survey <- svydesign(~1, strata = ~stype, fpc = ~fpc, variables = ~stype + api99 + api00 + api.stu, weight = ~pw, data = apistrat)
The function as_survey()
will automatically choose between the three as_survey_*
functions based on the arguments, so you can save a few keystrokes.
# simple random sample (again) srs_design_srvyr2 <- apisrs %>% as_survey(ids = 1, fpc = fpc)
Once you've set up your survey data, you can use dplyr
verbs such as mutate()
,
select()
, filter()
and rename()
.
strat_design_srvyr <- strat_design_srvyr %>% mutate(api_diff = api00 - api99) %>% rename(api_students = api.stu) strat_design_survey$variables$api_diff <- strat_design_survey$variables$api00 - strat_design_survey$variables$api99 names(strat_design_survey$variables)[names(strat_design_survey$variables) == "api.stu"] <- "api_students"
Note that arrange()
is not available, because the srvyr
object expects to
stay in the same order. Nor are two-table verbs such as full_join()
,
bind_rows()
, etc. available to srvyr
objects either because they may have
implications on the survey design. If you need to use these functions, you
should use them earlier in your analysis pipeline, when the objects are still
stored as data.frame
s.
srvyr
also provides summarize()
and several survey-specific functions that
calculate summary statistics on numeric variables: survey_mean()
, survey_total()
,
survey_quantile()
and survey_ratio()
. These functions differ from their
counterparts in survey
because they always return a data.frame in a consistent
format. As such, they do not return the variance-covariance matrix, and so are
not as flexible.
# Using srvyr out <- strat_design_srvyr %>% summarize(api_diff = survey_mean(api_diff, vartype = "ci")) out # Using survey out <- svymean(~api_diff, strat_design_survey) out confint(out)
srvyr
also allows you to calculate statistics on numeric variables by group,
using group_by()
.
# Using srvyr strat_design_srvyr %>% group_by(stype) %>% summarize(api_increase = survey_total(api_diff >= 0), api_decrease = survey_total(api_diff < 0)) # Using survey svyby(~api_diff >= 0, ~stype, strat_design_survey, svytotal)
You can also calculate the proportion or count in each group of a factor
or character variable by leaving x empty in survey_mean()
or survey_total()
.
# Using srvyr srs_design_srvyr %>% group_by(awards) %>% summarize(proportion = survey_mean(), total = survey_total()) # Using survey svymean(~awards, srs_design_survey) svytotal(~awards, srs_design_survey)
Finally, the unweighted()
function can act as an escape hatch to calculate unweighted
calculations on the dataset.
# Using srvyr strat_design_srvyr %>% group_by(stype) %>% summarize(n = unweighted(n())) # Using survey svyby(~api99, ~stype, strat_design_survey, unwtd.count)
So now, we have all the tools needed to create the first graph and add error bounds.
Notice that the data manipulation code is nearly identical to the dplyr
code, with a
little extra set up, and replacing weighted.mean()
with survey_mean
.
strat_design <- apistrat %>% as_survey_design(strata = stype, fpc = fpc, weight = pw) out <- strat_design %>% mutate(hs_grad_pct = cut(hsg, c(0, 20, 100), include.lowest = TRUE, labels = c("<20%", "20+%"))) %>% group_by(stype, hs_grad_pct) %>% summarize(api_diff = survey_mean(api00 - api99, vartype = "ci"), n = unweighted(n())) ggplot(data = out, aes(x = stype, y = api_diff, group = hs_grad_pct, fill = hs_grad_pct, ymax = api_diff_upp, ymin = api_diff_low)) + geom_col(stat = "identity", position = "dodge") + geom_errorbar(position = position_dodge(width = 0.9), width = 0.1) + geom_text(aes(y = 0, label = n), position = position_dodge(width = 0.9), vjust = -1)
For the most part, srvyr
tries to be a drop-in replacement for the survey package, only
changing the syntax that you wrote. However, the way that calculations of degrees of
freedom when calculating confidence intervals is different.
srvyr
assumes that you want to use the true degrees of freedom by default, but the survey
package uses Inf
as the default. You can use the argument df
to get the same result
as the survey package.
# Set pillar print methods so tibble has more decimal places old_sigfig <- options("pillar.sigfig") options("pillar.sigfig" = 5) # survey default estimate <- svymean(~api99, strat_design) confint(estimate) # srvyr default strat_design %>% summarize(x = survey_mean(api99, vartype = "ci")) # setting the degrees of freedom so srvyr matches survey default strat_design %>% summarize(x = survey_mean(api99, vartype = "ci", df = Inf)) %>% print() # setting the degrees of freedom so survey matches survey default confint(estimate, df = degf(strat_design)) # reset significant figures options("pillar.sigfig" = old_sigfig)
survey
functions on srvyr
objectsBecause srvyr
objects are just survey
objects with some extra structure,
all of the functions from survey
will still work with them. If you need to calculate
something beyond simple summary statistics, you can use survey
functions.
glm <- svyglm(api00 ~ ell + meals + mobility, design = strat_design) summary(glm)
Like dplyr
, srvyr
allows you to use expressions in the arguments,
allowing you to create variables in a single step. For example, you can
use expressions:
1) as the arguments inside the survey statistic functions like survey_mean
strat_design %>% summarize(prop_api99_over_700 = survey_mean(api99 > 700))
2) as an argument to summarize
strat_design %>% group_by(awards) %>% summarize(percentage = 100 * survey_mean())
3) and you can even create variables inside of group_by
strat_design %>% group_by(api99_above_700 = api99 > 700) %>% summarize(api00_mn = survey_mean(api00))
Though on-the-fly expressions are syntactically valid, it is possible to make statistically invalid numbers from them. For example, though the standard error and confidence intervals can be multiplied by a scalar (like 100), the variance does not scale the same way, so the following is invalid:
# BAD DON'T DO THIS! strat_design %>% group_by(awards) %>% summarize(percentage = 100 * survey_mean(vartype = "var")) # VARIANCE IS WRONG
Srvyr supports the non-standard evaluation conventions that dplyr uses.
If you'd like to use a function programmatically, you can use the functions from
rlang like the {{
operator (aka "curly curly") from rlang
.
Here's a quick example, but please see the dplyr vignette
vignette("programming", package = "dplyr")
for more details.
mean_with_ci <- function(.data, var) { summarize(.data, mean = survey_mean({{var}}, vartype = "ci")) } srs_design_srvyr <- apisrs %>% as_survey_design(fpc = fpc) mean_with_ci(srs_design_srvyr, api99)
Srvyr will also follow dplyr's lead on deprecating the old methods of NSE,
such as rlang::quo
, and !!
, in addition to the so-called "underscore functions"
(like summarize_
). Currently, they have been soft-deprecated, they may be
removed altogether in some future version of srvyr.
As of version 1.0 of srvyr, it supports dplyr's across function, so when you want
to calculate a statistic on more than one variable, it is easy to do so.
See vignette("colwise", package = "dplyr")
for more details, but here is another quick example:
# Calculate survey mean for all variables that have names starting with "api" strat_design %>% summarize(across(starts_with("api"), survey_mean))
Srvyr also supports older methods of working column-wise, the "scoped variants", such as
summarize_at
, summarize_if
, summarize_all
and summarize_each
. Again, these are
maintained for backwards compatibility, matching what the tidyverse team has done, but
may be removed from a future version.
You can calculate the weighted proportion that falls into a group using the
survey_prop()
function (or the survey_mean()
function with no x
argument).
The proportion is calculated by "unpeeling" the last variable used in group_by()
and then calculating the proportion within the other groups that fall into the last
group (so that the proportion within each group that was unpeeled sums to 100%).
# Calculate the proportion that falls into each category of `awards` per `stype` strat_design %>% group_by(stype, awards) %>% summarize(prop = survey_prop())
If you want to calculate the proportion for groups from multiple variables at the same
time that add up to 100%, the interact
function can help. The interact
function
creates a variable that is automatically split apart so that more than one variable
can be unpeeled.
# Calculate the proportion that falls into each category of both `awards` and `stype` strat_design %>% group_by(interact(stype, awards)) %>% summarize(prop = survey_prop())
Here are some free resources put together by the community about srvyr:
dplyr::across
and rlang
's "curly curly" {{}}
)Still need help?
I think the best way to get help is to form a specific question and ask it in some place like posit's community website (known for it's friendly community) or stackoverflow.com (maybe not known for being quite as friendly, but probably has more people). If you think you've found a bug in srvyr's code, please file an issue on GitHub, but note that I'm not a great resource for helping specific issue, both because I have limited capacity but also because I do not consider myself an expert in the statistical methods behind survey analysis.
Have something to add?
These resources were mostly found via vanity searches on twitter & github. If you know of anything I missed, or have written something yourself, please let me know in this GitHub issue!
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.