knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
library(svydb)
Most current software for survey analysis reads the data into memory; however, most of these computations can actually be expressed as database operations. The aim of this package is to provide a set of functions that allow survey statistics to be computed directly inside a database. To do this, dplyr and dbplyr are used.
This package also takes a step further: it computes graphics directly inside the database, reading data tables into memory only when necessary.
As noted by Professor Lumley here, this "approach does seem to be useful for large survey data sets – and for smaller data sets the dplyr version is faster than the survey package, though more limited."
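The core idea can be sketched with plain dplyr (an illustrative sketch, not svydb's actual internals): a design-weighted total sum(w * x) is just a summary expression, and dbplyr translates the same pipeline to SQL when the table lives in a database.

```r
library(dplyr)
# Illustrative sketch only: a weighted total of DirectChol from the
# package's NHANES example data, using the survey weight WTMEC2YR.
# If `nhane` were a dbplyr table, this same pipeline would run as SQL
# inside the database instead of in memory.
data(nhane, package = "svydb")
nhane %>%
  summarise(total = sum(WTMEC2YR * DirectChol, na.rm = TRUE))
```

Note that this sketch omits the design information (strata, clusters) that svydb uses to compute correct standard errors.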
This package can be installed using the devtools package.
devtools::install_github("chrk623/svydb")
Currently, there are some warning messages about deprecated functions in rlang. At the moment these should not affect svydb; updates will be made soon.
This package is still under development; if you find any bugs, please report them to Issues.
Note that databases do not understand what a factor is. Therefore, if a variable is a factor, this information needs to be passed to the function via the num argument (num = TRUE for a numeric variable, num = FALSE for a factor).
library(svydb)
data(nhane)
nh.dbsurv = svydbdesign(st = SDMVSTRA, wt = WTMEC2YR, id = SDMVPSU, data = nhane)
svydbtotal(x = DirectChol, design = nh.dbsurv, num = T)
my_mean = svydbmean(x = Race3, design = nh.dbsurv, num = F)
my_mean
coef(my_mean)
SE(my_mean)
In svydblm, factors are denoted by wrapping the variable with factor().
fit.dbsurv = svydblm(DirectChol ~ Age + BMI + factor(Gender), design = nh.dbsurv)
summary(fit.dbsurv)
coef(fit.dbsurv)
SE(fit.dbsurv)
vcov(fit.dbsurv)
Replicate weights, along with some other survey statistics, are also supported in svydb; please read the help pages for more information.
All graphics produced by svydb use ggplot2, which means users can customise their plots with existing ggplot2 functions, for example adding a title with ggtitle().
svydbhist(x = DirectChol, design = nh.dbsurv)
svydbhist(x = Age, design = nh.dbsurv, binwidth = 3)
p = svydbboxplot(x = Weight, groups = Race3, design = nh.dbsurv,
  outlier = T, all.outlier = T, varwidth = T)
p + ggtitle("Weight by Race3 Boxplot") + theme_bw()
hb = svydbhexbin(Height ~ Weight, design = nh.dbsurv)
svydbhexplot(hb)
To use svydb functions with a SQL connection, it is as simple as creating a svydbdesign from a SQL table in R. Currently, only MonetDB has been tested. In theory, svydb should work with other databases, as long as they support enough of the functions that svydb requires, or at the very least with some minor tweaking.
To do this,
# devtools::install_github("hannesmuehleisen/MonetDBLite-R")
# install.packages(c("DBI", "dbplyr"))
library(MonetDBLite)
library(DBI)
library(dbplyr)
con = dbConnect(MonetDBLite())
dbWriteTable(con, "nhane", nhane)
nhane.db = tbl(con, "nhane")
Then, for example:
nh2.dbsurv = svydbdesign(st = SDMVSTRA, wt = WTMEC2YR, id = SDMVPSU, data = nhane.db)
svydbtotal(x = DirectChol, design = nh2.dbsurv, num = T)
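To confirm that a computation is being pushed down to the database, dplyr's show_query() (a generic dbplyr feature, not part of svydb) prints the SQL generated for a lazy table:

```r
library(dplyr)
# Inspect the SQL that dbplyr generates for a summary on the remote
# table. This illustrates the translation mechanism svydb builds on;
# the summary here is a hypothetical example, not an svydb call.
nhane.db %>%
  summarise(total = sum(WTMEC2YR * DirectChol, na.rm = TRUE)) %>%
  show_query()
```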
Please note that MonetDBLite is no longer available on CRAN. Therefore, if you wish to continue using MonetDB, you must install it via GitHub. More information can be found here.
The following timings compare the speed of the survey package and svydb. The red line represents the survey package, the green line svydb with a local data frame, and the blue line svydb with a database connection.
Multistage surveys can give rise to moderately large data sets (tens of millions of rows). Most current software for survey analysis reads the data into memory; the survey package in R provides fairly comprehensive analysis features for complex surveys that are small enough to fit into memory easily. However, most of the computations can actually be expressed as database operations. A similar approach already exists in the sqlsurvey package in R, which performs substantial computation in SQL inside the database and imports only small summary tables into R. That approach scales to very large surveys, such as the American Community Survey and the Nationwide Emergency Department Sample, but it causes compatibility issues across different types of databases. Therefore, in this project I implement and test R functions for survey computations using the dplyr and dbplyr packages as a database interface.
Supervisor: Professor Thomas Lumley
Thomas wrote something about this package here.