knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
library(svydb)

svydb - Survey statistics in a database

Introduction

Most current software for survey analysis reads the data into memory, however, most of these computations can actually be expressed as database operations. The aim of this package is to provide a set of functions which allows survey statistics to be computed directly inside a database. To do this, dplyr and dbplpy are used. Lastly, this package takes a step further, it also provides graphics computation directly inside a database, where data tables are read into memory only when necessary.

As noted by Professor Lumley here. This "approach does seem to be useful for large survey data sets – and for smaller data sets the dplyr version is faster than the survey package, though more limited."

Installation

This packages can be installed by using the devtools package.

devtools::install_github("chrk623/svydb")

Currently, there are some warning messages regarding depreciated functions in rlang, at the moment this should not effect svydb, updates will be made soon.

This package is still under development, if there are any BUGS please report to Issues.

Examples

Survey Statistics

Survey Total

Note that databases does not understand what a factor is. Therefore, if a variable is factor, this information needs to be provided to the function via the num argument.

library(svydb)
data(nhane)
nh.dbsurv = svydbdesign(st = SDMVSTRA, wt = WTMEC2YR, id = SDMVPSU, data = nhane)
svydbtotal(x = DirectChol, design = nh.dbsurv, num = T)

Survey Mean

my_mean = svydbmean(x = Race3, design = nh.dbsurv, num = F)
my_mean
coef(my_mean)
SE(my_mean)

Regression

In svydblm we denote factors by wrapping the variable with factor().

fit.dbsurv = svydblm(DirectChol ~ Age + BMI + factor(Gender), design = nh.dbsurv)
summary(fit.dbsurv)
coef(fit.dbsurv)
SE(fit.dbsurv)
vcov(fit.dbsurv)

Replicated weights along with some other survey statistics are also supported in svydb, please read the help page for more information.

Graphics

All graphics produced by svydb uses ggplot2, this means that users can customise their plot with existing ggplot2 functions. For example, add a title with ggtitle().

Histogram

svydbhist(x = DirectChol, design = nh.dbsurv)
svydbhist(x = Age, design = nh.dbsurv, binwidth = 3)

Boxplot

p = svydbboxplot(x = Weight, groups = Race3, design = nh.dbsurv,
     outlier = T, all.outlier = T, varwidth = T)
p + ggtitle("Weight by Race3 Boxplot") + theme_bw()

Hexagon Binning

hb = svydbhexbin(Height ~ Weight, design = nh.dbsurv)
svydbhexplot(hb)

SQL Usage

To use svydb functions with a SQL connection, it is as simple as creating a svydbdesign with a SQL table in R. Currently, only MonetDB has been tested. In theory, it should work with other databases as long as it provides enough supported functions that svydb requires. Or, at the very least these database should work with svydb with some minor tweaking.

To do this,

# devtools::install_github("hannesmuehleisen/MonetDBLite-R")
# install.packages(c("DBI", "dbplyr"))
library(MonetDBLite)
library(DBI)
library(dbplyr)
con = dbConnect(MonetDBLite())
dbWriteTable(con, "nhane", nhane)
nhane.db = tbl(con, "nhane")

Then for example,

nh2.dbsurv = svydbdesign(st = SDMVSTRA, wt = WTMEC2YR, id = SDMVPSU, data = nhane.db)
svydbtotal(x = DirectChol, design = nh2.dbsurv, num = T)

Please note that, MonetDBLite is no longer available on CRAN. Therefore, if one wishes to continue using MonetDB, you must install it via Github. More information can be found here.

Timings

The following timings compares the speed of the survey package and svydb. The red line represents the survey pacakge, green line represents svydb with a local data frame, and the blue line represents svydb with a database connection.

Survey Total Timing

Survey Mean Timing

Regression Timing

Histogram Timing

Boxplot Timing

Hexagon Binning Timing

Abstract

Multistage surveys can give rise to moderately large data sets (tens of millions of rows). Most current software for survey analysis reads the data into memory, the survey package in R provides fairly comprehensive analysis features for complex surveys which are small enough to fit into memory easily, however, most of the computations can actually be expressed as database operations. There is already a similar approach with the sqlsurvey package in R which performs substantial computation in SQL in the database, importing only small summary tables into R, this approach scales to very large surveys such as the American Community Survey and the Nationwide Emergency Department Sample, but this approach causes compatible issues with different types of databases. Therefore, in this project I will work on implementing R functions and testing some survey computations using the dplyr and dbplyr R package as a database interface.

Supervisor: Professor Thomas Lumley

Thomas wrote something about this package here.



chrk623/svydb documentation built on Oct. 6, 2019, 6:54 a.m.