The package is intended for use in an educational setting when working with survey data. It provides tools for computing statistics from surveys that use simple random sampling or stratified designs. Plotting is simplified with a thin wrapper around ggplot2 functions, so a user can easily create attractive graphs. The package also includes functions for deriving the sampling distribution for a small population, along with probability demos covering the law of large numbers and the central limit theorem.
You can install the development version of surveyr from GitHub with:
# install.packages("devtools")
devtools::install_github("danjdrennan/surveyr")
The most basic element for studies is a simple random sample with a
known population size. With data from a simple random sample, one can
estimate the population mean, total, or proportion using mk_stat as
follows:
# This library is designed to work alongside tidy packages,
# especially tidyr, tibble, dplyr, and ggplot2
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tibble)
library(ggplot2)
library(surveyr)
# Generate a dataset for the README examples
set.seed(1)
d <- tibble::tibble(
  region = rep(1:5, 50),
  y = rnorm(250, 30 + 5*region, 5*sqrt(region))
)
N <- rpois(5, 100)
mk_stat(d$y, N=sum(N), stat="mean", fpc=TRUE)
#> # A tibble: 1 x 5
#>       n point   var    se    cv
#>   <int> <dbl> <dbl> <dbl> <dbl>
#> 1   250  45.0  54.9  7.41 0.165
The output summarizes the estimated population statistic: the point
estimate, the variance of the estimate, the standard error, and the
coefficient of variation. Each column describes the estimated
statistic, not the population. The fpc argument determines whether to
apply a finite population correction. In either case, the population
size, N, must be supplied.
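For readers who want to see where such columns can come from, here is a minimal base R sketch of the textbook simple random sample estimator of a mean. The helper srs_mean is hypothetical and is not part of surveyr; the package's own definitions may differ, so its values need not match the output above.

```r
# Hypothetical helper (not part of surveyr): textbook SRS estimator of a mean.
srs_mean <- function(y, N, fpc = TRUE) {
  n <- length(y)
  stopifnot(n >= 2, N >= n)
  point <- mean(y)
  # The finite population correction shrinks the variance as n approaches N
  correction <- if (fpc) 1 - n / N else 1
  v <- correction * var(y) / n   # estimated variance of the sample mean
  se <- sqrt(v)
  data.frame(n = n, point = point, var = v, se = se, cv = se / point)
}

y <- c(4, 7, 5, 9, 10)
srs_mean(y, N = 20, fpc = TRUE)
srs_mean(y, N = 20, fpc = FALSE)
```

Comparing the two calls shows the effect of the correction: with 5 of 20 units sampled, the variance with the correction is 0.75 times the variance without it.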
If the data are grouped by a stratification variable (fixed effect),
then a similar data summary can be obtained using make_summary.
# Tabulate summary statistics for a stratified probability sample
make_summary(.data=d, .group=region, .y=y, .group_N=N, .fpc = TRUE, .stat="mean")
#> # A tibble: 5 x 7
#>   region     N     n point   var    se    cv
#>    <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
#> 1      1   101    50  35.6  16.4  4.04 0.114
#> 2      2   104    50  41.1  20.4  4.52 0.110
#> 3      3    99    50  44.8  30.5  5.52 0.123
#> 4      4    97    50  48.1  37.6  6.13 0.127
#> 5      5   100    50  55.5  59.5  7.71 0.139
This summary table can be useful for comparing strata, but it is often an intermediate step toward estimating a statistic for the whole population. Using the pipe operator, we can compose functions to obtain the stratified estimate as follows:
d %>%
  make_summary(.group=region, .y=y, .group_N=N, .fpc = TRUE, .stat="mean") %>%
  stratified_stat(.stat="mean", .fpc=TRUE)
#> # A tibble: 1 x 6
#>   Ntotal ntotal point    var    se      cv
#>    <dbl>  <int> <dbl>  <dbl> <dbl>   <dbl>
#> 1    501    250  45.0 0.0650 0.255 0.00567
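The idea behind combining per-stratum summaries can be sketched in base R with the textbook stratified estimator of a mean, which weights each stratum mean by its share of the population. The helper strat_mean is hypothetical and not part of surveyr, and the formulas are the standard ones rather than a confirmed description of stratified_stat.

```r
# Hypothetical helper (not part of surveyr): textbook stratified mean estimator.
# N_h holds the stratum population sizes, ordered like sort(unique(group)).
strat_mean <- function(y, group, N_h, fpc = TRUE) {
  n_h <- tapply(y, group, length)   # stratum sample sizes
  ybar_h <- tapply(y, group, mean)  # stratum sample means
  s2_h <- tapply(y, group, var)     # stratum sample variances
  stopifnot(length(N_h) == length(n_h), all(N_h >= n_h))
  W_h <- N_h / sum(N_h)             # stratum weights
  correction <- if (fpc) 1 - n_h / N_h else 1
  point <- sum(W_h * ybar_h)
  v <- sum(W_h^2 * correction * s2_h / n_h)
  se <- sqrt(v)
  data.frame(Ntotal = sum(N_h), ntotal = sum(n_h),
             point = point, var = v, se = se, cv = se / point)
}

y <- c(1, 2, 3, 10, 12, 14)
g <- c(1, 1, 1, 2, 2, 2)
strat_mean(y, g, N_h = c(10, 30))
```

In this toy example the second stratum holds three quarters of the population, so the estimate of 9.5 sits much closer to that stratum's mean of 12 than to the unweighted sample mean of 7.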
Data plots can also be produced using s_plot, a thin wrapper on
ggplot2. The wrapper supports easy plotting of histograms and boxplots
for simple random samples or grouped data.
d %>% s_plot(y, region, kind="hist")
Demos of the Law of Large Numbers and the Central Limit Theorem are also provided. The Law of Large Numbers demo draws samples from a fixed Gamma distribution and offers the user no flexibility. The Central Limit Theorem demo lets the user choose the parent distribution: a binormal distribution, a uniform distribution, or a gamma distribution, with any choice of parameters. Simply calling the functions generates plots using ggplot2 as a backend, as shown next.
lln <- lln_demo()
lln$plot
clt <- clt_demo()
clt$plot
The central limit theorem demo also gives draws from the sampled distribution for visualizing what the parent distribution looked like, along with the theoretical parameters from the distribution, as can be seen below.
clt$data %>%
  as_tibble() %>%
  ggplot(aes(x=value)) +
  geom_histogram(bins=30, color="blue", fill="lightblue") +
  geom_vline(xintercept = clt$popmean, size=1.2) +
  labs(
    title = "CLT Parent Distribution",
    subtitle = "Binormal Distribution with means 30, 50 and variance 36",
    x = "X",
    y = "Frequency"
  )
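The two theorems behind the demos can also be illustrated without the package. The following base R sketch is not the implementation of lln_demo or clt_demo; the Gamma parameters, sample sizes, and seed are assumptions chosen only for illustration.

```r
set.seed(42)

# Law of large numbers: the running mean of Gamma(shape = 2, rate = 1) draws
# settles near the true mean, shape / rate = 2.
x <- rgamma(5000, shape = 2, rate = 1)
running_mean <- cumsum(x) / seq_along(x)
plot(running_mean, type = "l",
     xlab = "Number of draws", ylab = "Running mean")
abline(h = 2, lty = 2)

# Central limit theorem: means of samples of size 30 from the same skewed
# parent are roughly normal, centered at 2 with standard deviation
# sqrt(shape) / rate / sqrt(30).
sample_means <- replicate(2000, mean(rgamma(30, shape = 2, rate = 1)))
hist(sample_means, breaks = 30,
     main = "Sampling distribution of the mean", xlab = "Sample mean")
```

Increasing the sample size from 30 narrows the histogram and makes it look more symmetric, which is the behavior the CLT demo is meant to show.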