analyze | R Documentation |
Calculation of point estimates and associated margin of error for analyses using fused/synthetic microdata. Can calculate means, proportions, sums, counts, and medians, optionally across population subgroups.
analyze(
x,
implicates,
static = NULL,
weight = NULL,
rep_weights = NULL,
by = NULL,
fun = NULL,
var_scale = 4,
cores = 1
)
x |
List. Named list specifying the desired analysis type(s) and the associated target variable(s). Example: |
implicates |
Data frame. Implicates of synthetic (fused) variables. Typically generated by fuse. The implicates should be row-stacked and identified by integer column "M". |
static |
Data frame. Optional static (non-synthetic) variables that do not vary across implicates. Note that |
weight |
Character. Name of the observation weights column in |
rep_weights |
Character. Optional vector of replicate weight columns in |
by |
Character. Optional column name(s) in |
fun |
Function. Optional function applied to input data prior to executing analyses. Can be used to do non-conventional/custom analyses. |
var_scale |
Scalar. Factor by which to scale the unadjusted replicate weight variance. This is determined by the survey design. The default ( |
cores |
Integer. Number of cores used. Only applicable on Unix systems. |
At a minimum, the user must supply synthetic implicates (typically generated by fuse). Inputs are checked for consistent dimensions.
If implicates contains only a single implicate and rep_weights = NULL, the "typical" standard error is returned with a warning to make sure the user is aware of the situation.
Estimates and standard errors for the requested analysis are calculated separately for each implicate. The final point estimate is the mean estimate across implicates. The final standard error is the pooled SE across implicates, calculated using Rubin's pooling rules (1987).
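The pooling step described above can be sketched in a few lines of R. This is a minimal illustration of Rubin's rules under stated assumptions, not the package's internal code: the function name pool_rubin and the per-implicate values in est and se are hypothetical, and the conversion of the pooled SE into a margin of error is not shown.

```r
# Hypothetical per-implicate results (one estimate and one SE per implicate)
est <- c(10.2, 9.8, 10.5, 10.1, 9.9)
se  <- c(0.50, 0.48, 0.52, 0.49, 0.51)

# Pool estimates across implicates using Rubin's rules
# Requires at least two implicates; var() returns NA for a single value
pool_rubin <- function(est, se) {
  m     <- length(est)        # number of implicates
  q_bar <- mean(est)          # final point estimate: mean across implicates
  u_bar <- mean(se^2)         # average within-implicate variance
  b     <- var(est)           # between-implicate variance
  t_var <- u_bar + (1 + 1 / m) * b   # total variance
  list(est = q_bar, se = sqrt(t_var))
}

pooled <- pool_rubin(est, se)
pooled$est
pooled$se
```

The pooled SE is never smaller than the average within-implicate SE, because the between-implicate term adds the uncertainty that comes from the fusion process itself.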
When replicate weights are provided, the standard errors of each implicate are calculated via the variance of estimates across replicates. Calculations leverage data.table operations for speed and memory efficiency. The within-implicate variance is calculated around the point estimate (rather than around the mean of the replicates). This is equivalent to mse = TRUE in svrepdesign. This seems to be the appropriate method for most surveys.
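As a rough illustration of the replicate-weight calculation for a weighted mean, the sketch below computes the variance of replicate estimates around the full-sample point estimate and scales it by var_scale. It assumes the "unadjusted" variance is the mean squared deviation of the replicate estimates; the function name rep_weight_variance and the simulated data are hypothetical, and the package's data.table implementation may differ in detail.

```r
# Variance of a weighted mean from replicate weights,
# centered on the point estimate (analogous to mse = TRUE)
rep_weight_variance <- function(y, w, rep_w, var_scale = 4) {
  theta   <- weighted.mean(y, w)   # point estimate from the primary weights
  theta_r <- apply(rep_w, 2, function(wr) weighted.mean(y, wr))  # one per replicate
  # Assumption: unadjusted variance = mean squared deviation around theta
  var_scale * mean((theta_r - theta)^2)
}

# Simulated data, for illustration only
set.seed(1)
n     <- 200
y     <- rnorm(n, mean = 50, sd = 10)
w     <- runif(n, 0.5, 2)
rep_w <- sapply(1:20, function(r) w * runif(n, 0.5, 1.5))  # n x 20 matrix

se <- sqrt(rep_weight_variance(y, w, rep_w))
se
```

Centering on the point estimate rather than on the mean of the replicate estimates can only increase the computed variance, which makes it the more conservative of the two choices.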
If replicate weights are NOT provided, the standard errors of each implicate are calculated using variance within the implicate. For means, the ratio variance approximation of Cochran (1977) is used, as this is known to be a good approximation of bootstrapped SEs for weighted means (Gatz and Smith 1995). For proportions, a generalization of the unweighted SE formula is used.
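For reference, the ratio variance approximation for the SE of a weighted mean, in the form commonly attributed to Cochran (1977) and presented by Gatz and Smith (1995), can be sketched as follows. The function name cochran_se and the example values are hypothetical, and this is not necessarily how the package computes it internally.

```r
# SE of a weighted mean via the ratio variance approximation
cochran_se <- function(x, w) {
  n     <- length(x)
  w_bar <- mean(w)                 # mean weight
  x_bar <- weighted.mean(x, w)     # weighted mean of x
  a <- w * x - w_bar * x_bar
  b <- w - w_bar
  v <- n / ((n - 1) * sum(w)^2) *
    (sum(a^2) - 2 * x_bar * sum(a * b) + x_bar^2 * sum(b^2))
  sqrt(v)
}

# Illustrative values only
x <- c(2, 4, 4, 5, 7, 9)
w <- c(1.0, 0.5, 2.0, 1.5, 1.0, 0.8)
cochran_se(x, w)

# With equal weights the result reduces to the usual sd(x) / sqrt(n)
cochran_se(x, rep(1, length(x)))
sd(x) / sqrt(length(x))
```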
A data.table reporting analysis results, possibly across subgroups defined in by. The returned quantities include:
Number of observations used for the analysis.
Target variable.
Levels of factor target variables.
Type of estimate returned: mean, proportion, sum, count, or median.
Point estimate.
Margin of error associated with the 90% confidence interval.
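Because the margin of error corresponds to a 90% confidence interval, interval bounds can be recovered by adding and subtracting it from the point estimate. The sketch below uses a small hypothetical result table with columns est and moe (the column names used in the examples); the values are made up for illustration.

```r
# Hypothetical analysis output with point estimates and margins of error
res <- data.frame(est = c(100, 250), moe = c(5, 20))

# Bounds of the 90% confidence interval
res$lower <- res$est - res$moe
res$upper <- res$est + res$moe

# Relative uncertainty, useful for comparing subgroups of different size
res$rel_moe <- res$moe / res$est
res
```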
Cochran, W. G. (1977). Sampling Techniques (3rd Edition). Wiley, New York.
Gatz, D.F., and Smith, L. (1995). The Standard Error of a Weighted Mean Concentration — I. Bootstrapping vs Other Methods. Atmospheric Environment, vol. 29, no. 11, 1185–1193.
Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.
# Build a fusion model using RECS microdata
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)
# Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient
sim <- fuse(data = recs, fsn = fsn.path, M = 30)
head(sim)
#---------
# Multiple types of analyses can be done at once
# This calculates estimates using the full sample
result <- analyze(x = list(mean = c("natural_gas", "aircon"),
median = "electricity",
sum = c("electricity", "aircon")),
implicates = sim,
weight = "weight")
View(result)
#-----
# Mean electricity consumption, by climate zone and urban/rural status
result1 <- analyze(x = list(mean = "electricity"),
implicates = sim,
static = recs,
weight = "weight",
by = c("climate", "urban_rural"))
# Same as above but including sample weight uncertainty
# Note that only the first 30 replicate weights are used internally
result2 <- analyze(x = list(mean = "electricity"),
implicates = sim,
static = recs,
weight = "weight",
rep_weights = paste0("rep_", 1:96),
by = c("climate", "urban_rural"))
# Helper function for comparison plots
pfun <- function(x, y) {plot(x, y); abline(0, 1, lty = 2)}
# Inclusion of replicate weights does not affect estimates, but it does
# increase margin of error due to uncertainty in RECS sample weights
pfun(result1$est, result2$est)
pfun(result1$moe, result2$moe)
# Notice that relative uncertainty declines with subset size
plot(result1$N, result1$moe / result1$est)
#-----
# Use a custom function to perform more complex analyses
# Custom function should return a data frame with non-standard target variables
my_fun <- function(data) {
  # Manipulate 'data' as desired
  # All variables in 'implicates' and 'static' are available
  # Construct electricity consumption per square foot
  kwh_per_ft2 <- data$electricity / data$square_feet
  # Binary (T/F) indicator if household uses natural gas
  use_natural_gas <- data$natural_gas > 0
  # Return data.frame of custom variables to be analyzed
  data.frame(kwh_per_ft2, use_natural_gas)
}
# Do analysis using variables produced by custom function
# Can include non-custom target variables as well
result <- analyze(x = list(mean = c("kwh_per_ft2", "use_natural_gas", "electricity")),
implicates = sim,
static = recs,
weight = "weight",
fun = my_fun)