analyze: Analyze fusion output
In ummel/fusionModel: Data fusion and analysis of synthetic data in R

Description

Calculation of point estimates and associated margin of error for analyses using fused/synthetic microdata. Can calculate means, proportions, sums, counts, and medians, optionally across population subgroups.

Usage

analyze(
  x,
  implicates,
  static = NULL,
  weight = NULL,
  rep_weights = NULL,
  by = NULL,
  fun = NULL,
  var_scale = 4,
  cores = 1
)

Arguments

`x`	List. Named list specifying the desired analysis type(s) and the associated target variable(s). Example: `x = list(mean = c("v1", "v2"), median = "v3")` translates as: "Return the mean value of variables v1 and v2 and the median of v3". Supported analysis types include `mean`, `sum`, and `median`. Mean and sum automatically return proportions and counts, respectively, if the target variable is a factor. Target variables must be in `implicates`, `static`, or a data.frame returned by a custom `fun`.
`implicates`	Data frame. Implicates of synthetic (fused) variables. Typically generated by fuse. The implicates should be row-stacked and identified by integer column "M".
`static`	Data frame. Optional static (non-synthetic) variables that do not vary across implicates. Note that `nrow(static) = nrow(implicates) / max(implicates$M)` and the row-ordering is assumed to be consistent between `static` and `implicates`.
`weight`	Character. Name of the observation weights column in `static`. If NULL (default), uniform weights are assumed.
`rep_weights`	Character. Optional vector of replicate weight columns in `static`. If provided, the returned margins of error reflect additional variance due to uncertainty in sample weights.
`by`	Character. Optional column name(s) in `implicates` or `static` (typically factors) that collectively define the set of population subgroups for which each analysis is executed. If `NULL`, analysis is done for the whole sample.
`fun`	Function. Optional function applied to input data prior to executing analyses. Can be used to do non-conventional/custom analyses.
`var_scale`	Scalar. Factor by which to scale the unadjusted replicate weight variance. This is determined by the survey design. The default (`var_scale = 4`) is appropriate for ACS and RECS.
`cores`	Integer. Number of cores used. Only applicable on Unix systems.
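
The layout expected of `implicates` and `static` can be illustrated with a minimal sketch on made-up toy data. The column names `y`, `w`, and `region` are hypothetical, and no fusionModel functions are called; the sketch only shows the row-stacking and the dimension relationship described above.

```r
# Toy illustration of the expected input layout (all names and values are made up)
n <- 4   # number of observations
M <- 3   # number of implicates

# 'static': one row per observation; holds non-synthetic variables such as weights
static <- data.frame(
  w = c(1.0, 2.0, 1.5, 0.5),
  region = factor(c("a", "a", "b", "b"))
)

# 'implicates': M row-stacked copies of the synthetic variable(s),
# identified by the integer column "M"
set.seed(1)
implicates <- data.frame(
  M = rep(seq_len(M), each = n),
  y = rnorm(n * M, mean = 10)
)

# Dimension relationship stated in the 'static' argument description
dims_ok <- nrow(static) == nrow(implicates) / max(implicates$M)

# Row i of 'static' is assumed to correspond to the i-th row within each implicate
first_implicate <- cbind(static, implicates[implicates$M == 1, "y", drop = FALSE])
```

Because row order is assumed rather than checked by a key, any sorting or filtering applied to one input should be applied identically to the other.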

Details

At a minimum, the user must supply synthetic implicates (typically generated by fuse). Inputs are checked for consistent dimensions.

If `implicates` contains only a single implicate and `rep_weights = NULL`, the "typical" standard error is returned, since there is nothing to pool across implicates. A warning is issued to make sure the user is aware of the situation.

Estimates and standard errors for the requested analysis are calculated separately for each implicate. The final point estimate is the mean estimate across implicates. The final standard error is the pooled SE across implicates, calculated using Rubin's pooling rules (1987).
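
The pooling step can be sketched as follows, assuming the textbook form of Rubin's rules (total variance equals the mean within-implicate variance plus the between-implicate variance inflated by 1 + 1/M). The helper `pool_rubin()` is hypothetical and is not part of fusionModel; the package's internal calculation may differ in detail.

```r
# Hypothetical helper: pool per-implicate estimates and standard errors
# using the textbook form of Rubin's rules (an assumption, not package code)
pool_rubin <- function(est, se) {
  m <- length(est)
  q_bar <- mean(est)                   # pooled point estimate
  u_bar <- mean(se^2)                  # mean within-implicate variance
  b <- var(est)                        # between-implicate variance
  t_var <- u_bar + (1 + 1 / m) * b     # total variance
  c(est = q_bar, se = sqrt(t_var))
}

# Made-up estimates and standard errors from four implicates
pooled <- pool_rubin(est = c(10.2, 9.8, 10.1, 10.3),
                     se = c(0.50, 0.55, 0.48, 0.52))
```

The pooled standard error is never smaller than the root mean within-implicate variance; disagreement between implicates can only widen it.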

When replicate weights are provided, the standard errors of each implicate are calculated via the variance of estimates across replicates. The within-implicate variance is calculated around the point estimate (rather than around the mean of the replicates). This is equivalent to `mse = TRUE` in `svrepdesign()` from the survey package, and it appears to be the appropriate method for most surveys. Calculations leverage data.table operations for speed and memory efficiency.
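
A minimal sketch of the replicate-weight calculation for a weighted mean is given below. It assumes that the "unadjusted" variance is the mean squared deviation of the replicate estimates from the point estimate, which is then multiplied by `var_scale`. The helper `rep_weight_se()` and the toy data are hypothetical, and the package's data.table implementation may differ.

```r
# Hypothetical helper: replicate-weight standard error of a weighted mean
# Assumption: variance = var_scale * mean squared deviation around the point estimate
rep_weight_se <- function(y, w, rep_w, var_scale = 4) {
  theta <- weighted.mean(y, w)   # point estimate using the primary weights
  # One estimate per replicate weight column
  theta_r <- apply(rep_w, 2, function(wr) weighted.mean(y, wr))
  # Deviations are taken around 'theta', not around mean(theta_r)
  sqrt(var_scale * mean((theta_r - theta)^2))
}

# Made-up data: 5 observations and 8 replicate weight columns
set.seed(42)
y <- c(12, 15, 9, 20, 14)
w <- c(1.0, 2.0, 1.5, 0.5, 1.0)
rep_w <- sapply(1:8, function(r) w * runif(length(w), 0.5, 1.5))

se_y <- rep_weight_se(y, w, rep_w)
```

Centering on the point estimate yields a value at least as large as centering on the mean of the replicate estimates, so it is the more conservative of the two choices.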

If replicate weights are NOT provided, the standard errors of each implicate are calculated using variance within the implicate. For means, the ratio variance approximation of Cochran (1977) is used, as this is known to be a good approximation of bootstrapped SEs for weighted means (Gatz and Smith 1995). For proportions, a generalization of the unweighted SE formula is used (see here).
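
For weighted means, the ratio variance approximation can be sketched as below. The formula follows the form commonly presented in Gatz and Smith (1995) and attributed to Cochran (1977); the helper `weighted_mean_se()` is hypothetical and the package's internal code may be organized differently.

```r
# Hypothetical helper: SE of a weighted mean via the ratio variance approximation
# (form as commonly presented in Gatz and Smith 1995; not package code)
weighted_mean_se <- function(x, w) {
  n <- length(x)
  xbar_w <- sum(w * x) / sum(w)   # weighted mean
  wbar <- mean(w)                 # mean weight
  a <- w * x - wbar * xbar_w
  b <- w - wbar
  v <- n / ((n - 1) * sum(w)^2) *
    (sum(a^2) - 2 * xbar_w * sum(a * b) + xbar_w^2 * sum(b^2))
  sqrt(v)
}

# Made-up data
x <- c(12, 15, 9, 20, 14, 11)
w <- c(1.0, 2.0, 1.5, 0.5, 1.0, 3.0)

se_weighted <- weighted_mean_se(x, w)
se_uniform <- weighted_mean_se(x, rep(1, length(x)))
```

With uniform weights the expression reduces to the familiar `sd(x) / sqrt(n)`, and rescaling all weights by a constant leaves the result unchanged.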

Value

A data.table reporting analysis results, possibly across subgroups defined in `by`. The returned quantities include:

N: Number of observations used for the analysis.
y: Target variable.
level: Levels of factor target variables.
type: Type of estimate returned: mean, proportion, sum, count, or median.
est: Point estimate.
moe: Margin of error associated with the 90% confidence interval.
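
Since `moe` corresponds to the 90% confidence interval, interval bounds follow directly as `est` plus or minus `moe`. The sketch below uses a made-up results table. The rescaling to a 95% margin additionally assumes that `moe` was formed as a normal critical value times the standard error, which may not hold exactly if the package uses a t-distribution after pooling.

```r
# Made-up results table with the 'est' and 'moe' columns described above
res <- data.frame(est = c(100, 250), moe = c(8, 20))

# Bounds of the 90% confidence interval
res$lower90 <- res$est - res$moe
res$upper90 <- res$est + res$moe

# Assumption: moe = z * SE with a normal critical value
z90 <- qnorm(0.95)                               # two-sided 90% critical value
res$se_approx <- res$moe / z90                   # approximate standard error
res$moe95_approx <- res$se_approx * qnorm(0.975) # approximate 95% margin of error
```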

References

Cochran, W. G. (1977). Sampling Techniques (3rd Edition). Wiley, New York.

Gatz, D.F., and Smith, L. (1995). The Standard Error of a Weighted Mean Concentration — I. Bootstrapping vs Other Methods. Atmospheric Environment, vol. 29, no. 11, 1185–1193.

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

Examples

# Load the package (provides train(), fuse(), analyze(), and the 'recs' data)
library(fusionModel)

# Build a fusion model using RECS microdata
fusion.vars <- c("electricity", "natural_gas", "aircon")
predictor.vars <- names(recs)[2:12]
fsn.path <- train(data = recs, y = fusion.vars, x = predictor.vars)

# Generate 30 implicates of the 'fusion.vars' using original RECS as the recipient
sim <- fuse(data = recs, fsn = fsn.path, M = 30)
head(sim)

#---------

# Multiple types of analyses can be done at once
# This calculates estimates using the full sample
result <- analyze(x = list(mean = c("natural_gas", "aircon"),
                           median = "electricity",
                           sum = c("electricity", "aircon")),
                  implicates = sim,
                  weight = "weight")

View(result)

#-----

# Mean electricity consumption, by climate zone and urban/rural status
result1 <- analyze(x = list(mean = "electricity"),
                   implicates = sim,
                   static = recs,
                   weight = "weight",
                   by = c("climate", "urban_rural"))

# Same as above but including sample weight uncertainty
# Note that only the first 30 replicate weights are used internally
result2 <- analyze(x = list(mean = "electricity"),
                   implicates = sim,
                   static = recs,
                   weight = "weight",
                   rep_weights = paste0("rep_", 1:96),
                   by = c("climate", "urban_rural"))

# Helper function for comparison plots
pfun <- function(x, y) {plot(x, y); abline(0, 1, lty = 2)}

# Inclusion of replicate weights does not affect estimates, but it does
# increase margin of error due to uncertainty in RECS sample weights
pfun(result1$est, result2$est)
pfun(result1$moe, result2$moe)

# Notice that relative uncertainty declines with subset size
plot(result1$N, result1$moe / result1$est)

#-----

# Use a custom function to perform more complex analyses
# Custom function should return a data frame with non-standard target variables

my_fun <- function(data) {

  # Manipulate 'data' as desired
  # All variables in 'implicates' and 'static' are available

  # Construct electricity consumption per square foot
  kwh_per_ft2 <- data$electricity / data$square_feet

  # Binary (T/F) indicator if household uses natural gas
  use_natural_gas <- data$natural_gas > 0

  # Return data.frame of custom variables to be analyzed
  data.frame(kwh_per_ft2, use_natural_gas)
}

# Do analysis using variables produced by custom function
# Can include non-custom target variables as well
result <- analyze(x = list(mean = c("kwh_per_ft2", "use_natural_gas", "electricity")),
                  implicates = sim,
                  static = recs,
                  weight = "weight",
                  fun = my_fun)

ummel/fusionModel documentation built on June 1, 2025, 11 p.m.