analyze_fusionACS: Analyze fusionACS microdata

analyze_fusionACS

R Documentation

Analyze fusionACS microdata

Description

For fusionACS internal use only. Calculation of point estimates and associated uncertainty (margin of error) for analyses using ACS and/or fused donor survey variables. Efficiently computes means, medians, sums, proportions, and counts, optionally across population subgroups. The use of native ACS weights or ORNL UrbanPop synthetic population weights is automatically determined given the requested geographic resolution. Requires a local /fusionData directory in the working directory path with assumed file structure and conventions.

Usage

analyze_fusionACS(
  analyses,
  year,
  respondent = "household",
  by = NULL,
  area = NULL,
  fun = NULL,
  M = Inf,
  R = Inf,
  cores = 1,
  version_up = 2,
  force_up = FALSE
)

Arguments

`analyses`	List. Specifies the desired analyses. Each analysis is a formula. See Details and Examples.
`year`	Integer. One or more years for which microdata are pooled to compute `analyses` (i.e. ACS recipient year). Currently defaults to `year = 2015:2019` if the `by` variables indicate a sub-PUMA analysis requiring UrbanPop weights.
`respondent`	Character. Should the `analyses` be computed using `"household"`- or `"person"`-level microdata?
`by`	Character. Optional variable(s) that collectively define the set of population subgroups for which each analysis is computed. Can be a mix of geographic (e.g. census tract) and/or socio-demographic microdata variables (e.g. poverty status); the latter may be existing variables on disk or custom variables created on-the-fly via `fun()`. If `NULL`, analysis is done for the whole (national) sample.
`area`	Call. Optional unquoted call specifying a geographic area within which to compute the `analyses`. Useful for restricting the study area to a manageable size.
`fun`	Function. Optional function for creating custom microdata variables that cannot be accommodated in `analyses`. Must take `data` and (optionally) `weight` as the only function arguments and must return a `data.frame` with number of rows equal to `nrow(data)`. See Details and Examples.
`M`	Integer. The first `M` implicates are used. Set `M = Inf` to use all available implicates.
`R`	Integer. The first `R` replicate weights are used. Set `R = Inf` to use all available replicate weights.
`cores`	Integer. Number of cores used for multithreading in `collapse` package functions.
`version_up`	Integer. Use `version_up = 2` to access 10-replicate weights or `version_up = 3` to access 40-replicate weights; both cover 17 metro areas.
`force_up`	Logical. If `TRUE`, force use of UrbanPop weights even if the requested analysis can be done using native ACS weights.

Details

Allowable geographic units of analysis specified in `by` are currently limited to: region, division, state, cbsa10, puma10, county10, cousubfp10 (county subdivision), zcta10 (zip code), tract10 (census tract), and bg10 (block group).

The final point estimates are the mean estimates across implicates. The final margin of error is derived from the pooled standard error across implicates, calculated using Rubin's (1987) pooling rules. The within-implicate standard errors are calculated using the replicate weights.
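The pooling step can be illustrated with a minimal base R sketch of the textbook form of Rubin's rules. This is not the package's internal code: the function name `pool_rubin()`, the input numbers, and the choice of a t-quantile for the 90% margin of error are all assumptions made for illustration.

```r
# Hypothetical per-implicate results for M = 4 implicates (made-up numbers)
est_m <- c(10.2, 9.8, 10.5, 10.1)   # point estimate from each implicate
se_m  <- c(0.50, 0.55, 0.48, 0.52)  # within-implicate SE (e.g. from replicate weights)

# Textbook Rubin (1987) pooling; an illustrative helper, not a fusionModel function
pool_rubin <- function(est_m, se_m, level = 0.90) {
  M     <- length(est_m)
  qbar  <- mean(est_m)               # pooled point estimate (mean across implicates)
  ubar  <- mean(se_m^2)              # mean within-implicate variance
  b     <- var(est_m)                # between-implicate variance
  total <- ubar + (1 + 1 / M) * b    # total variance
  df    <- (M - 1) * (1 + ubar / ((1 + 1 / M) * b))^2  # assumes b > 0
  se    <- sqrt(total)               # pooled standard error
  moe   <- qt(1 - (1 - level) / 2, df) * se            # assumed t-based margin of error
  c(est = qbar, se = se, df = df, moe = moe)
}

res <- pool_rubin(est_m, se_m)
res
```

The pooled standard error is always at least as large as the average within-implicate standard error, because the between-implicate variance adds the uncertainty introduced by the fusion (imputation) step.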

Each entry in the analyses list is a formula of the format Z ~ F(E), where Z is an optional, user-friendly name for the analysis, F is an allowable “outer function”, and E is an “inner expression” containing one or more microdata variables. For example:

mysum ~ mean(Var1 + Var2)

In this case, the outer function is mean(). Allowable outer functions are: mean(), sum(), median(), sd(), and var(). When the inner expression contains more than one variable, it is first evaluated and then F() is applied to the result. In this case, an internal variable X = Var1 + Var2 is generated across all observations, and then mean(X) is computed.
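To make the Z ~ F(E) structure concrete, the following base R sketch takes a formula apart and evaluates it on toy data. It only illustrates the idea; the data frame `d`, the weight column `w`, and the use of `weighted.mean()` are assumptions, and the package's own parsing and estimation code may differ.

```r
# Toy microdata with observation weights (made-up values)
d <- data.frame(Var1 = c(1, 2, 3, 4),
                Var2 = c(10, 20, 30, 40),
                w    = c(1, 1, 2, 2))

f <- mysum ~ mean(Var1 + Var2)

lhs   <- as.character(f[[2]])     # "mysum": the analysis name Z
rhs   <- f[[3]]                   # the call mean(Var1 + Var2)
outer <- as.character(rhs[[1]])   # "mean": the outer function F
inner <- rhs[[2]]                 # Var1 + Var2: the inner expression E

X <- eval(inner, envir = d)       # evaluate E across all observations

# Apply a weighted version of the outer function to the result
est <- switch(outer,
              mean = weighted.mean(X, d$w),
              sum  = sum(X * d$w))
est
```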

If no inner expression is desired, the analyses list can use the following convenient syntax to apply a single outer function to multiple variables:

mean = c("Var1", "Var2")

The inner expression can also utilize any function that takes variable names as arguments and returns a vector with the same length as the inputs. This is useful for defining complex operations in a separate function (e.g. microsimulation). For example:

myfun = function(Var1, Var2) {Var1 + Var2}

mysum ~ mean(myfun(Var1, Var2))
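A quick way to confirm that a custom function meets the length requirement is to call it on a few test vectors before using it in an analysis. The sketch below uses a hypothetical rule, `rebate_bill()`, invented purely for illustration.

```r
# Hypothetical rule: 10% rebate on combined spending above 100
rebate_bill <- function(Var1, Var2) {
  total <- Var1 + Var2
  total - 0.10 * pmax(total - 100, 0)   # pmax() keeps the operation vectorized
}

Var1 <- c(40, 80, 120)
Var2 <- c(30, 50, 60)
out  <- rebate_bill(Var1, Var2)

# The result must have one value per input observation
length(out) == length(Var1)
```

A function that passes this check could then appear in an inner expression, e.g. mybill ~ mean(rebate_bill(Var1, Var2)).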

The use of sum() or mean() with an inner expression that returns a categorical vector automatically results in category-wise weighted counts and proportions, respectively. For example, the following analysis would fail if evaluated literally, since mean() expects numeric input but the inner expression returns a character vector. Instead, it is interpreted as a request to return weighted proportions for each categorical outcome.

myprop ~ mean(ifelse(Var1 > 10, 'Yes', 'No'))
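The category-wise interpretation can be reproduced by hand in base R. The sketch below uses made-up data and a weight column `w` (an assumption) to show what weighted counts and weighted proportions mean for the expression above; it does not reflect the package's internal implementation.

```r
# Toy microdata with observation weights (made-up values)
d <- data.frame(Var1 = c(5, 12, 20, 8, 15),
                w    = c(2, 1, 1, 4, 2))

level <- ifelse(d$Var1 > 10, "Yes", "No")   # categorical inner expression

counts <- tapply(d$w, level, sum)   # sum(): weighted count per category
props  <- counts / sum(d$w)         # mean(): weighted proportion per category

counts
props
```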

analyze_fusionACS() uses "fast" versions of the allowable outer functions, as provided by the Fast Statistical Functions of the collapse package. These functions are highly optimized for weighted, grouped calculations. In addition, the outer functions mean(), sum(), and median() can use platform-independent multithreading across columns when cores > 1. Analyses with numerical inner expressions are processed using a series of calls to collap with unique observation weights. Analyses with categorical inner expressions utilize a series of calls to fsum.
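For readers unfamiliar with collapse, the sketch below shows the kind of grouped, weighted call involved. It calls fmean() and fsum() directly on toy data rather than going through collap(), and the data frame and column names are assumptions; it is not a reproduction of the calls made inside analyze_fusionACS().

```r
library(collapse)

# Toy microdata: subgroup g, numeric x, categorical level, weight w (made-up values)
d <- data.frame(g     = c("a", "a", "b", "b"),
                x     = c(1, 2, 3, 4),
                level = c("Yes", "No", "Yes", "Yes"),
                w     = c(1, 3, 1, 1))

# Numeric inner expression: weighted mean of x within each subgroup
grp_mean <- fmean(d$x, g = d$g, w = d$w)

# Categorical inner expression: weighted count for each subgroup-by-level cell,
# obtained by summing the weights within cells
cell_count <- fsum(d$w, g = list(d$g, d$level))

grp_mean
cell_count
```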

Value

A tibble reporting analysis results, possibly across subgroups defined in `by`. The returned quantities include:

lhs: Optional analysis name; the "left hand side" of the analysis formula.
rhs: The "right hand side" of the analysis formula.
type: Type of analysis: sum, mean, median, prop(ortion) or count.
level: Factor levels for categorical analyses; NA otherwise.
N: Mean number of valid microdata observations across all implicates and replicates; i.e. the sample size used to construct the estimate.
est: Point estimate; mean estimate across all implicates and replicates.
moe: Margin of error associated with the 90% confidence interval.
se: Standard error of the estimate.
df: Degrees of freedom used to calculate the margin of error.
cv: Coefficient of variation; conventional scale-independent measure of estimate reliability. Calculated as: 100 * moe / 1.645 / est
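As a worked example of the cv formula and of how the margin of error translates into a confidence interval, consider a hypothetical result row with est = 200 and moe = 32.9. The numbers are invented, and the symmetric interval est ± moe is the conventional reading of a 90% margin of error rather than something stated by the package.

```r
# Hypothetical values for one row of the returned tibble
est <- 200
moe <- 32.9

cv   <- 100 * moe / 1.645 / est             # coefficient of variation, in percent
ci90 <- c(lower = est - moe, upper = est + moe)   # 90% confidence interval

cv
ci90
```

Here moe / 1.645 is 20, so the cv is 10 percent of the estimate.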

References

Rubin, D.B. (1987). Multiple imputation for nonresponse in surveys. Hoboken, NJ: Wiley.

Examples

# Analysis using ACS native weights for year 2017, by PUMA, in South Atlantic Census Division
# Uses all available implicates and replicate weights
test <- analyze_fusionACS(analyses = list(high_burden ~ mean(dollarel / hincp > 0.05)),
                          year = 2017,
                          by = "puma10",
                          area = division == "South Atlantic")

# Analysis using UrbanPop 2015-2019 weights, by tract, in Utah (actually Salt Lake City metro given current UrbanPop data)
# Uses 5 (of possible 20) fusion implicates for RECS "dollarel" variable
# Uses 5 (of possible 10) UrbanPop replicate weights
test <- analyze_fusionACS(analyses = list(median_burden ~ median(dollarel / hincp)),
                          year = 2015:2019,
                          by = "tract10",
                          area = state_name == "Utah",
                          M = 5,
                          R = 5)

# User function to create custom variables from microdata
# Variables explicitly referenced in my_fun() are automatically loaded into 'data' within analyze_fusionACS()
# Variables returned by my_fun() may be used in 'by' or inner expressions of 'analyses'
my_fun <- function(data) {
  library(dplyr, quietly = TRUE)  # provides %>%, mutate() and select()
  data %>%
    mutate(elderly = agep >= 65,
           energy_expend = dollarel + dollarfo + dollarlp + dollarng,
           energy_burden = energy_expend / hincp,
           energy_burden = ifelse(hincp < 5000, NA, energy_burden)) %>%
    select(elderly, energy_burden, energy_expend)
}

# Analysis using UrbanPop 2015-2019 weights, by zip code and elderly head of household, in Atlanta CBSA
test <- analyze_fusionACS(analyses = list(energy_burden ~ mean(energy_burden),
                                          at_risk ~ mean(energy_burden > 0.075 | acequipm_pub == "No air conditioning")),
                          year = 2015:2019,
                          by = c("zcta10", "elderly"),
                          area = cbsa10 == "12060",
                          fun = my_fun,
                          M = 5,
                          R = 5)

ummel/fusionModel documentation built on June 1, 2025, 11 p.m.