In ummel/fusionModel: Data fusion and analysis of synthetic data in R

knitr::opts_chunk$set(
  echo = TRUE,
  cache = TRUE, # Useful for testing/editing; will create /README_cache directory
  comment = NA,
  fig.path = "man/figures/README-"
)

Overview

fusionModel enables variables unique to a "donor" dataset to be statistically simulated for (i.e. fused to) a "recipient" dataset. Variables common to both the donor and recipient are used to model and simulate the fused variables. The package provides a simple and efficient interface for general data fusion in R, leveraging state-of-the-art machine learning algorithms from Microsoft's LightGBM framework. It also provides tools for analyzing synthetic/simulated data, calculating uncertainty, and validating fusion output.

fusionModel was developed to allow statistical integration of microdata from disparate social surveys. It is the data fusion workhorse underpinning the larger fusionACS data platform under development at the Socio-Spatial Climate Collaborative. In this context, fusionModel is used to fuse variables from a range of social surveys onto microdata from the American Community Survey, allowing for analysis and spatial resolution otherwise impossible.

Motivation

The desire to “fuse” or otherwise integrate independent datasets has a long history, dating to at least the early 1970’s (Ruggles and Ruggles 1974; Alter 1974). Social scientists have long recognized that large amounts of unconnected data are “out there” – usually concerning the characteristics of households and individuals (i.e. microdata) – which we would, ideally, like to integrate and analyze as a whole. This aim falls under the general heading of “Statistical Data Integration” (SDI) (Lewaa et al. 2021).

The most prominent examples of data fusion have involved administrative record linkage. This consists of exact matching or probabilistic linking of independent datasets, using observable information like social security numbers, names, or birth dates of individuals. Record linkage is the gold standard and can yield incredibly important insights and high levels of statistical confidence, as evidenced by the pioneering work of Raj Chetty and colleagues.

However, record linkage is rarely feasible for the kinds of microdata that most researchers use day-to-day (nevermind the difficulty of accessing administrative data). While the explosion of online tracking and social network data will undoubtedly offer new lines of analysis, for the time being, at least, social survey microdata remain indispensable. The challenge and promise recognized 50 years ago by Nancy and Richard Ruggles remains true today:

Unfortunately, no single microdata set contains all of the different kinds of information required for the problems which the economist wishes to analyze. Different microdata sets contain different kinds of information…A great deal of information is collected on a sample basis. Where two samples are involved the probability of the same individual appearing in both may be very small, so that exact matching is impossible. Other methods of combining the types of information contained in the two different samples into one microdata set will be required. (Ruggles and Ruggles 1974; 353-354)

Practitioners regularly impute or otherwise predict a variable or two from one dataset on to another. Piecemeal, ad hoc data fusion is a common necessity of quantitative research. Proper data fusion, on the other hand, seeks to systematically combine “two different samples into one microdata set”.

The size and nature of the samples involved and the intended analyses strongly influence the choice of data integration technique and the structure of the output. This has led to the relevant literature being both diverse and convoluted, as practitioners take on different data “setups” and objectives. In the context of fusionACS, we are interested in the following problem:

We have microdata from two independent surveys, A and B, that sample the same underlying population and time period (e.g. occupied U.S. households nationwide in 2018). We specify that A is the “recipient” dataset and B is the “donor”. The goal is to generate a new dataset, C, that has the original survey responses of A plus a realistic representation of how each respondent in A might have answered the questionnaire of survey B. To do this, we identify a set of common/shared variables X that both surveys solicit. We then attempt to fuse a set of variables unique to B – call them Z, the “fusion variables” – onto the original microdata of A, conditional on X.

Methodology

The fusion strategy implemented in the fusionModel package borrows and expands upon ideas from the statistical matching (D’Orazio et al. 2006), imputation (Little and Rubin 2019), and data synthesis (Drechsler 2011) literatures to create a flexible data fusion tool. It employs variable-k, conditional expectation matching that leverages high-performance gradient boosting algorithms. The package accommodates fusion of many variables, individually or in blocks, and efficient computation when the recipient is large relative to the donor.

Specifically, the goal was to create a data fusion tool that meets the following requirements:

Accommodate donor and recipient datasets with divergent sample sizes
Handle continuous, categorical, and semi-continuous (zero-inflated) variable types
Ensure realistic values for fused variables
Scale efficiently for larger datasets
Fuse variables “one-by-one” or in “blocks”
Employ a data modeling approach that:
Makes no distributional assumptions (i.e. non-parametric)
Automatically detects non-linear and interaction effects
Automatically selects predictor variables from a potentially large set
Ability to prevent overfitting (e.g. cross-validation)

Complete methodological details are available in the fusionACS Guidebook (INSERT LINK).

Installation

devtools::install_github("ummel/fusionModel")

library(fusionModel)

Simple fusion

suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(fst)))  # Just to void messages when it loads within code

The package includes example microdata from the 2015 Residential Energy Consumption Survey (see ?recs for details). For real-world use cases, the donor and recipient data are typically independent and vary in sample size. For illustrative purposes, we will randomly split the recs microdata into separate "donor" and "recipient" datasets with an equal number of observations.

# Rows to use for donor dataset
d <- seq(from = 1, to = nrow(recs), by = 2)

# Create donor and recipient datasets
donor <- recs[d, c(2:16, 20:22)]
recipient <- recs[-d, 2:14]

# Specify fusion and shared/common predictor variables
predictor.vars <- names(recipient)
fusion.vars <- setdiff(names(donor), predictor.vars)

The recipient dataset contains r ncol(recipient) variables that are shared with donor. These shared "predictor" variables provide a statistical link between the two datasets. fusionModel exploits the information in these shared variables.

predictor.vars

There are r length(fusion.vars) "fusion variables" unique to donor. These are the variables that will be fused to recipient. This includes a mix of continuous and categorical (factor) variables.

# The variables to be fused
sapply(donor[fusion.vars], class)

We create a fusion model using the train() function. The minimal usage is shown below. See ?train for additional function arguments and options. By default, this results in a ".fsn" (fusion) object being saved to "fusion_model.fsn" in the current working directory.

# Train a fusion model
fsn.model <- train(data = donor, 
                   y = fusion.vars, 
                   x = predictor.vars)

To fuse variables to recipient, we simply pass the recipient data and path of the .fsn model to the fuse() function. Each variable specified in fusion.vars is fused in the order provided. By default, fuse() generates a single implicate (version) of synthetic outcomes. Later, we'll work with multiple implicates to perform proper analysis and uncertainty estimation.

# Fuse 'fusion.vars' to the recipient
sim <- fuse(data = recipient, 
            fsn = fsn.model)

Let's look at the the recipient dataset's fused/simulated variables. Note that your results will look different, because each call to fuse() generates a unique, probabilistic set of outcomes.

head(sim)

We can do some quick sanity checks to compare the distribution of the fusion variables in donor with those in sim. This, at least, confirms that the fusion output is not obviously wrong. Later, we'll perform a formal internal validation exercise using multiple implicates.

sim <- data.frame(sim)

# Compare means of the continuous variables
cbind(donor = colMeans(donor[fusion.vars[3:5]]), sim = colMeans(sim[fusion.vars[3:5]]))

# Compare frequencies of categorical variable classes
cbind(donor = table(donor$insulation), sim = table(sim$insulation))
cbind(donor = table(donor$aircon), sim = table(sim$aircon))

And we can look at kernel density plots of the non-zero values for the continuous variables to see if the univariate distributions in donor are generally similar in sim.

# Univariate naturally continuous case
# Compare density plots for select continuous variables

# Create 'pdata' data frame for subsequent plots
#rec <- mutate(cbind(recipient, sim), dataset = "fused")
fsd <- mutate(sim, dataset = "fused")
don <- mutate(donor, dataset = "donor")
pdata <- bind_rows(don, fsd)

# The "natural" continuous variables (SPECIFY MANUALLY)
ncont <- c("square_feet", "electricity", "natural_gas")

# Density plots
pdata[c("dataset", ncont)] %>%
  pivot_longer(cols = -1L) %>%
  filter(value != 0) %>%
  ggplot(aes(x = value, color = dataset)) +
  geom_density(size = 1) +
  theme(legend.position = "top") +
  facet_wrap(~ name, scales = "free")
#ggtitle("Distribution of continuous variables (non-zero values)")

Advanced fusion

For this call to train(), we specify a set of hyperparameters to search over when training each LightGBM gradient boosting model (see ?train for details). The hyperparameters can be used to tune the underlying GBM models for better cross-validated performance. We also set nfolds = 10 (default is 5) to indicate the number of cross-validation folds to use. Since this requires additional computation, the cores argument is used to enable parallel processing.

# Train a fusion model with variable blocks
fsn.model <- train(data = donor, 
                   y = fusion.vars, 
                   x = predictor.vars,
                   nfolds = 10,
                   hyper = list(boosting = c("gbdt", "goss"),
                                num_leaves = c(10, 30),
                                feature_fraction = c(0.7, 0.9)),
                   cores = 2)

We generally want to create multiple versions of the simulated fusion variables -- called implicates -- in order to reduce bias in point estimates and calculate associated uncertainty. We can do this using the M argument within fuse(). Here we generate 10 implicates; i.e. 10 unique, probabilistic representations of what the recipient records might look like with respect to the fusion variables.

# Fuse multiple implicates to the recipient
sim10 <- fuse(data = recipient, 
              fsn = fsn.model,
              M = 10)

Note that each implicate in sim10 is identified by the "M" variable/column.

head(sim10)
table(sim10$M)

Analyzing fused data

The fused values are inherently probabilistic, reflecting uncertainty in the underlying statistical models. Multiple implicates are needed to calculate unbiased point estimates and associated uncertainty for any particular analysis of the data. In general, more implicates is preferable but requires more computation.

Since proper analysis of multiple implicates can be rather cumbersome – both from a coding and mathematical standpoint – the analyze() function provides a convenient way to calculate point estimates and associated uncertainty for common analyses. Potential analyses currently include variable means, proportions, sums, counts, and medians, (optionally) calculated for population subgroups.

For example, to calculate the mean value of the "electricity" variable across all observations in the recipient dataset, we do the following.

analyze(x = list(mean = "electricity"),
        implicates = sim10)

When the response variable is categorical, analyze() automatically returns the proportions associated with each factor level.

analyze(x = list(mean = "aircon"),
        implicates = sim10)

If we want to perform an analysis across subsets of the recipient population -- for example, calculate the mean value of "electricity" by household "income" -- we can use the by and static arguments. We see that mean electricity consumption increases with household income.

analyze(x = list(mean = "electricity"),
        implicates = sim10,
        static = recipient,
        by = "income")

It is also possible to do multiple kinds of analyses in a single call to analyze(). For example, the following call calculates the mean value of "natural_gas" and "square_feet", the median value of "square_feet", and the sum of "electricity" (i.e. total consumption) and "insulation" (i.e. total count of each level). All of these estimates are calculated for each population subgroup defined by the intersection of "race" and "urban_rural" status.

result <- analyze(x = list(mean = c("natural_gas", "square_feet"),
                           median = "square_feet",
                           sum = c("electricity", "insulation")),
                  implicates = sim10,
                  static = recipient,
                  by = c("race", "urban_rural"))

We can then (for example) isolate the results for white households in rural areas. Notice that the mean estimate of "square_feet" exceeds the median, reflecting the skewed distribution.

subset(result, race == "White" & urban_rural == "Rural")

More complicated analyses can be performed using the custom fun argument to analyze(). See the Examples section of ?analyze.

Validating fusion models

The validate() function provides a convenient way to perform internal validation tests on synthetic variables that have been fused back onto the original donor data. This allows us to assess the quality of the underlying fusion model; it is analogous to assessing model skill by comparing predictions to the observed training data.

validate() compares analytical results derived using the multiple-implicate fusion output with those derived using the original donor microdata. By performing analyses on population subsets of varying size, validate() estimates how the synthetic variables perform for analyses of varying difficulty/complexity. It computes fusion variable means and proportions for subsets of the full sample – separately for both the observed and fused data – and then compares the results.

First, we fuse multiple implicates of the fusion.vars using the original donor data -- not the recipient data, as we did previously.

sim <- fuse(data = donor,
            fsn = fsn.model,
            M = 40)

Next, we pass the sim results to validate(). The argument subset_vars specifies that we want the validation exercise to compare observed (donor) and simulated point estimates across population subsets defined by "income", "age", "race", and "education". See ?validate for more details.

valid <- validate(observed = donor,
                  implicates = sim,
                  subset_vars = c("income", "age", "race", "education"))

The validate() output includes ggplot2 graphics that helpfully summarize the validation results. For example, the plot below shows how the observed and simulated point estimates compare, using median absolute percent error as the performance metric. We see that the synthetic data do a very good job reproducing the point estimates for all fusion variables when the population subset in question is reasonably large. For smaller subsets -- i.e. more difficult analyses due to small sample size -- "square_feet", "natural_gas", and "electricity" remain well modeled, but the error increases more rapidly for "aircon" and "insulation". This information is useful for understanding what kind of reliability we can expect for particular variables and types of analyses, given the underlying fusion model and data.