```r
knitr::opts_chunk$set(
  collapse = TRUE, error = FALSE, warning = FALSE,
  message = FALSE, comment = "#>")

default_options <- options()
```
With the rapid advancement of artificial intelligence and machine learning (AI/ML), researchers from a wide range of disciplines increasingly use predictions from pre-trained algorithms as outcome variables in statistical analyses. However, treating algorithmically derived values as if they were measured outcomes can lead to biased estimates and anti-conservative inference (Hoffman et al., 2023), because the error and uncertainty in the predictions are not accounted for in the downstream analysis. This is the central statistical challenge when drawing inference on predicted data (IPD).
Several works have proposed methods for IPD, including post-prediction inference (PostPI) by Wang et al., 2020, prediction-powered inference (PPI) and PPI++ by Angelopoulos et al., 2023a and 2023b, and post-prediction adaptive inference (PSPA) by Miao et al., 2023. To enable researchers and practitioners interested in these state-of-the-art methods, we have developed ipd, an open-source R package that implements these methods under the umbrella of IPD.
This vignette provides a guide to using the ipd package, including installation instructions and examples of data generation, model fitting, and the custom methods provided. The examples demonstrate the package's core functionality.
Following the notation of Miao et al., 2023, we assume the data are split into a small "labeled" subset, in which the outcome $Y$ is observed, and a larger "unlabeled" subset, in which only the predicted outcome $f(\boldsymbol{X})$ is available. The model of interest relates the outcome to the covariates in the unlabeled data:
$$\mathbb{E}\left[Y^{\mathcal{U}} \mid \boldsymbol{X}^{\mathcal{U}}\right] = g^{-1}\left(\boldsymbol{X}^{\mathcal{U}'}\beta\right),$$
where $\beta$ is a vector of regression coefficients and $g(\cdot)$ is a given link function, such as the identity link for linear regression, the logistic link for logistic regression, or the log link for Poisson regression. However, in practice, we do not observe $Y^{\mathcal{U}}$ in the "unlabeled" subset of the data. Instead, these values are replaced by the predicted $f(\boldsymbol{X}^{\mathcal{U}})$. We can use methods for IPD to obtain corrected estimates and standard errors when we replace these unobserved $Y^{\mathcal{U}}$ with $f(\boldsymbol{X}^{\mathcal{U}})$.
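To build intuition for how such corrections work, consider the simplest possible target, a population mean. Prediction-powered inference (Angelopoulos et al., 2023a) debiases the average prediction in the unlabeled data with a "rectifier" estimated from the labeled data; schematically:

$$\hat{\theta}^{\mathrm{PP}} = \frac{1}{N}\sum_{i=1}^{N} f\left(\boldsymbol{X}_i^{\mathcal{U}}\right) - \frac{1}{n}\sum_{i=1}^{n} \left\{ f\left(\boldsymbol{X}_i^{\mathcal{L}}\right) - Y_i^{\mathcal{L}} \right\},$$

where $\mathcal{L}$ and $\mathcal{U}$ index the labeled and unlabeled subsets of sizes $n$ and $N$, respectively. The regression methods implemented in the package generalize this idea through rectified estimating equations, with method-specific choices for how the correction and its variance are estimated.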
To install the development version of ipd from GitHub, you can use the devtools package:
```r
#-- Install devtools if it is not already installed
install.packages("devtools")

#-- Install the ipd package from GitHub
devtools::install_github("ipd-tools/ipd")
```
We provide a simple example to demonstrate the basic use of the functions included in the ipd package.
```r
#-- Load ipd Package
library(ipd)
```
The ipd package provides a unified function, simdat, for generating synthetic datasets for various models. The function currently supports "mean", "quantile", "ols", "logistic", and "poisson" models, and takes the following arguments:
- n: A vector of size 3 indicating the sample sizes in the training, labeled, and unlabeled data sets.
- effect: A float specifying the regression coefficient for the first variable of interest (defaults to 1).
- sigma_Y: A float specifying the residual variance for the generated outcome.
- model: The type of model to be generated. Must be one of "mean", "quantile", "ols", "logistic", or "poisson".

The simdat function generates a data.frame with three subsets: (1) an independent "training" set with additional observations used to fit a prediction model, and (2) "labeled" and (3) "unlabeled" sets, which contain the observed and predicted outcomes and the simulated features of interest.
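As a quick sanity check, the subset sizes in the generated data.frame should match the n argument. This sketch assumes the subset indicator column is named set_label, as it is throughout this vignette:

```r
library(ipd)

#-- Generate a small dataset and confirm the subset sizes match n
set.seed(1)
dat_check <- simdat(n = c(100, 50, 25), effect = 1, sigma_Y = 1, model = "ols")

#-- Tabulate the subset labels; counts should be 100, 50, and 25
table(dat_check$set_label)
```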
We can generate a continuous outcome and relevant predictors for linear regression as follows. The simdat function generates four independent covariates, $X_1$, $X_2$, $X_3$, and $X_4$, and the outcome:
$$Y = \text{effect}\times X_1 + \frac{1}{2}\times X_2^2 + \frac{1}{3}\times X_3^3 + \frac{1}{4}\times X_4^2 + \varepsilon_y$$
where effect is one of the function arguments and $\varepsilon_y \sim N(0, \text{sigma\_Y})$, with sigma_Y being another argument. Here, the simdat function generates three subsets of data, a "training" subset, a "labeled" subset, and an "unlabeled" subset, based on the sizes in n. It then learns the prediction rule for the outcome in the "training" subset using a generalized additive model and predicts these outcomes in the "labeled" and "unlabeled" subsets:
```r
#-- Generate a Dataset for Linear Regression
set.seed(123)
n <- c(10000, 500, 1000)
dat_ols <- simdat(n = n, effect = 1, sigma_Y = 4, model = "ols")

#-- Print First 6 Rows of Training, Labeled, and Unlabeled Subsets
options(digits = 2)
head(dat_ols[dat_ols$set_label == "training", ])
head(dat_ols[dat_ols$set_label == "labeled", ])
head(dat_ols[dat_ols$set_label == "unlabeled", ])
```
The simdat function provides observed and unobserved outcomes for both the labeled and unlabeled datasets, though in practice the observed outcomes are not available in the unlabeled set. We can visualize the relationships between these variables in the labeled data subset:
```r
library(tidyverse)
library(patchwork)

#-- Plot example labeled data
dat_labeled <- dat_ols[dat_ols$set_label == "labeled", ]

p1 <- dat_labeled |>
  pivot_longer(Y:f) |>
  mutate(
    name = factor(name, levels = c("f", "Y"),
      labels = c("Predicted ", "Observed ")) |> fct_rev()) |>
  ggplot(aes(x = X1, y = value, group = name, color = name, fill = name)) +
  theme_bw() +
  coord_fixed(1 / 3) +
  geom_abline(slope = 1, intercept = 0) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  scale_x_continuous(limits = c(-2.5, 2.5)) +
  scale_y_continuous(limits = c(-7.5, 7.5)) +
  scale_color_manual(values = c("#AA4AC4", "#00C1D5")) +
  scale_fill_manual(values = c("#AA4AC4", "#00C1D5")) +
  labs(
    x = "\nCovariate", y = "Outcome\n", fill = NULL, color = NULL,
    title = "Outcomes vs. Covariate") +
  theme(
    legend.direction = "horizontal",
    legend.position = "inside",
    legend.position.inside = c(0.5, 0.91),
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 8),
    title = element_text(size = 8))

p2 <- dat_labeled |>
  ggplot(aes(x = f, y = Y)) +
  theme_bw() +
  coord_fixed(1) +
  geom_abline(slope = 1, intercept = 0) +  # identity line
  geom_point(color = "#00C1D5", alpha = 0.5) +
  geom_smooth(method = "lm", color = "#00C1D5") +
  scale_x_continuous(limits = c(-7.5, 7.5)) +
  scale_y_continuous(limits = c(-7.5, 7.5)) +
  labs(
    x = "\nPredicted Outcome", y = "Observed Outcome\n",
    title = "Predicted vs. Observed Outcomes") +
  theme(
    axis.title = element_text(size = 8),
    axis.text = element_text(size = 8),
    title = element_text(size = 8))

fig1 <- (p1 + theme(plot.margin = unit(c(0, 20, 0, 0), "pt"))) +
  (p2 + theme(plot.margin = unit(c(0, 20, 0, 20), "pt"))) +
  plot_annotation(tag_levels = "A")

fig1
```
We can see that the predicted outcomes track the observed outcomes but are not perfectly calibrated: the fitted relationship with the covariate is attenuated for the predictions (panel A), and the predictions deviate from the identity line when plotted against the observed outcomes (panel B).
As another example, we can generate a binary outcome and relevant predictors for logistic regression as follows:
```r
#-- Generate a Dataset for Logistic Regression
set.seed(123)
dat_logistic <- simdat(n = n, effect = 3, sigma_Y = 1, model = "logistic")

#-- Print First 6 Rows of Training, Labeled, and Unlabeled Subsets
head(dat_logistic[dat_logistic$set_label == "training", ])
head(dat_logistic[dat_logistic$set_label == "labeled", ])
head(dat_logistic[dat_logistic$set_label == "unlabeled", ])
```
```r
dat_logistic_labeled <- dat_logistic[dat_logistic$set_label == "labeled", ]

dat_logistic_labeled_summ <- dat_logistic_labeled |>
  group_by(Y, f) |>
  count() |>
  ungroup() |>
  mutate(
    Y = factor(Y), f = factor(f),
    pct = n / sum(n) * 100,
    fill = if_else(Y == f, 1, 0))
```
We can again visualize the relationship between the true and predicted outcomes in the labeled data subset and see the percentage of observations that are correctly classified:
```r
dat_logistic_labeled_summ |>
  ggplot(aes(x = f, y = Y, fill = fill)) +
  geom_tile() +
  coord_equal() +
  geom_text(
    aes(label = paste0(n, " (", sprintf("%2.1f", pct), "%)")), vjust = 1) +
  scale_x_discrete(expand = c(0, 0), limits = rev) +
  scale_y_discrete(expand = c(0, 0)) +
  scale_fill_gradient(high = "#00C1D5", low = "#FFB500") +
  theme(legend.position = "none")
```
We compare two non-IPD approaches for analyzing the data, a "naive" regression of the predicted outcomes on the covariates in the unlabeled data and a "classic" regression of the observed outcomes on the covariates in the labeled data, to the methods included in the ipd package.
```r
#--- Fit the Naive Regression
lm(f ~ X1, data = dat_ols[dat_ols$set_label == "unlabeled", ]) |>
  summary()
```
```r
#--- Fit the Classic Regression
lm(Y ~ X1, data = dat_ols[dat_ols$set_label == "labeled", ]) |>
  summary()
```
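Placing the two uncorrected fits side by side highlights the trade-off: the "naive" analysis treats the predictions as observed outcomes, while the "classic" analysis discards the unlabeled data entirely. A minimal sketch using the objects defined above:

```r
#-- Compare the naive and classic slope estimates for X1
fit_naive   <- lm(f ~ X1, data = dat_ols[dat_ols$set_label == "unlabeled", ])
fit_classic <- lm(Y ~ X1, data = dat_ols[dat_ols$set_label == "labeled", ])

#-- Stack the X1 coefficient rows: estimate, std. error, t value, p-value
rbind(
  naive   = summary(fit_naive)$coefficients["X1", ],
  classic = summary(fit_classic)$coefficients["X1", ])
```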
You can fit the various IPD methods to your data and obtain summaries using the provided wrapper function, ipd():
```r
#-- Specify the Formula
formula <- Y - f ~ X1

#-- Fit the PostPI Bootstrap Correction
nboot <- 200
ipd::ipd(formula, method = "postpi_boot", model = "ols",
  data = dat_ols, label = "set_label", nboot = nboot) |>
  summary()
```
```r
#-- Fit the PostPI Analytic Correction
ipd::ipd(formula, method = "postpi_analytic", model = "ols",
  data = dat_ols, label = "set_label") |>
  summary()
```
```r
#-- Fit the PPI Correction
ipd::ipd(formula, method = "ppi", model = "ols",
  data = dat_ols, label = "set_label") |>
  summary()
```
```r
#-- Fit the PPI++ Correction
ipd::ipd(formula, method = "ppi_plusplus", model = "ols",
  data = dat_ols, label = "set_label") |>
  summary()
```
```r
#-- Fit the PSPA Correction
ipd::ipd(formula, method = "pspa", model = "ols",
  data = dat_ols, label = "set_label") |>
  summary()
```
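The corrected fits can also be collected into a single comparison table. This sketch assumes the package's tidy() method returns broom-style columns, including term, estimate, and std.error:

```r
#-- Fit each IPD method and stack the corrected estimates for X1
methods <- c("postpi_boot", "ppi", "ppi_plusplus", "pspa")

est_tab <- do.call(rbind, lapply(methods, function(m) {
  # nboot is only needed for the bootstrap-based correction
  fit <- if (m == "postpi_boot") {
    ipd::ipd(Y - f ~ X1, method = m, model = "ols",
      data = dat_ols, label = "set_label", nboot = 200)
  } else {
    ipd::ipd(Y - f ~ X1, method = m, model = "ols",
      data = dat_ols, label = "set_label")
  }
  cbind(method = m, subset(tidy(fit), term == "X1"))
}))

est_tab
```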
We also show how these methods compare for logistic regression.
```r
#--- Fit the Naive Regression
glm(f ~ X1, family = binomial,
  data = dat_logistic[dat_logistic$set_label == "unlabeled", ]) |>
  summary()
```
```r
#--- Fit the Classic Regression
glm(Y ~ X1, family = binomial,
  data = dat_logistic[dat_logistic$set_label == "labeled", ]) |>
  summary()
```
You can again fit the various IPD methods to your data and obtain summaries using the provided wrapper function, ipd():
```r
#-- Specify the Formula
formula <- Y - f ~ X1

#-- Fit the PostPI Bootstrap Correction
nboot <- 200
ipd::ipd(formula, method = "postpi_boot", model = "logistic",
  data = dat_logistic, label = "set_label", nboot = nboot) |>
  summary()
```
```r
#-- Fit the PPI Correction
ipd::ipd(formula, method = "ppi", model = "logistic",
  data = dat_logistic, label = "set_label") |>
  summary()
```
```r
#-- Fit the PPI++ Correction
ipd::ipd(formula, method = "ppi_plusplus", model = "logistic",
  data = dat_logistic, label = "set_label") |>
  summary()
```
```r
#-- Fit the PSPA Correction
ipd::ipd(formula, method = "pspa", model = "logistic",
  data = dat_logistic, label = "set_label") |>
  summary()
```
The package also provides custom print, summary, tidy, glance, and augment methods to facilitate easy model inspection:
```r
#-- Fit the PostPI Bootstrap Correction
nboot <- 200
fit_postpi <- ipd::ipd(formula, method = "postpi_boot", model = "ols",
  data = dat_ols, label = "set_label", nboot = nboot)
```
The print method gives an abbreviated summary of the output from the ipd function:
```r
#-- Print the Model
print(fit_postpi)
```
The summary method gives more detailed information about the estimated coefficients, standard errors, and confidence limits:
```r
#-- Summarize the Model
summ_fit_postpi <- summary(fit_postpi)

#-- Print the Model Summary
print(summ_fit_postpi)
```
The tidy method organizes the model coefficients into a tidy format.
```r
#-- Tidy the Model Output
tidy(fit_postpi)
```
The glance method returns a one-row summary of the model fit.
```r
#-- Get a One-Row Summary of the Model
glance(fit_postpi)
```
The augment method adds model predictions and residuals to the original dataset.
```r
#-- Augment the Original Data with Fitted Values and Residuals
augmented_df <- augment(fit_postpi)
head(augmented_df)
```
```r
options(default_options)
```
The ipd package offers a suite of functions for conducting inference on predicted data. With custom methods for printing, summarizing, tidying, glancing at, and augmenting model outputs, ipd streamlines IPD-based inference in R. We will continue to develop this package to include more targets of inference and IPD methods as they are developed, as well as additional functionality for analyzing such data. For further information and detailed documentation, please refer to the function help pages within the package, e.g.,
?ipd
For questions, comments, or any other feedback, please contact the developers (ssalerno@fredhutch.org).
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
This package is licensed under the MIT License.