Many modern problems leverage machine learning methods to predict outcomes based on observable covariates. Subsequent statistical modeling, e.g. to understand population trends in outcomes, often involves treating predicted outcomes as interchangeable with observed data. postpi
is an R package we developed to correct downstream statistical inference using outcomes predicted with an arbitrary machine learning method. Our package can be applied to both continuous and categorical data. postpi
contains three functions:
postpi_relate
postpi
postpi_relate()
, and an inference formula.postpi_der
Prepare a data set with observed outcomes and predicted outcomes/probabilities for each predicted categories, and covariates of interest for subsequant inferential analyses.
Split the data set into testing and validation sets. On testing set, use postpi_relate()
to estimate a relationship model.
On validation set, use postpi()
/postpi_der()
to conduct inferential analyses.
Note: If users have a subset of observed outcomes but no predicted outcomes, they should split the data set into three sets -- training, testing, and validation sets. On training set, use a machine learning method to train a prediction model, and apply it to testing and validation sets to get predicted results. Then users should repeat step 2-3 above to obtain inference results.
knitr::opts_chunk$set(echo = TRUE) devtools::load_all()
library(dplyr) library(kableExtra)
In this example, we use a data set RINdata
available in the package. RINdata
contains a column of observed RIN values named actual
, a column of predicted RIN values named prediction
obtained from a previous trained data set, and 200 columns of gene expression regions. We want to study associations between RINs and gene expression levels. A detailed description of the RINdata
data set is available at our paper Post-prediction inference.
RINdata
and split data into testing and validation sets. data("RINdata") data <- RINdata ## split the data into testing and validation sets using rsample package set.seed(2019) data_split <- rsample::initial_split(data, prop = 1/2) testing <- rsample::training(data_split) validation <- rsample::testing(data_split)
RINdata
, and pass it to postpi_relate()
to estimate a relationship model named rel_model
.## fit the relationship model on testing set rel_model <- testing %>% select(actual, predictions) %>% postpi_relate(actual)
predictions ~ region_10
such that we want to relate gene expression levels in region 10 to predicted RINs. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model
estimated above to the inference function postpi()
. In postpi()
we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi
.inf_formula <- predictions ~ region_10 ## fit the inference model on validation set and make iap corrections using bootstrap approach results_postpi <- validation %>% postpi(rel_model, inf_formula)
postpi_der
. In postpi()
we estiamte inference results using a derivation approach and we obtain the results in a tidy table format named results_der
.## fit the inference model on validation set and make iap corrections using derivation approach results_der <- testing %>% postpi_der(actual, predictions, validation, inf_formula)
results_postpi
from a bootstrap approach postpi()
, and results_der
from a derivation approach. We compare them to the no correction
approach (i.e. inference results using predicted outcomes), and the gold standard
(i.e. inference results using observed outcomes). Note in practice we don't have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. To better compare results, we highlight the result of standard errors in red color and t-statistics in blue color.
We first print out the results from the gold standard
. We observe that the standard error for the estimate is 0.20 (red) and the t-statistic is -4.05 (blue).
## show the inference results on validation set using observed outcomes broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
From our bootstrap approach through function postpi()
we get the results with standard error 0.020 (red) and t-statistic -4.93 (blue).
kable(results_postpi) %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
From our derivation approach through function postpi_der()
we get the results with standard error 0.023 (red) and t-statistic -4.74 (blue).
kable(results_der) %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
We observe that the standard errors we obtained using our method are very similar to the gold standard, and the t-statistics from our method is also close to the gold standard. However, if we decide to use predicted RINs as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.017 (red), smaller than the gold standard 0.020; the t-statistic is -6.87, with absolute value larger than the gold standard -4.05.
## show the inference results on validation set without corrections broom::tidy(lm(inf_formula, validation))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
By comparing the inference results using our package and the no correction result, we conclude that our method works well to correct the inference towards the gold standard!
In the above example, we define the inference model for one covariate of interest as predictions ~ region_10
. Our method can also be applied to correct inferences for a covariate adjusting for other covariates. For example, we define an inference model as predictions ~ region_10 + region_20 + region_50
. We repeat step 2-5 in above analyses:
## fit the relationship model on testing set rel_model <- testing %>% select(actual, predictions) %>% postpi_relate(actual) inf_formula <- predictions ~ region_10 + region_20 + region_50 ## fit the inference model on validation set and make iap corrections using bootstrap approach results_postpi <- validation %>% postpi(rel_model, inf_formula) results_der <- testing %>% postpi_der(actual, predictions, validation, inf_formula)
We print out the results from the gold standard
, our method results_postpi
and results_der
, and the no correction
.
## gold standard broom::tidy(lm(update(inf_formula, actual ~ .), validation))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue") ## postpi kable(results_postpi) %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue") ## postpi_der kable(results_der) %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue") ## no correction broom::tidy(lm(inf_formula, validation))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
Again, we observe that we obtain more accurate standard errors (red), and t-statistics (blue) using our method results_postpi
and results_der
compared to the no correction
approach.
In this example, we use a data set TISSUEdata
available in the package. TISSUEdata
contains a column of observed tissue types (breast / adipose tissues) named actual
, a column of predicted tissue types named predictions
, two columns of the probabilities of each predicted tissue types named Breast
and Adipose Tissue
obtained from a previous trained data set, and 2281 columns of gene expression regions. We want to study associations between breast / adipose tissue types and gene expression levels. A detailed description of the TISSUEdata
data set is available at our paper Post-prediction inference.
TISSUEdata
, clean it and make the class of observed and predicted tissue type columns (actual
and predictions
) to be factor. Then we split the clean data set into testing and validation sets. data("TISSUEdata") TISSUE_data <- TISSUEdata colnames(TISSUE_data)[colnames(TISSUE_data) == "Adipose Tissue"] <- "Adipose_Tissue" TISSUE_data$predictions <- as.character(TISSUE_data$predictions) TISSUE_data$actual <- as.character(TISSUE_data$actual) TISSUE_data[TISSUE_data == "Adipose Tissue"] <- "Adipose_Tissue" TISSUE_data$actual <- as.factor(TISSUE_data$actual) TISSUE_data$predictions <- as.factor(TISSUE_data$predictions) ## split the data into testing and validation sets using rsample package set.seed(2019) data_split <- rsample::initial_split(TISSUE_data, prop = 1/2) testing <- rsample::training(data_split) validation <- rsample::testing(data_split)
postpi_relate()
to estimate a relationship model between observed outcomes and predicted probabilities named rel_model
.# fit the relationship model on testing set rel_model <- testing %>% select(actual, Adipose_Tissue, Breast) %>% postpi_relate(actual)
predictions ~ region_200
such that we want to relate gene expression levels in region 200 to predicted tissue types. Then we pass in the validation set, the defined inference formula, and the relationship model rel_model
estimated above to the inference function postpi()
. In postpi()
we estiamte inference results using a bootstrap approach and we obtain the results in a tidy table format named results_postpi
. In this example the data set we use contains categorical data, so we use bootstrap approach only.inf_formula <- predictions ~ region_200 ## fit the inference model on validation set and make iap corrections using bootstrap approach results_postpi <- validation %>% postpi(rel_model, inf_formula)
results_postpi
on validation set from a bootstrap approach. We compare it to the no correction approach (i.e. inference results using predicted outcomes), and the gold standard (i.e. inference results using observed outcomes). Note in practice we don't have observed outcomes for all samples so we perform inferences on predicted outcomes. In this example, we reserved the observed outcomes on validation set only for comparison purposes. Again we highlight standard errors in red color and t-statistics in blue color.
We first print out the results from the gold standard
. We observe that the standard error for the estimate is 0.25 (red) and the t-statistic is 3.80 (blue).
## show the inference results on validation set using observed outcomes broom::tidy(glm(update(inf_formula, actual ~ .), validation, family = binomial(link = "logit")))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
From our bootstrap approach through function postpi()
we get the results with standard error 0.23 (red) and t-statistic 3.60 (blue).
kable(results_postpi) %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
We observe that the standard error and t-statistic we obtained using our method are close to the gold standard. However, if we decide to use predicted tissue types as the outcome variable directly in the subsequant inference analyses and with no corrections to the model estimates, we obtain the following results: the standard error is 0.20 (red), smaller than the gold standard 0.25; the t-statistic is 5.11, with absolute value larger than the gold standard 3.80.
## show the inference results on validation set without corrections broom::tidy(glm(inf_formula, validation, family = binomial(link = "logit")))[-1,] %>% kable() %>% kable_styling("striped", full_width = F) %>% column_spec(3, bold = T, color = "red") %>% column_spec(4, bold = T, color = "blue")
By comparing the inference results using our package and the no correction result, again we conclude that our method works well to correct the inference towards the gold standard!
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.