```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```
"Could a different reasonable researcher analyzing the same data come to a different conclusion?" This is a question that gets to the heart of whether or not a scientific result can be trusted, yet it's a question that traditional statistical inference has little to say about.
We introduce the hacking interval, which is the range of a numerical scientific result that could be obtained over a set of reasonable dataset and hyperparameter manipulations. Hacking intervals come in two varieties:
(1) *tethered* hacking intervals, which give the range of the result over all models whose loss is within a specified fraction of the base model's loss, and (2) *prescriptively-constrained* hacking intervals, which give the range of the result over an enumerated set of reasonable manipulations (for example, removing an observation, adding a covariate, adding a transformation of a covariate such as `x^2`, and discretizing continuous variables into indicator variables based on quantiles). This package computes tethered and prescriptively-constrained hacking intervals for linear models. We also compute an interval that considers both types of hacking at once; in other words, it is the range of results that could be achieved by a model that fits the data almost as well as any model obtainable via the prescriptive constraints. At most one of the manipulations in the prescriptive constraints is permitted at a time.
```r
# install.packages("devtools")
devtools::install_github("beauCoker/hacking")
```
Start with a dataset. For this demo we'll generate a toy dataset `data`.
```r
set.seed(0)
N <- 50  # Number of observations
data <- data.frame(
  y = rnorm(N),                    # Response variable (continuous)
  w = rbinom(N, 1, .5),            # Treatment variable (binary)
  X = matrix(rnorm(N*3), nrow=N),  # Covariates included in base model
  Z = matrix(rnorm(N*3), nrow=N)   # Covariates excluded from base model
)
```
Next, fit a linear model with `lm`. We'll call this the "base" model.
```r
mdl <- lm(y ~ w + X.1*X.2, data=data)
(beta_0 <- mdl$coefficients['w'])
```
So, the ordinary least squares estimate for the coefficient `beta_0` on the treatment variable `w` in the base model is about `r trunc(beta_0*100)/100`. A standard question in statistics is to ask, "what could happen if I estimated `beta_0` using a different dataset drawn from the same distribution?" This is conceptually what a standard confidence interval tells you. It can be computed with `R`'s built-in `confint` function:
```r
(ci <- confint(mdl)['w',])
```
Now we get to the hacking interval part. What if instead you ask, "what if the scientist that reported this estimate threw out some important observations, or messed with the data in some other way? What's the range of estimates that could have been reported?" This is conceptually what a hacking interval tells you. For linear models, it can be computed with the `hackint_lm` function in our package. The parameter `theta` specifies the fraction of additional loss tolerated, relative to the base model, for the tethered variety of hacking.
```r
library(hacking)
output <- hackint_lm(mdl, data, theta=0.1, treatment = 'w')
```
In the output above, `LB` and `UB` stand for lower bound and upper bound. It says that a tethered hacking interval around the base model is (`r trunc(output$tethered*100)/100`), a prescriptively-constrained hacking interval around the base model is (`r trunc(output$constrained*100)/100`), and a hacking interval that considers both types of hacking is (`r trunc(output$tethered_and_constrained*100)/100`). Notice that each of the tethered intervals is wider than the standard confidence interval, (`r trunc(confint(mdl)['w',]*100)/100`), but note that hacking intervals and standard confidence intervals measure different forms of uncertainty. Either could be larger, and hacking intervals needn't even be centered on the point estimate.
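For intuition, the tethered interval for an OLS coefficient can be sketched in closed form. This is a minimal illustration, assuming the tethered constraint is that the residual sum of squares may grow to at most `(1 + theta)` times the base model's; the package's exact definition may differ.

```r
# Sketch: closed-form tethered interval for the 'w' coefficient, assuming
# the constraint RSS(beta) <= (1 + theta) * RSS(beta_hat). Under that
# constraint the extremes are beta_hat_w +/- sqrt(theta * RSS * [(X'X)^-1]_ww).
set.seed(0)
N <- 50
data <- data.frame(
  y = rnorm(N),
  w = rbinom(N, 1, .5),
  X = matrix(rnorm(N * 3), nrow = N)
)
mdl <- lm(y ~ w + X.1 * X.2, data = data)

theta   <- 0.1
Xmat    <- model.matrix(mdl)          # design matrix of the base model
rss     <- sum(residuals(mdl)^2)      # base model loss
XtX_inv <- solve(crossprod(Xmat))     # (X'X)^{-1}, dimnames preserved
bw      <- unname(coef(mdl)["w"])
halfwidth <- sqrt(theta * rss * XtX_inv["w", "w"])
c(LB = bw - halfwidth, UB = bw + halfwidth)
```

Like a confidence interval, this interval is centered on the point estimate; the intervals reported by `hackint_lm` that also involve prescriptive constraints need not be.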
`hackint_lm` works by enumerating all of the manipulations within the prescriptive constraints and, for each manipulation, computing the ordinary least squares coefficient estimate as well as a tethered hacking interval around this estimate (i.e., where the model under the manipulation is essentially treated as a new base model). This complete list is available as a dataframe, with `Estimate` denoting the coefficient estimate and (`LB`, `UB`) denoting the tethered hacking interval. The prescriptively-constrained hacking interval is the range of `Estimate`, and the type that considers prescriptive constraints and tethering is given by the minimum of `LB` and the maximum of `UB`. This list is useful for diagnosing which manipulations are most impactful. The output is sorted by the largest absolute difference `largest_diff` of any value (`LB`, `Estimate`, or `UB`) from `beta_0`:
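The enumeration idea can be sketched by hand for the observation-removal manipulations alone. This is a simplified illustration of the approach, not the package's actual implementation:

```r
# Sketch: enumerate one family of manipulations (removing a single
# observation), refit the base model under each, and record the
# treatment coefficient. The range of these estimates is the
# observation-removal slice of a prescriptively-constrained interval.
set.seed(0)
N <- 50
data <- data.frame(
  y = rnorm(N),
  w = rbinom(N, 1, .5),
  X = matrix(rnorm(N * 3), nrow = N)
)
mdl <- lm(y ~ w + X.1 * X.2, data = data)
beta_0 <- unname(coef(mdl)["w"])

estimates <- sapply(seq_len(N), function(i) {
  unname(coef(lm(y ~ w + X.1 * X.2, data = data[-i, ]))["w"])
})

hacks <- data.frame(removed = seq_len(N), Estimate = estimates)
hacks <- hacks[order(-abs(hacks$Estimate - beta_0)), ]  # most impactful first
range(hacks$Estimate)
```

Sorting by distance from `beta_0` mirrors how the package's output surfaces the most impactful manipulations first.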
```r
output$hacks_all
```
The optional argument `frac_remove_obs` (default value 1) specifies the fraction of observations that are considered for removal in evaluating prescriptively-constrained hacking intervals. If `frac_remove_obs` is less than 1, then only the observations with the highest Cook's distance are considered for removal. This speeds up computation for large datasets but does not provide any theoretical guarantee of accuracy.
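The candidate-selection step can be sketched as follows, assuming `frac_remove_obs` ranks observations by Cook's distance as described (the variable names here are illustrative, not the package's internals):

```r
# Sketch: keep only the most influential observations as removal candidates.
set.seed(0)
N <- 50
data <- data.frame(
  y = rnorm(N),
  w = rbinom(N, 1, .5),
  X = matrix(rnorm(N * 3), nrow = N)
)
mdl <- lm(y ~ w + X.1 * X.2, data = data)

frac_remove_obs <- 0.2                # consider only 20% of observations
d <- cooks.distance(mdl)              # influence of each observation on the fit
k <- ceiling(frac_remove_obs * N)
candidates <- order(d, decreasing = TRUE)[seq_len(k)]
candidates                            # indices of the most influential rows
```

Restricting enumeration to high-influence observations trades completeness for speed: an excluded observation could, in principle, still be the one whose removal moves the estimate the most.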