library(dplyr) library(ggplot2) library(ciTools) library(MASS) library(arm) set.seed(20171102)
In this vignette we will discuss the current ability of ciTools
to handle generalized linear models. Small simulations will be
provided in addition to examples that show how to use ciTools
to
quantify uncertainty in fitted values. Primarily we focus on the
Logistic and Poisson models, but ciTools
's method for handling GLMs
is not limited to these models.
Note that the binomial logistic regression is handled in a separate vignette.
Generalized linear models are an extension of linear models that seek to accommodate certain types of non-linear relationships. The manner in which the non-linearity is addressed also allows users to perform inferences on data that are not strictly continuous. GLMs are the most common model type that allow for a non-linear relationship between the response variable $y$ and covariates $X$. Recall that linear regression directly predicts a continuous response from the linear predictor, $X \beta$. A GLM extends this linear prediction scheme. A GLM consists of:
A linear predictor $X \beta$.
A monotonic and everywhere differentiable link function $g$, which transforms the linear predictor: $\hat{y} = g^{-1}(X \hat{\beta})$.
A response distribution: $f(y|\mu)$ from the exponential family with expected value $\mu = g^{-1} (X \beta)$.
Other components, such as over dispersion parameters and off-set terms are possible, but not common to all GLMs. The most common GLMs in practice are the Logistic model (Bernoulli response with logit link) and the Poisson model with log link function. These are detailed below for convenience.
Logistic Regression: $$ \begin{equation} \begin{split} & y|X \sim \mathrm{Binomial}(1, p) \ & g(p) = \log \left( \frac{p}{1-p} \right) = X\beta \ & \mathbb{E}[y|X] = p = \frac{\exp(X \beta)}{ 1 + \exp(X \beta)} \end{split} \end{equation} $$
Poisson Regression with the $\log$ link function: $$ \begin{equation} \begin{split} & y|X \sim \mathrm{Poisson}(\lambda) \ & g(\lambda) = \log \left( \lambda \right) = X\beta \ & \mathbb{E}[y|X] = \lambda = \exp(X \beta) \end{split} \end{equation} $$
Due to the variety of options available, fitting generalized linear
models is more complicated than fitting linear models. In R, glm
is the starting point for handling GLM fits, and is currently the
only GLM fitting function that is supported by ciTools
. We can use
ciTools
in tandem with glm
to fit and analyze Logistic, Poisson,
Quasipoisson, Gamma, Guassian and certain other models.
ciTools
methods for GLMsUnlike linear models, interval estimates pertaining to GLMs generally
do not have clean, parametric forms. This is problematic because from
a computational point of view we would prefer a solution that is fast
and relatively simple. Parametric interval estimates are available in
certain cases, and wherever available, ciTools
will choose to
implement them by default. Below we detail precisely which computations
ciTools
performs when one of the core functions (add_ci
, add_pi
,
add_probs
, add_quantile
) is called on an object of class glm
.
For any model fit by glm
, add_ci()
may compute confidence
intervals for predictions using either a parametric method or a
bootstrap. The parametric method computes confidence intervals on the
scale of the linear predictor $X \beta$ and transforms the intervals
to the response level through the inverse link function
$g^{-1}$. Confidence intervals on the linear predictor level are
computed using a Normal distribution for Logistic and Poisson
regressions or a $t$ distribution otherwise. (This is consistent with
the default behavior for the predict.glm
function.) The intervals
are given by the following expressions:
$$ \begin{equation} g^{-1}\left(x'\hat{\beta} \pm z_{1 - \alpha/2} \sqrt{\hat{\sigma}^2x'(X'X)^{-1} x}\right) \end{equation} $$
for Binomial and Poisson GLMs or
$$ \begin{equation} \label{eq:glmci} g^{-1}\left(x'\hat{\beta} \pm t_{1 - \alpha/2, n-p-1} \sqrt{\hat{\sigma}^2x'(X'X)^{-1} x}\right) \end{equation} $$
for other generalized linear models. In these expressions, we regard
$X$ as the model matrix from the original fit, and $x$ as the "new
data" matrix. The default method is parametric and is called with
add_ci(data, fit, ...)
. This is the method we generally recommend
for constructing confidence intervals for model predictions.
The bootstrap method is called with add_ci(df, fit, type = "boot",
...)
and was included originally for making comparisons against the
parametric method. There are multiple methods of bootstrap for
regression models (resampling cases, resampling residuals, parametric,
etc.). The bootstrap method employed by ciTools
in add_ci.glm()
resamples cases and iteratively refits the model (using the default
behavior of boot::boot
) to determine confidence intervals. After
collecting the bootstrap replicates, a bias-corrected and accelerated
(BCa) bootstrap confidence interval is formed for each point in the
sample df
.
Although there are several methods for computing bootstrap confidence
intervals, we don't provide options to compute all of these
types of intervals in ciTools
. BCa intervals are slightly larger than
parametric intervals, but are less biased than other types of
bootstrapped intervals, including percentile based intervals. We may
consider adding more types of bootstrap intervals to ciTools
in the
future.
For comparison, we show an example of the confidence intervals for the probability estimates of a Logistic regression model.
x <- rnorm(100, mean = 5) y <- rbinom(n = 100, size = 1, prob = invlogit(-20 + 4*x)) df <- data.frame(x = x, y = y) fit <- glm(y ~ x, family = binomial)
We use ciTools
to compute the two types of confidence intervals,
then we stack the dataframes together.
df1 <- add_ci(df, fit, names = c("lwr", "upr"), alpha = 0.1) %>% mutate(type = "parametric") df2 <- add_ci(df, fit, type = "boot", names = c("lwr", "upr"), alpha = 0.1, nSims = 500) %>% mutate(type = "bootstrap") df <- bind_rows(df1, df2)
ggplot(df, aes(x = x, y = y)) + ggtitle("Logistic Regression", subtitle = "Model fit (black line) and confidence intervals (gray)") + geom_jitter(height = 0.01) + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4) + geom_line(aes(x =x , y = pred), size = 2) + facet_grid(~type)
Our two confidence interval methods mostly agree, but the bootstrap method produces slightly wider intervals.
df3 <- filter(df, type == "parametric") df4 <- filter(df, type == "bootstrap") %>% rename(lboot = lwr, uboot = upr) %>% bind_cols(df3)
Another perspective on the difference between these two interval calculation methods is shown below. It's a fairly clear that the BCa intervals (red) indeed exhibit little bias, but are not as tight as the parametric intervals (purple). This is expected behavior because the bootstrap confidence intervals don't "know" about the model assumptions. In practice, if the sample size is small or the model is false, these interval estimates may exhibit more disagreement.
ggplot(df4, aes(x = x, y = y)) + ggtitle("Logistic Regression") + geom_jitter(height = 0.01) + geom_ribbon(aes(ymin = lboot, ymax = uboot), alpha = 0.4, fill = "red") + geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.4, fill = "royalblue") + geom_line(aes(x = x , y = pred...3), size = 2)
If the sample size increases, we will find that the two estimates increasingly agree and converge to $0$ in width. Note that we do not calculate prediction intervals for $y$ when $y$ is Bernoulli distributed because the support of $y$ is exactly ${0,1}$.
Generally, parametric prediction intervals for GLMs are not
available. The solution ciTools
takes is to perform a parametric
bootstrap on the model fit, then take quantiles on the bootstrapped
data produced for each observation. The procedure is performed via
arm::sim
. The method of the parametric bootstrap is described by
the following algorithm:
Fit the GLM, and collect the regression statistics $\hat{\beta}$ and $\hat{\mathrm{Cov}}(\hat{\beta})$. Set the number of simulations, $M$.
Simulate $M$ draws of the regression coefficients, $\hat{\beta}_{*}$, from $N(\hat{\beta}, \hat{\mathrm{Cov}}(\hat{\beta}))$, where $\hat{\mathrm{Cov}}(\hat{\beta}) = \hat{\sigma}^2 (X'X)^{-1}$.
Simulate $[y_{}|x]$ from the response distribution with mean $g^{-1}(x \hat{\beta}_{})$ and a variance determined by the response distribution.
Determine the $\alpha/2$ and $1-\alpha/2$ quantiles of the simulated response $[y_{*}|x]$ for each $x$.
The parametric bootstrap method propagates the uncertainty in the regression effects $\hat{\beta}$ into the simulated draws from the predictive distribution.
Generally, there are many different ways to calculate the quantiles of
an empirical distribution, but the approach that ciTools
takes
ensures that estimated quantiles lie in the support set of the
response. The choice we make corresponds to setting type = 1
in
quantile()
.
We have seen in our simulations that the parametric bootstrap provides
interval estimates with approximately nominal probability
coverage. The unfortunate side effect of opting to construct
prediction intervals through a parametric bootstrap is that the
parameters of the predictive distributions need to hard coded for each
model. For this reason, ciTools
does not have complete coverage of
all models one could fit with glm
.
One exception to this scheme are GLMs with Gaussian errors. In this
case we may parametrically calculate prediction intervals. Under
Gaussian errors, glm()
permits the use of the link functions
"identity", "log", and "inverse". The corresponding models are given
by the following expressions:
$$ \begin{equation} \label{eq:gauss-link} \begin{split} y = X\beta + \epsilon\ y = \exp(X\beta) + \epsilon \ y = \frac{1}{X\beta} + \epsilon \end{split} \end{equation} $$
The conditional response distribution in each case may be written $y \sim N(g^{-1}(X\beta), \sigma^2)$, and prediction intervals may be computed parametrically. Parametric prediction intervals are therefore constructed via
$$ \begin{equation} \label{eq:gauss-pi} g^{-1}(x'\beta) \pm t_{1-\alpha/2, n-p-1} \sqrt{\hat{\sigma}^2 + \hat{\sigma}^2x'(X'X)^{-1}x} \end{equation} $$
for Gaussian GLMs. As in the linear model, $\hat{\sigma}^2$ estimates the predictive uncertainty and $\hat{\sigma}^2 x(X'X)^{-1}x$ estimates the inferential uncertainty in the fitted values. Note that $g^{-1}(X \hat{\beta})$ is a maximum likelihood estimate for the parameter $g^{-1}(X \beta)$, by the functional invariance of maximum likelihood estimation.
At this point, the only GLMs that we support with add_pi()
are Guassian, Poisson, Binomial, Gamma, and Quasipoisson (all of which
need to be fit with glm()
). The models not supported by add_pi.glm
are
inverse Gaussian, Quasi, and Quasi-binomial.
Poisson regression is usually the first line of defense against
count data, so we wish to present a complete example of quantifying
uncertainty for this type of model with ciTools
. For simplicity
we fit a model on fake data.
We use rnorm
to generate a covariate, but the randomness of $x$ has
no bearing on the model.
x <- rnorm(100, mean = 0) y <- rpois(n = 100, lambda = exp(1.5 + 0.5*x)) df <- data.frame(x = x, y = y) fit <- glm(y ~ x , family = poisson(link = "log"))
As seen previously, the commands in ciTools
are "pipeable". Here,
we compute confidence and prediction intervals for a model fit at
the $90\%$ level. The warning message only serves to
remind the user that precise quantiles cannot be formed for
non-continuous distributions.
df_ints <- df %>% add_ci(fit, names = c("lcb", "ucb"), alpha = 0.1) %>% add_pi(fit, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000) %>% head()
As with other methods available in ciTools
the requested
statistics are computed, appended to the data frame, and returned to
the user as a data frame.
df_ints %>% ggplot(aes(x = x, y = y)) + geom_point(size = 2) + ggtitle("Poisson Regression", subtitle = "Model fit (black line), with prediction intervals (gray), confidence intervals (dark gray)") + geom_line(aes(x = x, y = pred), size = 1.2) + geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) + geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)
Since the response $y$ is count data, and the method we used to
determine the intervals involves simulation, we find that ciTools
will produce "jagged" bounds when all the intervals are plotted
simultaneously. Increasing the number of simulations using the nSims
argument in add_pi
can help reduce some of this unsightliness.
We may also wish to compute response-level probabilities and
quantiles. ciTools
can also handle these estimates with
add_probs()
and add_quantile()
respectively. We use the same parametric
bootstrap approach for estimating quantiles and probabilities that we
employed for add_pi()
. Once again, an error message reminds the user
that their support is not continuous.
df %>% add_probs(fit, q = 10) %>% add_quantile(fit, p = 0.4) %>% head()
A common problem with the Poisson model is the presence of over-dispersion. Recall that for the Poisson model, we require that the variance and mean agree, however this is practically a strict and unreasonable modeling assumption. A quasipoisson model is one remedy: it estimates an additional dispersion parameter and will provide a better fit. Under the quasipoisson assumption
$$ \mathbb{E}[y|X] = \mu = \exp (X \beta) $$
and
$$ \mathbb{V}\mathrm{ar}[y|X] = \phi \mu $$
Quasi models are not full maximum likelihood models, however it is possible to embed a Quasipoisson in the Negative Binomial framework using
$$ \mathrm{QP}(\mu, \theta) = \mathrm{NegBin}(\mu, \theta = \frac{\mu}{\phi - 1}) $$
Where NegBin is the parameterization of the Negative Binomial
distribution used by glm.nb
in the MASS
library. This model for
the negative binomial distribution, a continuous mixture of Poisson
random variables with gamma distributed means, is preferred over
that classical parameterization in applications. The preference
stems from the fact that it allows for non-integer-valued $\theta$.
Warning: As in Gelman and Hill's Data Analysis using Regression
and Multilevel/Hierarchical Model, ciTools
does not simulate the
uncertainty in the over-dispersion parameter
$\hat{\phi}$. According to our simulations, dropping this
uncertainty from the parametric bootstrap has a negligible effect on
the coverage probabilities. While the distribution of $\hat{\phi}$
is asymptotically Normal, it is very likely that the finite sample
estimator has a skewed distribution. Approximating this
distribution for use in a parametric bootstrap is ongoing
research. As it stands, the prediction intervals we form for
over-dispersed models tend to be conservative.
Negative binomial regression (via glm.nb
) is implemented as a
separate method in ciTools
, and is an alternative to quasipoisson
regression. For more information on the difference between these
two models, we recommend Jay Ver Hoef and Peter Boveng's
Quasi-poisson vs. Negative Binomial Regression: How Should We
Model Overdispersed Count Data?
Again, we generate fake data. The dispersion parameter is set to $5$ in the quasipoisson model.
x <- runif(n = 100, min = 0, max = 2) mu <- exp(1 + x) y <- rnegbin(n = 100, mu = mu, theta = mu/(5 - 1))
The data is over-dispersed:
df <- data.frame(x = x, y = y) fit <- glm(y ~ x, family = quasipoisson(link = "log")) summary(fit)$dispersion
But ciTools
can still construct appropriate interval estimates
for the range of a new observation:
df_ints <- add_ci(df, fit, names = c("lcb", "ucb"), alpha = 0.05) %>% add_pi(fit, names = c("lpb", "upb"), alpha = 0.1, nSims = 20000)
ggplot(df_ints, aes(x = x, y = y)) + ggtitle("Quasipoisson Regression", subtitle = "Model fit (black line), with Prediction intervals (gray), Confidence intervals (dark gray)") + geom_point(size = 2) + geom_line(aes(x = x, y = pred), size = 1.2) + geom_ribbon(aes(ymin = lcb, ymax = ucb), alpha = 0.4) + geom_ribbon(aes(ymin = lpb, ymax = upb), alpha = 0.2)
The darker region represents the confidence intervals formed by
add_ci
and the lighter intervals are formed by add_pi
. Again,
intervals are "jagged" because the response the is not continuous
and the bounds are formed through a simulation.
A simulation study was performed to examine the empirical coverage
probabilities of prediction intervals formed using the parametric
bootstrap. We focus on these intervals because we could not find
results in the literature addressing their performance. Our simulation
is not comprehensive, so users of ciTools
should exercise care when
using these methods. In each simulation, a simple $y = g^{-1}(mx +
b)$ model is fit on a variable number of observations.
New observations were generated from the true model to determine if the empirical coverage was close to the nominal level. The mean interval width was also recorded, as were standard errors of the estimated coverage and mean interval width.
Note that in contrast to a study of confidence intervals, we do not expect interval widths to shrink to $0$ as sample size tends to infinity. This is due to the predictive error in the conditional response distribution, which is not a factor in the construction of confidence intervals.
We take the same approach in the simulation study of each of the four models described below:
A Poisson model is fit with the log-link function.
$$ y|x \sim \mathrm{Poisson}(\lambda = \exp(1 + 2 x)), \quad x \in (1, 2) $$
For each sample size, $10,000$ simulations were performed. We find that observed coverage levels are biased conservative by about $0.5\%$. This bias is likely a side effect of the type of quantile we calculate on the bootstrapped data. Interval widths generally decrease with sample size, however there is an increase from sample size $500$ to $1000$, which is within simulation error.
pois <- read.csv("pois_pi_results.csv") %>% dplyr::select(-total_time) %>% rename(nominal_level = level) %>% dplyr::select(sample_size, everything()) knitr::kable(pois)
ggplot(pois, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + ylim(0, 1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Poisson Regression -- P.I. Simulation", subtitle = "Coverage Probabilities on [0,1] scale") + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(pois, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Poisson Regression -- P.I. Simulation", subtitle = "Coverage Probabilities (zoom in)") + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(pois, aes(x=sample_size, y=int_width)) + geom_errorbar(aes(ymin=int_width - 2*int_width_se, ymax = int_width + 2*int_width_se), width=.1) + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + xlab("Sample Size (log scale)") + ylab("Interval Widths +/- 2 SEs") + ggtitle("Poisson Regression -- P.I. Simulation", subtitle = "Interval Widths") + geom_line(size = 1.5) + geom_point(size = 1.5)
A negative binomial model with log link function was fit with dispersion parameter $\theta = 4$.
$$ y|x \sim \mathrm{NegBin}(\mu = \exp(1 + 2x), \theta = 4), \qquad x \in (1, 2) $$
For each sample size, $10,000$ models were fit. Estimated coverage probabilities lie below the nominal level for small sample sizes. Statistical "folklore" recommends against fitting negative binomial models on a small sample, so our observations are in-line with this advice. Even though observed coverage probabilities closely agree with the nominal level, we find that interval width estimates tend to actually increase slightly with sample size, a trend we find concerning given the standard errors of the interval width estimates. Possible explanations of this trend could be our requirement that our prediction intervals are forced to lie in the positive integers, and that we do not simulate the distribution of the dispersion parameter. These results lead us to believe that more study is warranted on this type of interval estimate.
neg_bin <- read.csv("negbin_pi_results.csv") %>% dplyr::select(-total_time) %>% rename(nominal_level = level) %>% dplyr::select(sample_size, everything()) knitr::kable(neg_bin)
ggplot(neg_bin, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + ylim(0, 1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Negative Binomial Regression -- P.I. Simulation", subtitle = "Coverage Probabilities on [0,1] scale") + scale_x_log10(breaks = c(20, 30, 50, 100, 150, 200, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(neg_bin, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Negative Binomial Regression -- P.I. Simulation", subtitle = "Coverage Probabilities (zoom in)") + scale_x_log10(breaks = c(20, 30, 50, 100, 150, 200, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(neg_bin, aes(x=sample_size, y=int_width)) + geom_errorbar(aes(ymin=int_width - 2*int_width_se, ymax = int_width + 2*int_width_se), width=.1) + scale_x_log10(breaks = c(20, 30, 50, 100, 150, 200, 250, 500, 1000, 2000)) + xlab("Sample Size (log scale)") + ylab("Interval Widths +/- 2 SEs") + ggtitle("Negative Binomial Regression -- P.I. Simulation", subtitle = "Interval Widths") + geom_line(size = 1.5) + geom_point(size = 1.5)
A Gamma regression model with "inverse"
link function was fit
and simulated $10,000$ times.
$$ y|x \sim \Gamma(\mathrm{shape} = 5, \mathrm{rate} = \frac{5}{2 + 4x}) \qquad x \in (30, 70) $$
Gamma regression was not discussed in this vignette but is still
supported by ciTools
. Estimated coverage probabilities are generally
close, but consistently slightly below, the nominal level. In contrast
to the results from the Poisson simulation, these coverage
probabilities are within simulation error of the nominal
level. Average interval widths tend to exhibit a high degree of
variation.
gam <- read.csv("gamma_pi_results.csv") %>% dplyr::select(-total_time) %>% rename(nominal_level = level) %>% dplyr::select(sample_size, everything()) knitr::kable(gam)
ggplot(gam, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + ylim(0, 1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Gamma Regression -- P.I. Simulation", subtitle = "Coverage Probabilities on [0,1] scale") + scale_x_log10(breaks = c(100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(gam, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Gamma Regression -- P.I. Simulation", subtitle = "Coverage Probabilities (zoom in)") + scale_x_log10(breaks = c(100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(gam, aes(x=sample_size, y=int_width)) + geom_errorbar(aes(ymin=int_width - 2*int_width_se, ymax = int_width + 2*int_width_se), width=.1) + scale_x_log10(breaks = c(100, 250, 500, 1000, 2000)) + xlab("Sample Size (log scale)") + ylab("Interval Widths +/- 2 SEs") + ggtitle("Gamma Regression -- P.I. Simulation", subtitle = "Interval Widths") + geom_line(size = 1.5) + geom_point(size = 1.5)
In practice, the (non-canonical) log-link function is more
common due to its numerical stability. Since the choice of link
function does not drastically alter our prediction interval procedure,
ciTools
can also handle these types of models.
Gaussian prediction intervals may be determined parametrically by
ciTools
. The model we considered was
$$ y|x \sim \mathcal{N}(\mathrm{mean} = \exp(1 + x), \mathrm{sd} = 1) \qquad x \in (0,1) $$
Since a bootstrap is unnecessary, total running time for the simulation is much faster than for simulations involving other distributions. Coverage probabilities are estimated within a standard error or two of $0.95$ and interval widths generally decrease according to the increased sample size.
norm_log <- read.csv("gaussian_pi_loglink_results.csv") %>% dplyr::select(-total_time) %>% rename(nominal_level = level) %>% dplyr::select(sample_size, everything()) knitr::kable(norm_log)
ggplot(norm_log, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + ylim(0, 1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Gaussian Regression -- P.I. Simulation", subtitle = "Coverage Probabilities on [0,1] scale") + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(norm_log, aes(x=sample_size, y=cov_prob)) + geom_point(size = 2) + geom_line(size = 1.5) + geom_errorbar(aes(ymin=cov_prob - 2*cov_prob_se, ymax = cov_prob + 2*cov_prob_se), width=.1) + xlab("Sample Size (log scale)") + ylab("Coverage Probs +/- 2 SEs") + ggtitle("Gaussian Regression -- P.I. Simulation", subtitle = "Coverage Probabilities (zoom in)") + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + geom_hline(aes(yintercept = 0.95), colour="#BB0000", linetype="dashed", size = 1)
ggplot(norm_log, aes(x=sample_size, y=int_width)) + geom_errorbar(aes(ymin=int_width - 2*int_width_se, ymax = int_width + 2*int_width_se), width=.1) + scale_x_log10(breaks = c(20, 30, 50, 100, 250, 500, 1000, 2000)) + xlab("Sample Size (log scale)") + ylab("Interval Widths +/- 2 SEs") + ggtitle("Gaussian Regression -- P.I. Simulation", subtitle = "Interval Widths") + geom_line(size = 1.5) + geom_point(size = 1.5)
ciTools
is a versatile R package that helps users quantify
uncertainty about their generalized linear models. Creating interval
estimates that are amenable to plotting is now as simple as fitting a
GLM. To date, we provide coverage for many common GLMs used by
practitioners. For the models covered by ciTools
, our simulations
show that our confidence and prediction intervals are trustworthy.
There is still work to be done on this portion of ciTools
. We would like to
Include more parametric methods for prediction intervals.
Add facilities to handle CIs and PIs when offset terms are present in the model fit.
Further study interval estimates pertaining to the negative binomial model.
Produce a simulation study that compares parametric vs. bootstrap intervals for GLMs.
Offer alternative prediction intervals e.g. the shortest intervals that contain $95\%$ of the simulated data.
Include options beyond BCa for creating bootstrap confidence intervals.
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.