simulate_posterior_obs: Simulate 'observed' values and prediction uncertainty from a...

View source: R/simulate_from_gam.R

simulate_posterior_obsR Documentation

Simulate 'observed' values and prediction uncertainty from a Gaussian bam or gam model

Description

This function enables the user to simulate new 'observed' values from a Gaussian bam or gam model, including for models with an AR1 structure. There are three main options. The first option is for the user to simulate a single realisation from a model. Under this scenario (type = "snapshot"), the function samples a vector of observed values from a Gaussian distribution with a mean vector equal to the expected values and a standard deviation equal to the estimated sigma parameter (i.e. the standard deviation of residuals), accounting for autocorrelation in the sampling process if necessary. One reason as to why this is useful is to investigate whether the estimate of the residual standard deviation is appropriate: while the simulated time series will not reproduce the observed time series, the simulated time series should approximately reproduce the observed variation in the time series. However, this method only provides a single 'snapshot' realisation of a model, which does not capture uncertainty or calculate prediction intervals. When type = "envelope", the function calculates the posterior distribution of predicted values in a two stage process which accounts for (a) uncertainty in the mean (possibly via uncertainty in the underlying fitted coefficients, see below) and (b) uncertainty in the predictions due to variation around expected values. To calculate a posterior distribution for the mean (i.e., expected values), there are two methods. Under mu_method = 1, the assumption is that the uncertainty in the expected values can be characterised by a Gaussian distribution. For each observation, n possible mean values are sampled from a Gaussian distribution with a mean equal to the expected value for that observation and a SD equal to the estimated SE around that mean for that observation. mu_method = 2 is more sophisticated. Under mu_method = 2, uncertainty in the fitted coefficients is characterised by a multivariate normal distribution, with mean vector given by fitted coefficients and a covariance matrix equal to their (Bayesian) covariance matrix. For each observation, the function then samples n new fitted coefficient vectors from this multivariate distribution. Each sample is combined with the linear predictor matrix to generate a single simulated mean for each observation. This method is implemented by simulate_posterior_mu. With a posterior distribution for expected values, the function then samples n new 'observed' values using expected values, accounting for autocorrelation if necessary. The function returns the posterior distribution of simulated observed values, either in full (i.e., a matrix with a row for each observation and n realised values for new predicted values for that observation), a summary computed by summarise_posterior (i.e., for each observation, the mean of the posterior distribution and a lower and upper quantile), or both.

Usage

simulate_posterior_obs(
  model,
  newdata,
  rho,
  ind = NULL,
  type = "snapshot",
  mu_method = 1,
  seed = NULL,
  n,
  return = "both",
  probs = c(0.025, 0.975),
  summary_format = "list"
)

Arguments

model

A model from which to simulate (see predict.gam).

newdata

A dataframe from which to make predictions (see predict.gam).

rho

A numeric value which defines the rho value from an bam model. For models without an AR1 structure, the default value (rho = 0) is appropriate.

ind

A logical vector, of the same length as the number of rows in newdata, which defines independent sections of time series (see bam and the AR.start argument).

type

A character input which defines the type of simulation. Options are: (1) "snapshot", which returns a single realisation of the model; or (b) "envelope", which performs multiple simulations and returns a posterior distribution of outcomes.

mu_method

A numeric input (1 or 2) which defines the method used to simulate the mean, as described in the function Description. mu_method = 2 is recommended.

seed

A numeric value to set the seed.

n

A numeric value which defines the number of simulations.

return

A numeric vector which defines the quantiles of the posterior distribution to return as the lower and upper confidence intervals, if return = "summary" or return = "both" (see summarise_posterior).

probs

A numeric vector which defines the quantiles of the posterior distribution to return as the lower and upper confidence intervals, if return = "summary" or return = "both" (see summarise_posterior).

summary_format

A character input which defines the format of the summary distribution, if return = "summary" or return = "both" (see summarise_posterior).

Details

The function makes two assumptions which may cause a slight underestimate in the 'true' uncertainty: both the model's sigma and, if applicable, the inputted rho are fixed (i.e. assumed to be known). These parameters could also be treated as uncertain, with their uncertainty characterised by probability distributions.

Value

The function returns a posterior distribution matrix for predictions, a summary of these predictions or both.

Author(s)

Edward Lavender

Examples

#####################################################
#### Simulate some data and fit a GAM

#### Simulation and model fitting
set.seed(1)
nobs <- 100
x <- stats::runif(nobs, 0, 1000)
mu <- 0.001 * x^2
y <- rnorm(nobs, mu, 100)
d <- data.frame(x = x, y = y)
m1 <- mgcv::gam(y ~ s(x), data = d)

#### We can use simulate_posterior_mu to add a fitted line and CIs from posterior simulation
# ... This shows our confidence in the fitted line, as exemplified below.
nd <- data.frame(x = seq(min(d$x), max(d$x), length.out = 100))
sim_mu <-
  simulate_posterior_mu(
    model = m1,
    newdata = nd,
    n = 1000,
    return = "summary")
plot(d$x, d$y)
names(sim_mu)[1] <- "fit"
prettyGraphics::add_error_envelope(x = nd$x, ci = sim_mu)
# But what about our confidence in predictions? This is where
# ... simulate_posterior_obs() comes in.

#####################################################
#### Use of simulate_posterior_obs() (1):
# .. Simulate 'snapshots' of 'observed' values

#### Example (1): Simulate a single realisation of 'observed values'
sim_pos1 <-
  simulate_posterior_obs(
    model = m1,
    newdata = d,
    rho = 0,
    type = "snapshot")
# The function returns a dataframe with 'fit', 'se_fit' and 'obs_sim'.
# The function calculates fitted values and SEs from the model and then, assuming a Gaussian model,
# ... samples a single realisation of 'observed' values from a Gaussian distribution with a mean
# ... equal to the fitted values and an SD equal to the esimated SD
# ... 'obs_sim' are the simulated observed values.
utils::str(sim_pos1)
points(d$x, sim_pos1$obs_sim, col = "red")


#### Example (2): The function can simulate realisations accounting for autocorrelation:
# simulate some autocorrelated observations following an AR1 structure
d$yAR <- mu + arima.sim(list(order = c(1,0,0), ar = 0.95), n = nrow(d), sd = sigma_arima(0.95, 100))
# use mgcv::bam() to estimate the AR1 parameter and
# ... then implement a model accounting for autocorrelation
# with sufficient data here, the AR1 estimated using the ACF is quite accurate:
mAR0 <- mgcv::bam(yAR ~ s(x), data = d)
AR1 <- stats::acf(stats::resid(mAR0))$acf[2]; AR1
mAR1 <- mgcv::bam(yAR ~ s(x), rho = AR1, data = d)
# mAR1 model effectively captured autocorrelation:
stats::acf(mAR1$std.rsd)$acf[2];
# Note that we've simulated data such that the sigma of a model with/without
# ... autocorrelation is the same (or similar given random variation):
stats::sigma(m1); stats::sigma(mAR1)
# Simulate a single realised sequence of 'observations from this model,
# ... accounting for the estimated AR1 parameter
# ... (This is assumed to be fixed.)
sim_pos2 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = d,
    rho = AR1,
    ind = NULL,
   type = "snapshot")
# Examine a time series of uncorrelated observations, correlated observations and
# ... simulated observations which include autocorrelation:
d$index <- 1:nrow(d)
pp <- par(mfrow = c(2, 3))
# Initially simulated y values shouldn't show evidence of autocorrelation,
# ... while simulated yAR values/observations from the model should show autocorrelation
plot(d$index[1:100], d$y[1:100], type = "l")
plot(d$index[1:100], d$yAR[1:100], type = "l")
plot(d$index[1:100], sim_pos2$obs_sim[1:100], type = "l")
stats::acf(stats::resid(mgcv::bam(y ~ s(d$x))))$acf[2]
stats::acf(stats::resid(mgcv::bam(sim_pos2$obs_sim ~ s(d$x))))$acf[2]
par(pp)


#####################################################
#### Use of simulate_posterior_obs() (2):
# .. Use simulation to create prediction intervals

#### Introduction
# The above approach is useful in quickly demonstrating how
# ... well our model captures the observed variation, but it only represents
# ... a rough-and-ready 'snapshot' of predictions. Often, we want a prediction interval instead,
# ... computed across many simulations.

#### Example (3): Simulate the full posterior distribution using mu_method = 1
# When mu_method = 1, the expected value, on the scale of the response, is assumed to be normally
# ... distributed around the true value. For each observation, n simulated values for the expected
# ... value are drawn. Then, for each one of those, an 'observed' value is simulated from
# ... the expected value (accounting for autocorrelation).
sim_pos3 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = d,
    rho = AR1,
    ind = NULL,
    type = "envelope",
    mu_method = 1,
    n = 250,
    return = "full")
# The function returns the full posterior matrix, i.e., a matrix
# ... in which the rows refer to observations and each column
# ... stored an observed value from one simulation.
utils::str(sim_pos3)

#### Example (4): Simulate the full posterior distribution using mu_method = 2
# When mu_method = 2, the expected value, on the scale of the response, is simulated using
# ... simulate_posterior_mu(). This approach treats fitted cofficients as realisations
# ... of a multivariate normal distribution. New coefficients are sampled from this distribution,
# ... and combined with the linear predictor matrix to generate expected values. As above,
# ... observed values are then simulated from expected values.
sim_pos4 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = d,
    rho = AR1,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "full")
utils::str(sim_pos4)

#### Example (5): The full posterior distribution can be summarised to create a prediction interval
# ... from either mu_method = 1 or mu_method = 2
# ... by defining 'return = "summary" and, optionally, arguments passed to summarise_posterior()
# ... i.e., the CIs and the summary_format.
# Simulate prediction intervals accounting for AR using mu_method 2:
sim_pos5 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = nd, # predict from a regular sequence of values to add CIs
    rho = AR1,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "summary") # return a summary (default returns a list of 95% CIs which can be plotted)
# The function returns a list, with three elements that can be plotted using add_error_envelope()
utils::str(sim_pos5)
# Define a plot, using pretty_axis() to generate suitable axes that account for predictions:
names(sim_pos5)[1] <- "fit"
axis_ls <-
  prettyGraphics::pretty_axis(side = 1:2,
                           x = list(range(d$x),
                                    range(c(sim_pos5$fit,
                                            sim_pos5$lowerCI,
                                            sim_pos5$upperCI)
                                            )
                                          ),
                           pretty = list(n = 5), add = FALSE)
# Create plot and add axes and prediction intervals
plot(d$x, d$yAR, xlim = axis_ls[[1]]$lim, ylim = axis_ls[[2]]$lim, axes = FALSE)
prettyGraphics::pretty_axis(axis_ls = axis_ls, add = TRUE)
prettyGraphics::add_error_envelope(nd$x, sim_pos5)
# Compare prediction interval to confidence interval for mu,
# ... which is much narrower:
sim_mu_AR1 <-
  simulate_posterior_mu(
    model = mAR1,
    newdata = nd,
    n = 250,
    return = "summary")
names(sim_mu_AR1)[1] <- "fit"
prettyGraphics::add_error_envelope(x = nd$x, ci = sim_mu_AR1,
                                   add_ci = list(col = "dimgrey",
                                                 border = FALSE))

#### Examine the effects of accounting for autocorrelation on CIs and prediction intervals

# Let's compare a prediction interval accounting for autocorrelation to one which
# ... doesn't account for autocorrelation. The interval which does not include
# ... autocorrelation is similar in this case.
# Simulate the full posterior prediction matrix from model mAR1 with/without accounting rho:
sim_pos6_AR0 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = nd,
    rho = 0,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "full")
sim_pos6_AR1 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = nd,
    rho = AR1,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "full")
# For a single simulation scenario, examine the ACF
# This shows that the residuals are not correlated under the first scenario,
# ... but they are correlated under the second scenario in which observations were simulated
# ... with an AR1 process:
sim_pos6_mAR0 <- mgcv::bam(sim_pos6_AR0[1:100, 1] ~ s(nd$x))
sim_pos6_mAR1 <- mgcv::bam(sim_pos6_AR1[1:100, 1] ~ s(nd$x))
pp <- par(mfrow = c(1, 2))
stats::acf(stats::resid(sim_pos6_mAR0))
sim_pos6_rho <- stats::acf(stats::resid(sim_pos6_mAR1))$acf[2]
par(pp)
# For confidence intervals, accounting for autocorrelation usually makes CIs wider,
# ... but about for prediction intervals?
# summarise intervals
sim_pos6_AR0_PI <- summarise_posterior(sim_pos6_AR0)
sim_pos6_AR1_PI <- summarise_posterior(sim_pos6_AR1)
names(sim_pos6_AR0_PI)[1] <- "fit"
names(sim_pos6_AR1_PI)[1] <- "fit"
# Plot
plot(d$x, d$yAR, cex = 0.1, col = "grey")
prettyGraphics::add_error_envelope(nd$x, sim_pos6_AR0_PI,
                                   add_ci = list(col = "red",
                                                border = FALSE),
                                   add_fit = NULL
)
prettyGraphics::add_error_envelope(nd$x, sim_pos6_AR1_PI,
                                   add_ci = list(col = "blue",
                                                 border = FALSE),
                                   add_fit = NULL
)
# Prediction intervals are very similar. This is because the overall residual variation
# ... is similar inm both cases, which we can see if we compare the two original models:
stats::sigma(mAR0); stats::sigma(mAR1)
# The difference is that in the AR model, the residual variation
# ... is not all random, but some is 'deterministic'. This matters for uncertainty in the mean,
# ... but, once you've accounted for this, the observed variation around the mean is similar.

#### Example (6) Comparison of mu_method = 1 and mu_method = 2.
# Implement posterior simulation:
sim_pos7_mu1 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = nd,
    rho = AR1,
    type = "envelope",
    mu_method = 1,
    n = 250,
    return = "summary")
sim_pos7_mu2 <-
  simulate_posterior_obs(
    model = mAR1,
    newdata = nd,
    rho = AR1,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "summary")
# Define axes:
axis_ls <-
  prettyGraphics::pretty_axis(side = 1:2,
                           x = list(range(d$x),
                                    range(
                                      c(sim_pos7_mu1$fit,
                                        sim_pos7_mu1$lowerCI,
                                        sim_pos7_mu1$upperCI,
                                        sim_pos7_mu2$fit,
                                        sim_pos7_mu2$lowerCI,
                                        sim_pos7_mu2$upperCI))
                                    ),
                           pretty = list(n = 5), add = FALSE)
# Plot prediction intervals:
plot(d$x, d$yAR,
     xlim = axis_ls[[1]]$lim, ylim = axis_ls[[2]]$lim,
     axes = FALSE, cex = 0.1,
     col = "grey")
prettyGraphics::pretty_axis(axis_ls = axis_ls, add = TRUE)
names(sim_pos7_mu1)[1] <- "fit"
names(sim_pos7_mu2)[1] <- "fit"
prettyGraphics::add_error_envelope(nd$x, sim_pos7_mu1,
                                   add_ci = list(col = "red", border = FALSE))
prettyGraphics::add_error_envelope(nd$x, sim_pos7_mu2,
                                   add_ci = list(col = "skyblue",
                                                 border = FALSE))
# similar results in this case.

#### Problematic example 1
# One of the reasons as to why posterior simulation of prediction intervals is useful
# ... is because a model which explains a small amount of variation may still produce
# ... useful insights if the error distribution captures the unexplained variation properly.
# ... To investigate this, let's compare the results of two models, one in which the
# ... model is properly specified and another in which we introduce heterogeneity:

#### Simulate data
# Simulate data so that the response (y2.1 or y2.2 is weakly related to a covariate,
# ... but in one case, we'll force heterogeneity):
x2 <- x
y2.1 <- rnorm(nrow(d), x2, 1000)
d2 <- data.frame(index = 1:length(x2), x2 = x2, y2.1 = y2.1)
d2$y2.2 <- rnorm(nrow(d2), x2, 1.725 * x2)
var(d2$y2.1); var(d2$y2.2) # overall variation is similar in both cases

#### Visualise simulated data:
# y2.2 ~ x2 shows heterogeneity:
pp <- par(mfrow = c(1, 2))
plot(d2$x2, d2$y2.1)
plot(d2$x2, d2$y2.2)
par(pp)

#### Models
m2.1 <- mgcv::gam(y2.1 ~ s(x2), data = d2)
m2.2 <- mgcv::gam(y2.2 ~ s(x2), data = d2)

#### Both models show limited amounts of variation
summary(m2.1)$dev.expl
summary(m2.2)$dev.expl

#### Posterior simulation from each model:
nd <- data.frame(x2 = seq(min(d2$x2), max(d2$x2), length.out = 100))
sim_pos_prob1.1 <-
  simulate_posterior_obs(
    model = m2.1,
    newdata = nd,
    rho = 0,
    type = "envelope",
    mu_method = 2,
    n = 250,
   return = "summary")
names(sim_pos_prob1.1)[1] <- "fit"

sim_pos_prob1.2 <-
  simulate_posterior_obs(
    model = m2.2,
    newdata = nd,
    rho = 0,
    type = "envelope",
    mu_method = 2,
    n = 250,
    return = "summary")
names(sim_pos_prob1.2)[1] <- "fit"


#### Plots: compare simulations of two models
pp <- par(mfrow = c(1, 2))
# Plot (1):
# Define axes:
axis_ls <-
  prettyGraphics::pretty_axis(side = 1:2,
                           x = list(range(d2$x2),
                                    range(
                                      c(sim_pos_prob1.1$fit,
                                        sim_pos_prob1.1$lowerCI,
                                        sim_pos_prob1.1$upperCI))
                                    ),
                           pretty = list(n = 5), add = FALSE)
# Plot prediction intervals:
plot(d2$x2, d2$y2.1, xlim = axis_ls[[1]]$lim, ylim = axis_ls[[2]]$lim, axes = FALSE)
prettyGraphics::pretty_axis(axis_ls = axis_ls, add = TRUE)
prettyGraphics::add_error_envelope(nd$x2, sim_pos_prob1.1)

# Plot (2):
# Define axes:
axis_ls <-
  prettyGraphics::pretty_axis(side = 1:2,
                           x = list(range(d2$x2),
                                    range(
                                      c(sim_pos_prob1.2$fit,
                                        sim_pos_prob1.2$lowerCI,
                                        sim_pos_prob1.2$upperCI))
                                        ),
                           pretty = list(n = 5), add = FALSE)
# Plot prediction intervals:
plot(d2$x2, d2$y2.2, xlim = axis_ls[[1]]$lim, ylim = axis_ls[[2]]$lim, axes = FALSE)
prettyGraphics::pretty_axis(axis_ls = axis_ls, add = TRUE)
prettyGraphics::add_error_envelope(nd$x2, sim_pos_prob1.2)
# It is clear that we're not capturing the variation appropriately in one part of the plot.
# This appears as problematic diagnostics.
# ... However, in this case, our model inferences are not affected strongly:
pp <- par(mfrow = c(3, 4))
mgcv::plot.gam(m2.1)
mgcv::gam.check(m2.1, cex = 0.1, col = "grey")
mgcv::plot.gam(m2.2)
mgcv::gam.check(m2.2, cex = 0.1, col = "grey")
par(pp)


edwardlavender/Tools4ETS documentation built on Nov. 29, 2022, 7:41 a.m.