Generating predictions, and confidence intervals on those predictions, for regression models is surprisingly hard. A short list of the challenges:

More broadly:

Note that we will be interested in/concerned about the distribution of the predictors (we have the distribution represented by our sample; we may want to treat this as a sample from an underlying distribution, e.g. multivariate Normal or something else, both for computational convenience and in order to get a smoother prediction) as well as the sampling distribution of the coefficients (which we will often assume is multivariate Normal for convenience). Random-effects coefficients are an interesting case; they are coefficients, but we could also treat them as part of a population sample. (We usually assume that random-effects coefficients are independent of everything else.)

In what follows let's say we have a focal predictor $x_f$ and a set of non-focal predictors $x_{n}$ (and without loss of generality, the focal predictor is first in the vector of predictors).

## Population-based approach

\newcommand{\bX}{{\mathbf X}} \newcommand{\bbeta}{{\boldsymbol \beta}} \newcommand{\boldeta}{{\boldsymbol \eta}}

The most precise (although not necessarily accurate!) way to predict is to condition on a value $F$ of the focal predictor and make predictions for all observations (members of the population) for which $x_f = F$ (or in a small range around $F$ ...). The linear predictor values are $\boldeta = (F,\bX_{{-f}}) \bbeta$. A key point is that the nonlinear transformation involved in these computations is always \emph{one-dimensional}; all of the multivariate computations required are at the stage of collapsing the multidimensional set of predictors for some subset of the population (e.g. all individuals with $x_f = F$) to a one-dimensional distribution of $\eta$.

Once we've got our vector of $\boldeta$ (which is essentially a set of samples from the distribution over $\eta$ for the conditional set), we want a mean, and confidence intervals on the mean, on the response (data) scale, i.e. after back-transforming. Most precisely, the mean is $\int P(\eta') g^{-1}(\eta') \, d\eta'$.
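
A minimal sketch of this computation for a logistic-link model; the data frame `dd`, fitted `glm` object `fit`, focal predictor `age`, and focal value `F0` are all hypothetical names, not objects defined elsewhere in these notes:

```r
## sketch: population-based prediction at a single focal value
## 'dd', 'fit', 'age', 'F0' are hypothetical
sub <- dd[dd$age == F0, , drop = FALSE]            ## or a small bin around F0
eta <- predict(fit, newdata = sub, type = "link")  ## linear predictors
mean(plogis(eta))   ## mean on the response scale: average of g^{-1}(eta)
plogis(mean(eta))   ## naive value (back-transform of the mean linear predictor)
```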

What about the confidence intervals? The limits of the confidence intervals are points, not mean values. In principle, every observation/set of non-focal predictors has a different CI. The predictions are $\bX \bbeta$; the variances of the predictions are $\textrm{Diag}(\bX \Sigma \bX^\top)$ (this can be computed more efficiently since we only want the diagonal; see the sketch below). I'm not sure how to combine these: do we take the distributions of the upper/lower CIs (e.g. Wald intervals using $\pm q \sigma$, where $q$ is an appropriate quantile of the Normal or t distribution) and find their transformed mean values in the same way? If we consider only the uncertainty of the focal predictor, so that the confidence intervals are $\eta \pm q \sigma_f$, we still have the same question of how to condense the set of endpoints.
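
As one possibility (a sketch, not a resolution of the question above): the cheap diagonal computation plus the "back-transform the per-observation Wald limits and average them" option. Here `X`, `beta`, and `Sigma` are hypothetical names for the model matrix of the conditional subset, the coefficient vector, and the coefficient covariance matrix:

```r
## sketch: per-observation prediction variances and Wald limits on the link scale
## X, beta, Sigma are hypothetical (model matrix, coefficients, coefficient covariance)
eta   <- drop(X %*% beta)
vpred <- rowSums((X %*% Sigma) * X)  ## = diag(X %*% Sigma %*% t(X)), without the full matrix
q <- qnorm(0.975)
lwr <- eta - q * sqrt(vpred)
upr <- eta + q * sqrt(vpred)
## one way (not necessarily the right way) to condense the endpoints:
c(lwr = mean(plogis(lwr)), upr = mean(plogis(upr)))
```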

If we compute the individual back-transformed predictions for a poorly sampled/finely spaced set of focal values, we will get a noisy prediction line as the values of the non-focal predictors shift across the focal values. (Simple example: suppose everyone below the median age has wealth index $w_1$, everyone above the median has $w_2$. Then the predicted value will have a discontinuity at the median age.) We can deal with this by taking bigger bins (a form of smoothing), or by post-smoothing the results (by loess or something). The principled form of this would be to assume/recognize that our uneven distribution of observed non-focal predictors actually represents a sample of a distribution that will vary \emph{smoothly} as a function of the focal predictor.
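
A sketch of the binning-plus-post-smoothing version of the prediction curve (same hypothetical `dd`/`fit`/`age` objects as above; the number of bins and the choice of `loess` are arbitrary):

```r
## sketch: binned prediction curve over the focal predictor, then post-smoothing
nbin   <- 20
dd$bin <- cut(dd$age, breaks = nbin)
eta    <- predict(fit, type = "link")
pred   <- tapply(plogis(eta), dd$bin, mean)  ## mean response within each focal bin
xmid   <- tapply(dd$age, dd$bin, mean)       ## bin centres
sm     <- loess(pred ~ xmid)                 ## post-smooth the (possibly noisy) curve
plot(xmid, pred)
lines(xmid, predict(sm))
```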

## Full-normal approximation

Suppose we assume that the full distribution of all of the predictors is multivariate Normal. Then for a fixed value of $x_f = F$ the conditional distribution of the non-focal predictors is also MVN, with a mean and covariance matrix we can get from Wikipedia [!!] (see also lots of answers from this twitter thread, including this blog post). Once we know the mean and variance of the predictors we can directly compute the mean and variance of the linear predictor (the mean is $\bX' \bbeta$ as before; the variance is $\bbeta^\top \Sigma' \bbeta$), and proceed as above.
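
A sketch of this calculation using the standard conditional-MVN formulas; `mu` and `S` (mean and covariance of all predictors, focal predictor first), `F0`, and `beta` (coefficients, focal coefficient first, intercept ignored for simplicity) are hypothetical:

```r
## sketch: full-normal approximation (focal predictor is component 1 of mu/S/beta)
mu_n <- mu[-1] + S[-1, 1] / S[1, 1] * (F0 - mu[1])    ## conditional mean of non-focal predictors
S_n  <- S[-1, -1] - tcrossprod(S[-1, 1]) / S[1, 1]    ## conditional covariance
## mean and variance of the linear predictor eta = F0 * beta_f + x_n' beta_n
eta_mean <- F0 * beta[1] + sum(mu_n * beta[-1])
eta_var  <- drop(t(beta[-1]) %*% S_n %*% beta[-1])
## e.g. exact logit-normal mean/variance on the response scale:
logitnorm::momentsLogitnorm(mu = eta_mean, sigma = sqrt(eta_var))
```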

## Links to MRP

What does MRP do about this? Does the Bayesian averaging over uncertainty somehow smooth over the population distribution? Or do they directly construct an estimate of the population-level distributions of predictors?

## junk

## Closed form integral of Gaussian over logistic curve?

We are interested in the mean of logistic(Gaussian), i.e. $E[g^{-1}(\eta)]$ where $\eta$ is Normally distributed and $g^{-1}$ is the logistic function.

Tried it in Python (and Wolfram Alpha). This just reiterates what we already knew: the logistic-normal mean doesn't have a closed-form solution. (WA can actually do the non-generic case (standard Normal), which sympy can't, but even WA breaks down for the non-standard Normal case.)

```{python integrate, eval=FALSE}
from sympy import Symbol, exp, sqrt, pi, oo, integrate
x = Symbol('x')
## mean of logistic(x) for x ~ standard Normal
integrate(exp(-x**2/2)/sqrt(2*pi)/(1+exp(-x)), (x, -oo, oo))
```

A [CrossValidated question](https://stats.stackexchange.com/questions/45267/expected-value-of-a-gaussian-random-variable-transformed-with-a-logistic-functio) about calculating this transformation (and various interesting series approximations etc., although these are very likely not worth the trouble).  Turns out that if we don't want to use the delta method there is an existing package that codes up the mean & variance of the logistic-normal (exact, up to integration accuracy):

```r
logitnorm::momentsLogitnorm(mu = 2, sigma = 2)
x <- rnorm(1e5, 2, 2)
x <- scale(x)*2 + 2      ## rescale so the sample has exact moments (mean 2, sd 2)
mean(x)
mean(plogis(x))          ## correct value (mean on the response scale)
plogis(mean(x))          ## naive value (back-transform of the mean)
emdbook::deltamethod(1/(1+exp(-x)), x)
## The `delta2` value is supposedly a higher-order Taylor expansion/correction,
## but a little bit of poking around suggests these are bogus/possibly wrong ...
```

The entire `mln()` function definition (i.e. the body of `logitnorm::momentsLogitnorm()`) is not very complicated; we don't really need a package for this ...

```r
## hand-rolled copy of logitnorm::momentsLogitnorm()
mln <- function (mu, sigma, abs.tol = 0, ...) {
    fExp <- function(x) plogis(x) * dnorm(x, mean = mu, sd = sigma)
    .exp <- integrate(fExp, -Inf, Inf, abs.tol = abs.tol, ...)$value
    fVar <- function(x) (plogis(x) - .exp)^2 * dnorm(x, mean = mu, sd = sigma)
    .var <- integrate(fVar, -Inf, Inf, abs.tol = abs.tol, ...)$value
    c(mean = .exp, var = .var)
}
```
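
For example, the hand-rolled version should agree with the packaged one:

```r
mln(mu = 2, sigma = 2)
logitnorm::momentsLogitnorm(mu = 2, sigma = 2)
```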

## Prediction vs effects plots

In which I try to figure out what the hell we're actually trying to do.

Notation:

\newcommand{\nset}[1]{#1_{{n}}} \newcommand{\yref}{y_{\textrm{ref}}} \newcommand{\cdist}{{D(\nset{x}|x_f)}} \newcommand{\cdistprime}{{D(\nset{x}|x_{f'})}} \newcommand{\xfprime}{x_{f'}}

The purpose and goal of a prediction plot seems fairly straightforward; for specified values of (a) focal predictor(s), we want to give a point estimate and confidence intervals for the prediction of the model for a "typical" (= random sample over the multivariate distribution of non-focal parameters) individual with those values of the predictors ("what is the expected probability of having clean water for a 25-year-old male"?). These are most of the problems we have actually been addressing with our bias correction stuff above. To do these computations, we need to take the values of the non-focal predictors (or their means and covariances) from the conditional distribution. (If the focal predictors are discrete, we condition on exact values; if they are continuous/have mostly unique values, we condition on appropriately sized bins.) In other words, $$ \underset{\cdist}{\textrm{mean}} g^{-1} \left(\eta(\nset{x},x_f) \right) $$

In contrast, an effects plot is supposed to show the marginal effect (in the economics/differential sense) of varying the focal parameter(s). This means, technically, that we want the derivative (and its uncertainty) with respect to the focal parameters to match the derivative predicted by the model.

$$ \overline{\frac{d \mu}{dx_f}}(\xfprime) = \int_\cdistprime \frac{dg^{-1}(\eta)}{d\eta} \hat \beta_f \, d \nset{x} $$

$\hat \beta_f$ is linear in $\nset{x}$, so nonlinear averaging doesn't affect it, but we do have to worry about weighting the different values of $\hat \beta_f$ by the corresponding values of $dg^{-1}(\eta)/d\eta$ (since $\eta$ also depends on $\nset{x}$).

It seems that the appropriate computed value (y-axis height) of the effects plot at a given value of $x_f$ is

$$ \yref + \iint_{\cdistprime} \frac{dg^{-1}(\eta(\xfprime, \nset{x}))}{d\eta} \hat \beta_f \, d \nset{x} \, d \xfprime $$

In the case without interactions $\hat \beta_f = \beta_f$ is a constant and we can pull it out of the integrals, but unless the distribution of the non-focal parameters is independent of the focal parameter(s) it looks like we might have to do some relatively complicated stuff.
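
A rough sketch of this computation for a logistic model without interactions (so $\hat \beta_f = \beta_f$), using crude binning for the conditional averaging and a rectangle rule for the outer integral; `dd`, `fit`, the focal predictor `age`, and the reference height `yref` are hypothetical:

```r
## sketch: effects-plot height by integrating the averaged derivative over x_f
beta_f <- coef(fit)["age"]
nbin   <- 20
dd$bin <- cut(dd$age, breaks = nbin)
eta    <- predict(fit, type = "link")
## average marginal effect within each bin; dlogis(eta) is dg^{-1}(eta)/d eta for the logit link
dmu  <- tapply(dlogis(eta) * beta_f, dd$bin, mean)
xmid <- tapply(dd$age, dd$bin, mean)
## crude rectangle-rule integration over the focal predictor, starting from yref
height <- yref + cumsum(c(0, diff(xmid)) * dmu)
plot(xmid, height, type = "l")
```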

In practice to implement this I would suggest:

(BB prefers this, JD says let's not do it now)

## Alternatives

I think the key is to start by asking "what is the effects plot supposed to represent?"; I claim it's supposed to represent (in some way) the marginal (economic-sense) effect of the focal parameter on typical individuals at a given value of the focal parameter. It might be less confusing to plot the effect itself (e.g. put d(probability)/d(age) on the y-axis, rather than probability) — it would eliminate one of the integrals — but that's not the convention.


