Generating predictions, and confidence intervals on those predictions, for regression models is surprisingly hard. A short list of the challenges:

More broadly:

Note that we will be interested in/concerned about the distribution of the predictors (we have the distribution represented by our sample; we may want to treat this as a sample from an underlying distribution, e.g. multivariate Normal or something else, both for computational convenience and in order to get a smoother prediction) as well as the sampling distribution of the coefficients (which we will often assume is multivariate Normal for convenience). Random-effects coefficients are an interesting case; they are coefficients, but we could also treat them as part of a population sample. (We usually assume that random-effects coefficients are independent of everything else.)

In what follows let's say we have a focal predictor $x_f$ and a set of non-focal predictors $x_{n}$ (and without loss of generality, the focal predictor is first in the vector of predictors).

## Population-based approach

\newcommand{\bX}{{\mathbf X}} \newcommand{\bbeta}{{\boldsymbol \beta}} \newcommand{\boldeta}{{\boldsymbol \eta}}

The most precise (although not necessarily accurate!) way to predict is to condition on a value $F$ of the focal predictor and make predictions for all observations (members of the population) for which $x_f = F$ (or in a small range around $F$ ...). The linear predictor values are $\boldeta = (F,\bX_{{-f}}) \bbeta$. A key point is that the nonlinear transformation involved in these computations is always \emph{one-dimensional}; all of the multivariate computations required are at the stage of collapsing the multidimensional set of predictors for some subset of the population (e.g. all individuals with $x_f = F$) to a one-dimensional distribution of $\eta$.

Once we've got our vector of $\boldeta$ (which is essentially a set of samples from the distribution over $\eta$ for the conditional set), we want a mean, and confidence intervals on the mean, on the response (data) scale, i.e. after back-transforming. Most precisely, the mean is $\int P(\eta') g^{-1}(\eta') \, d\eta'$.
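
A minimal sketch of this computation for a logistic-link model; the data frame `dd`, fitted `glm` object `fit`, focal predictor `age`, and focal value `F0` are all hypothetical names, not objects defined elsewhere in these notes:

```r
## sketch: population-based prediction at a single focal value
## 'dd', 'fit', 'age', 'F0' are hypothetical
sub <- dd[dd$age == F0, , drop = FALSE]            ## or a small bin around F0
eta <- predict(fit, newdata = sub, type = "link")  ## linear predictors
mean(plogis(eta))   ## mean on the response scale: average of g^{-1}(eta)
plogis(mean(eta))   ## naive value (back-transform of the mean linear predictor)
```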

What about the confidence intervals? The limits of the confidence intervals are points, not mean values. In principle, every observation/set of non-focal predictors has a different CI. The predictions are $\bX \bbeta$; the variances of the predictions are $\textrm{Diag}(\bX \Sigma \bX^\top)$ (this can be computed more efficiently since we only want the diagonal; see the sketch below). I'm not sure how to combine these: do we take the distributions of the upper/lower CIs (e.g. Wald intervals using $\pm q \sigma$, where $q$ is an appropriate quantile of the Normal or t distribution) and find their transformed mean values in the same way? If we consider only the uncertainty of the focal predictor, so that the confidence intervals are $\eta \pm q \sigma_f$, we still have the same question of how to condense the set of endpoints.
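
As one possibility (a sketch, not a resolution of the question above): the cheap diagonal computation plus the "back-transform the per-observation Wald limits and average them" option. Here `X`, `beta`, and `Sigma` are hypothetical names for the model matrix of the conditional subset, the coefficient vector, and the coefficient covariance matrix:

```r
## sketch: per-observation prediction variances and Wald limits on the link scale
## X, beta, Sigma are hypothetical (model matrix, coefficients, coefficient covariance)
eta   <- drop(X %*% beta)
vpred <- rowSums((X %*% Sigma) * X)  ## = diag(X %*% Sigma %*% t(X)), without the full matrix
q <- qnorm(0.975)
lwr <- eta - q * sqrt(vpred)
upr <- eta + q * sqrt(vpred)
## one way (not necessarily the right way) to condense the endpoints:
c(lwr = mean(plogis(lwr)), upr = mean(plogis(upr)))
```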

If we compute the individual back-transformed predictions for a poorly sampled/finely spaced set of focal values, we will get a noisy prediction line as the values of the non-focal predictors shift across the focal values. (Simple example: suppose everyone below the median age has wealth index $w_1$, everyone above the median has $w_2$. Then the predicted value will have a discontinuity at the median age.) We can deal with this by taking bigger bins (a form of smoothing), or by post-smoothing the results (by loess or something). The principled form of this would be to assume/recognize that our uneven distribution of observed non-focal predictors actually represents a sample of a distribution that will vary \emph{smoothly} as a function of the focal predictor.
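
A sketch of the binning-plus-post-smoothing version of the prediction curve (same hypothetical `dd`/`fit`/`age` objects as above; the number of bins and the choice of `loess` are arbitrary):

```r
## sketch: binned prediction curve over the focal predictor, then post-smoothing
nbin   <- 20
dd$bin <- cut(dd$age, breaks = nbin)
eta    <- predict(fit, type = "link")
pred   <- tapply(plogis(eta), dd$bin, mean)  ## mean response within each focal bin
xmid   <- tapply(dd$age, dd$bin, mean)       ## bin centres
sm     <- loess(pred ~ xmid)                 ## post-smooth the (possibly noisy) curve
plot(xmid, pred)
lines(xmid, predict(sm))
```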

## Full-normal approximation

Suppose we assume that the full distribution of all of the predictors is multivariate Normal. Then for a fixed value of $x_f = F$ the conditional distribution of the non-focal predictors is also MVN, with a mean and covariance matrix we can get from Wikipedia [!!] (see also lots of answers from this twitter thread, including this blog post). Once we know the mean and variance of the predictors we can directly compute the mean and variance of the linear predictor (the mean is $\bX' \bbeta$ as before; the variance is $\bbeta^\top \Sigma' \bbeta$), and proceed as above.
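
A sketch of this calculation using the standard conditional-MVN formulas; `mu` and `S` (mean and covariance of all predictors, focal predictor first), `F0`, and `beta` (coefficients, focal coefficient first, intercept ignored for simplicity) are hypothetical:

```r
## sketch: full-normal approximation (focal predictor is component 1 of mu/S/beta)
mu_n <- mu[-1] + S[-1, 1] / S[1, 1] * (F0 - mu[1])    ## conditional mean of non-focal predictors
S_n  <- S[-1, -1] - tcrossprod(S[-1, 1]) / S[1, 1]    ## conditional covariance
## mean and variance of the linear predictor eta = F0 * beta_f + x_n' beta_n
eta_mean <- F0 * beta[1] + sum(mu_n * beta[-1])
eta_var  <- drop(t(beta[-1]) %*% S_n %*% beta[-1])
## e.g. exact logit-normal mean/variance on the response scale:
logitnorm::momentsLogitnorm(mu = eta_mean, sigma = sqrt(eta_var))
```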

## Links to MRP

What does MRP do about this? Does the Bayesian averaging over uncertainty somehow smooth over the population distribution? Or do they directly construct an estimate of the population-level distributions of predictors?

## junk

## Closed form integral of Gaussian over logistic curve?

We are interested in the mean of logistic(Gaussian), i.e. $E[g^{-1}(\eta)]$ where $\eta$ is Normally distributed and $g^{-1}$ is the logistic function.

Tried it in Python (and Wolfram Alpha). This just reiterates what we already knew: the logistic-normal mean doesn't have a closed-form solution. (WA can actually do the non-generic case (standard Normal), which sympy can't, but even WA breaks down for the non-standard Normal case.)

```{python integrate, eval=FALSE}
from sympy import Symbol, exp, sqrt, pi, oo, integrate
x = Symbol('x')
## mean of logistic(x) for x ~ standard Normal
integrate(exp(-x**2/2)/sqrt(2*pi)/(1+exp(-x)), (x, -oo, oo))
```

A [CrossValidated question](https://stats.stackexchange.com/questions/45267/expected-value-of-a-gaussian-random-variable-transformed-with-a-logistic-functio) about calculating this transformation (and various interesting series approximations etc., although these are very likely not worth the trouble).  Turns out that if we don't want to use the delta method there is an existing package that codes up the mean & variance of the logistic-normal (exact, up to integration accuracy):

```r
logitnorm::momentsLogitnorm(mu = 2, sigma = 2)
x <- rnorm(1e5, 2, 2)
x <- scale(x)*2 + 2      ## rescale so the sample has exact moments (mean 2, sd 2)
mean(x)
mean(plogis(x))          ## correct value (mean on the response scale)
plogis(mean(x))          ## naive value (back-transform of the mean)
emdbook::deltamethod(1/(1+exp(-x)), x)
## The `delta2` value is supposedly a higher-order Taylor expansion/correction,
## but a little bit of poking around suggests these are bogus/possibly wrong ...
```

The entire `mln()` function definition (i.e. the body of `logitnorm::momentsLogitnorm()`) is not very complicated; we don't really need a package for this ...

```r
## hand-rolled copy of logitnorm::momentsLogitnorm()
mln <- function (mu, sigma, abs.tol = 0, ...) {
    fExp <- function(x) plogis(x) * dnorm(x, mean = mu, sd = sigma)
    .exp <- integrate(fExp, -Inf, Inf, abs.tol = abs.tol, ...)$value
    fVar <- function(x) (plogis(x) - .exp)^2 * dnorm(x, mean = mu, sd = sigma)
    .var <- integrate(fVar, -Inf, Inf, abs.tol = abs.tol, ...)$value
    c(mean = .exp, var = .var)
}
```
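
For example, the hand-rolled version should agree with the packaged one:

```r
mln(mu = 2, sigma = 2)
logitnorm::momentsLogitnorm(mu = 2, sigma = 2)
```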

## Prediction vs effects plots

In which I try to figure out what the hell we're actually trying to do.

Notation:

\newcommand{\nset}[1]{#1_{{n}}} \newcommand{\yref}{y_{\textrm{ref}}} \newcommand{\cdist}{{D(\nset{x}|x_f)}} \newcommand{\cdistprime}{{D(\nset{x}|x_{f'})}} \newcommand{\xfprime}{x_{f'}}

The purpose and goal of a prediction plot seems fairly straightforward; for specified values of (a) focal predictor(s), we want to give a point estimate and confidence intervals for the prediction of the model for a "typical" (= random sample over the multivariate distribution of non-focal parameters) individual with those values of the predictors ("what is the expected probability of having clean water for a 25-year-old male"?). These are most of the problems we have actually been addressing with our bias correction stuff above. To do these computations, we need to take the values of the non-focal predictors (or their means and covariances) from the conditional distribution. (If the focal predictors are discrete, we condition on exact values; if they are continuous/have mostly unique values, we condition on appropriately sized bins.) In other words, $$ \underset{\cdist}{\textrm{mean}} g^{-1} \left(\eta(\nset{x},x_f) \right) $$

In contrast, an effects plot is supposed to show the marginal effect (in the economics/differential sense) of varying the focal parameter(s). This means, technically, that we want the derivative (and its uncertainty) with respect to the focal parameters to match the derivative predicted by the model.

$$ \overline{\frac{d \mu}{dx_f}}(\xfprime) = \int_\cdistprime \frac{dg^{-1}(\eta)}{d\eta} \hat \beta_f \, d \nset{x} $$

$\hat \beta_f$ is linear in $\nset{x}$, so nonlinear averaging doesn't affect it, but we do have to worry about weighting the different values of $\hat \beta_f$ by the corresponding values of $dg^{-1}(\eta)/d\eta$ (since $\eta$ also depends on $\nset{x}$).

It seems that the appropriate computed value (y-axis height) of the effects plot at a given value of $x_f$ is

$$ \yref + \iint_{\cdistprime} \frac{dg^{-1}(\eta(\xfprime, \nset{x}))}{d\eta} \hat \beta_f \, d \nset{x} \, d \xfprime $$

In the case without interactions $\hat \beta_f = \beta_f$ is a constant and we can pull it out of the integrals, but unless the distribution of the non-focal parameters is independent of the focal parameter(s) it looks like we might have to do some relatively complicated stuff.
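
A rough sketch of this computation for a logistic model without interactions (so $\hat \beta_f = \beta_f$), using crude binning for the conditional averaging and a rectangle rule for the outer integral; `dd`, `fit`, the focal predictor `age`, and the reference height `yref` are hypothetical:

```r
## sketch: effects-plot height by integrating the averaged derivative over x_f
beta_f <- coef(fit)["age"]
nbin   <- 20
dd$bin <- cut(dd$age, breaks = nbin)
eta    <- predict(fit, type = "link")
## average marginal effect within each bin; dlogis(eta) is dg^{-1}(eta)/d eta for the logit link
dmu  <- tapply(dlogis(eta) * beta_f, dd$bin, mean)
xmid <- tapply(dd$age, dd$bin, mean)
## crude rectangle-rule integration over the focal predictor, starting from yref
height <- yref + cumsum(c(0, diff(xmid)) * dmu)
plot(xmid, height, type = "l")
```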

In practice to implement this I would suggest:

(BB prefers this, JD says let's not do it now)

## Alternatives

I think the key is to start by asking "what is the effects plot supposed to represent?"; I claim it's supposed to represent (in some way) the marginal (economic-sense) effect of the focal parameter on typical individuals at a given value of the focal parameter. It might be less confusing to plot the effect itself (e.g. put d(probability)/d(age) on the y-axis, rather than probability) — it would eliminate one of the integrals — but that's not the convention.


