
In this document, we demonstrate how to apply Bayesian kernel machine regression (BKMR) for binary outcomes using the bkmr R package. See the overview guide for additional information about BKMR, including guided examples for continuous outcomes.

Probit kernel machine regression

We implement kernel machine regression (KMR) for binary outcomes,

$$ \Phi^{-1}(P(Y_i = 1)) = h(z_{i1}, \ldots, z_{iM}) + \beta{\bf x}_i, \quad i = 1, \ldots, n, $$ where $\Phi$ is the cumulative distribution function (CDF) of the standard normal distribution (so that $\Phi^{-1}$ is the probit link function), the outcome $Y_i$ is a binary (0/1) variable, $h$ is a flexible function of the predictor variables $z_{i1}, \ldots, z_{iM}$, and ${\bf x}_i$ is a vector of covariates with corresponding vector of coefficients $\beta$. We refer to the predictors ${\bf z}$ as exposure variables and to $h(\cdot)$ as the exposure-response function. The function $h$ is modeled using a kernel machine representation that can capture complex, non-linear, and non-additive exposure-response relationships.

We implement probit, rather than logistic, regression primarily for reasons of computational convenience and efficiency in Bayesian inference via Gibbs sampling. In particular, the probit model above can be re-expressed by incorporating a latent normal random variable $Y^*$, as $$ Y_i^* = h(z_{i1}, \ldots, z_{iM}) + \beta{\bf x}_i + \epsilon_i, \quad i = 1, \ldots, n, $$ where $\epsilon_i \sim \mbox{N}(0,1)$ and $Y_i = I(Y_i^* > 0)$ is equal to 1 if $Y_i^* > 0$ and equal to 0 otherwise. In our example below we demonstrate how the exposure-response function $h$ can be interpreted under the probit regression model.
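
To see the equivalence between the two formulations, note that, by the symmetry of the standard normal distribution, $$ P(Y_i = 1) = P(Y_i^* > 0) = P\left(\epsilon_i > -h(z_{i1}, \ldots, z_{iM}) - \beta{\bf x}_i\right) = \Phi\{h(z_{i1}, \ldots, z_{iM}) + \beta{\bf x}_i\}, $$ and applying $\Phi^{-1}$ to both sides recovers the probit model stated above.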

Example

First, load the R package.

library(bkmr)

Generate data

Let's consider a simple example in which the outcome data are generated under the probit model above and the true exposure-response function depends on only a single exposure variable.

set.seed(123)
n <- 200 ## number of observations
M <- 4 ## number of exposure variables
beta.true <- 0.1
Z <- matrix(runif(n * M, -1, 1), n, M)
x <- 3*cos(Z[, 1]) + 2*rnorm(n)
hfun <- function(z) (2*z + 0.5) ^ 2
h <- hfun(Z[, 1]) ## only depends on z1

## generate using latent normal representation
eps <- rnorm(n)
ystar <- x * beta.true + h + eps
y <- ifelse(ystar > 0, 1, 0)

datp <- list(n = n, M = M, beta.true = beta.true, Z = Z, h = h, X = cbind(x), y = y, eps = eps, ystar = ystar)
rm(n, M, beta.true, Z, x, h, eps, y, ystar)

Let's view the true exposure-response function used to generate the data.

curve(hfun, from = min(datp$Z[, 1]), to = max(datp$Z[, 1]),
      xlab = expression(z[1]), ylab = expression(h(z[1])))

Fit BKMR

To fit the BKMR model, we use the kmbayes function.

set.seed(123)
fitpr <- kmbayes(y = datp$y, Z = datp$Z, X = datp$X, 
                 iter = 10000, verbose = FALSE, 
                 varsel = TRUE, family = "binomial",
                 control.params = list(r.jump2 = 0.5))

The family argument specifies the outcome distribution; 'gaussian' and 'binomial' are currently implemented. Note that here we changed the tuning parameter r.jump2 of the Metropolis-Hastings algorithm used to update the $r_m$ parameters under variable selection. This yields an improved acceptance rate, which can be computed from the fitted object as 100*with(fitpr, round(mean(acc.rdelta[move.type == 2]), 2)) percent, compared with roughly 65% under the default tuning parameters (details of the tuning parameters are given in the overview guide).

summary(fitpr)
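
Before interpreting the output, it can also be useful to informally assess convergence of the MCMC sampler, for example by inspecting trace plots of selected parameters. A minimal sketch using the package's TracePlot function (the particular parameters shown here are just illustrative choices):

## trace plots for the covariate coefficient and for the first r_m parameter
TracePlot(fit = fitpr, par = "beta")
TracePlot(fit = fitpr, par = "r", comp = 1)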

Interpreting output

On probit scale

We may wish to interpret the estimated $h$ function directly. Note that $h$ quantifies the relationship between the exposures and the (probit of the) probability of an event ($Y = 1$), holding the covariates ${\bf x}$ fixed. Under the latent normal formulation above, $h$ may alternatively be interpreted as the relationship between the exposures and an underlying, continuous latent variable $Y^*$. For example, if $Y$ is an indicator of whether an individual has a particular health outcome, $Y^*$ could be interpreted as a latent marker of health status.

Let's investigate the estimated exposure-response function $h$. Here we plot the univariate relationship $h(z_m)$, where all of the other exposures are fixed to their median values.

pred.resp.univar <- PredictorResponseUnivar(fit = fitpr, method = "exact")

We use the ggplot2 package to plot the resulting cross section of $h$.

library(ggplot2)
ggplot(pred.resp.univar, aes(z, est, ymin = est - 1.96*se, ymax = est + 1.96*se)) +
    geom_smooth(stat = "identity") +
    facet_wrap(~ variable) +
    ylab("h(z)")

As expected given the small posterior inclusion probabilities for $z_2$, $z_3$, and $z_4$, the estimated exposure-response functions for these exposures are essentially flat, which matches the true data-generating distribution. We next compare the exposure-response function for $z_1$ estimated under BKMR with that from a probit model assuming a linear term for each of the exposure variables, as well as with an 'oracle' probit model that knows the true form of the exposure-response function, both fitted by maximum likelihood:

## fit the comparison probit models by maximum likelihood
z1 <- datp$Z[, 1]
x <- drop(datp$X)
oracle <- glm(y ~ z1 + I(z1^2) + x, family = binomial(link = "probit"), data = datp)
lin <- glm(y ~ Z + x, family = binomial(link = "probit"), data = datp)

## predictions under the oracle model
z1_grid <- seq(min(datp$Z[, 1]), max(datp$Z[, 1]), length.out = 50)
hpred_oracle <- predict(oracle, newdata = data.frame(z1 = z1_grid, x = 0), se.fit = TRUE)

## predictions under BKMR
Znew <- cbind(z1 = z1_grid, z2 = median(datp$Z[, 2]), 
              z3 = median(datp$Z[, 3]), z4 = median(datp$Z[, 4]))
hpred_bkmr <- ComputePostmeanHnew(fit = fitpr, Znew = Znew, method = "exact")

## predictions under the model with linear terms
hpred_lin <- predict(lin, newdata = list(Z = Znew, x = rep(0, nrow(Znew))), se.fit = TRUE)

Now let's compare the estimated exposure-response functions $h(z_1)$.

plot(z1_grid, hpred_bkmr$postmean, type = "l", 
     ylim = c(0.95*min(datp$h), max(datp$h)), 
     xlab = expression(z[1]), ylab = expression(h(z[1])))
lines(z1_grid, hpred_oracle$fit, col = "red", lty = 2, lwd = 2)
lines(z1_grid, hfun(z1_grid), col = "blue", lty = 3, lwd = 2)
lines(z1_grid, hpred_lin$fit, col = "orange", lty = 4, lwd = 2)
legend(-1, 6, c("BKMR", "oracle", "truth", "linear"), lwd = 2, 
       col = c("black", "red", "blue", "orange"), lty = 1:4, 
       y.intersp = 0.8)

As expected, we see that the BKMR fit performs better than the model assuming a linear exposure-response relationship, but not as well as the oracle model.

On probability scale

Alternatively, we may wish to interpret the association between the exposures and the (untransformed) probability of the outcome. For this we observe that $$ P(Y = 1 \mid {\bf z}, {\bf x}) = \Phi\{h(z_{1}, \ldots, z_{M}) + \beta{\bf x}\}, $$ so the probability of the event depends not just on $h$, but also on the particular values of the covariates ${\bf x}$. Thus, to estimate the association between the exposures and the probability of the outcome, we must either fix the covariates or integrate over them. Posterior samples of the predicted probabilities may be obtained using the SamplePred function, in which the user specifies the new Z matrix at which to obtain predictions, as well as a particular value of the covariate vector ${\bf x}$. Here we plot the posterior mean of the predicted probabilities as a function of $z_1$, with the covariate fixed at particular values and the other exposures fixed at their median values.

Xnew1 <- quantile(datp$X, 0.1)
Xnew2 <- quantile(datp$X, 0.9)

ptrue1 <- with(datp, pnorm(hfun(z1_grid) + Xnew1*beta.true))
ptrue2 <- with(datp, pnorm(hfun(z1_grid) + Xnew2*beta.true))

pred_samps1 <- SamplePred(fit = fitpr, Znew = Znew, Xnew = Xnew1, type = "response")
pred_samps2 <- SamplePred(fit = fitpr, Znew = Znew, Xnew = Xnew2, type = "response")
pred_ests1 <- colMeans(pred_samps1)
pred_ests2 <- colMeans(pred_samps2)

par(mfrow = c(1, 2))
plot(z1_grid, pred_ests1, type = "l", ylim = range(ptrue1, ptrue2),
     xlab = expression(z[1]), ylab = expression(P(Y == 1)))
lines(z1_grid, ptrue1, col = "blue", lty = 3, lwd = 2)
legend("bottomright", c("BKMR", "truth"), lwd = 2, 
       col = c("black", "blue"), lty = c(1,3), 
       y.intersp = 0.8, cex = 0.8)
plot(z1_grid, pred_ests2, type = "l", ylim = range(ptrue1, ptrue2),
     xlab = expression(z[1]), ylab = expression(P(Y == 1)))
lines(z1_grid, ptrue2, col = "blue", lty = 3, lwd = 2)

To integrate over the covariates, we use the fact that, under the assumed probit model, the probability of the outcome given the exposure variables ${\bf z}$ may be expressed as $$ P(Y = 1 \mid {\bf z}) = \mbox{E}[P(Y = 1 \mid {\bf z}, {\bf x}) \mid {\bf z}] = \mbox{E}[\Phi\{h(z_{1}, \ldots, z_{M}) + \beta{\bf x}\} \mid z_{1}, \ldots, z_{M}]. $$ We estimate this quantity as $$ \frac{1}{n}\sum_{i = 1}^n \Phi\{h(z_{1}, \ldots, z_{M}) + \beta{\bf x}_i\}. $$ Built-in functions for this calculation have not yet been implemented.
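
Although no built-in function exists yet, a rough plug-in version of this calculation can be pieced together from quantities computed earlier: the posterior mean of $h$ at the new exposure values and the posterior samples of $\beta$ stored in the fitted object. The sketch below is only an approximation, since it plugs in posterior means rather than averaging the probabilities over the full posterior (and it does not discard burn-in iterations):

## plug-in estimate of P(Y = 1 | z), averaging over the observed covariate values
h_est <- ComputePostmeanHnew(fit = fitpr, Znew = Znew, method = "exact")$postmean
beta_est <- colMeans(fitpr$beta)
prob_marg <- sapply(h_est, function(hz) mean(pnorm(hz + drop(datp$X) * beta_est)))

plot(z1_grid, prob_marg, type = "l",
     xlab = expression(z[1]), ylab = expression(P(Y == 1)))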


