knitr::opts_chunk$set(echo = TRUE)
The R package `pgmultinomr` provides a function to fit a Bayesian categorical regression model using Pólya-gamma (PG) data augmentation for efficient Gibbs sampling inference. The categorical outcome data may be complete, or they may contain partially or fully missing outcomes. For more details on the method, see [Hoskovec et al.]. The following functions are available in the package `pgmultinomr`:
|function | brief description of usage |
|------------------|------------------------------------|
|pgmultinom | fit Bayesian categorical regression with K > 2 using PG data augmentation |
|sim_dat | simulate data for categorical regression with possible missing outcomes |
|sim_study | reproduce the simulation study in [Hoskovec et al.] for a single data set |
|get_odds_ratio | calculate the posterior distribution of the odds ratio as a function of exposures |
In this vignette, we show how to install the package, fit the model, and post-process and visualize the results for inference. We also demonstrate how to reproduce the simulation study in [Hoskovec et al.].
First, load the following package dependencies. Use install.packages() to install them, if needed.
library(Rcpp)
library(RcppArmadillo)
library(mvnfast)
library(pgdraw)
The package can then be installed from GitHub.
devtools::install_github("lvhoskovec/pgmultinomr", build_vignettes = TRUE)
library(pgmultinomr)
We briefly describe the model here. For more details, see [Hoskovec et al.].
For a sample $i = 1, \ldots, n$, let $\mathbf{y}_i$ denote the $K$-dimensional vector indicating to which of $K$ possible outcome categories individual $i$ belongs. We model $\mathbf{y}_i$ with a multinomial distribution where the number of trials is $1$. Using the stick-breaking representation of the multinomial distribution, we model the complete data for individuals $i = 1, \ldots, n$ by
$$\begin{eqnarray} \mathbf{y}_i & \sim & \prod_{k=1}^{K-1} \text{binom}(y_{ik} \,|\, N_{ik}, \tilde{\pi}_{ik}), \label{stick} \end{eqnarray}$$
where $N_{i1} = 1$ and $N_{ik} = 1 - \sum_{j<k} y_{ij}$. We model each $\tilde{\pi}_{ik}$ for $i = 1, \ldots, n$ and $k = 1, \ldots, K-1$ using a logit link function of the exposures and covariates. The logit link for the stick-specific weights is given by
$$\begin{eqnarray} \tilde{\pi}_{ik} & = & \frac{e^{\psi_{ik}}}{1 + e^{\psi_{ik}}} \\ \psi_{ik} & = & \mathbf{x}_i^T \boldsymbol{\beta}_k + \mathbf{w}_i^T \boldsymbol{\gamma}_k, \end{eqnarray}$$
where $\boldsymbol{\beta}_k$ and $\boldsymbol{\gamma}_k$ are category-specific regression coefficients for the exposures and covariates, respectively. Latent Pólya-gamma random variables are introduced so that the conditional likelihood has a Gaussian kernel. Hence, we place multivariate normal priors on $\boldsymbol{\beta}_k$ and $\boldsymbol{\gamma}_k$ for efficient Gibbs sampling inference. For $k = 1, \ldots, K - 1$,
$$\begin{eqnarray} \boldsymbol{\beta}_k & \sim & N(\mathbf{b}_k, \mathbf{B}_k) \\ \boldsymbol{\gamma}_k & \sim & N(\mathbf{g}_k, \mathbf{G}_k). \end{eqnarray}$$
The default priors are standard normal.
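For intuition, the stick-specific weights imply category membership probabilities $\Pr(y_{ik}=1) = \tilde{\pi}_{ik}\prod_{j<k}(1-\tilde{\pi}_{ij})$ for $k < K$, with the final category receiving the remaining probability. The following minimal sketch (not part of the package) computes these probabilities from a hypothetical vector of linear predictors $\psi_{ik}$ for a single individual:

# hypothetical linear predictors psi_ik for one individual, with K - 1 = 5 sticks
psi <- c(0.4, -1.2, 0.3, -0.5, 0.8)
pitilde <- exp(psi) / (1 + exp(psi))      # inverse-logit stick weights
remaining <- cumprod(c(1, 1 - pitilde))   # probability mass left before each stick is broken
probs <- c(pitilde * remaining[1:length(pitilde)], remaining[length(remaining)])
sum(probs)                                # the K category probabilities sum to 1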
We first use the `sim_dat` function included in the package to simulate data. The `sim_dat` function includes the following parameters: `n` is the number of observations, `miss_prob` is the proportion of observations with partially or fully missing outcomes, `K` is the number of outcome categories, `p` is the number of exposures, `q` is the number of covariates, and `covX` is the covariance matrix of the exposure data. Three remaining parameters determine the simulation design:

1) `allmiss` is logical; if TRUE, observations with missing outcome data are missing values for all K categories (fully missing); if FALSE, they are missing values for between 2 and K-1 categories (partially missing).
2) `null_scenario` is logical; if TRUE, the regression coefficients are all set to 0 (null); if FALSE, the regression coefficients are all simulated from independent standard normal distributions (signal).
3) `equal_probs` is logical; if TRUE, category-specific intercepts are set so each category has roughly equal amounts of data (referred to as "equal probabilities" in the manuscript); if FALSE, intercepts are set to mimic the outcome category probabilities from the data analysis in [Hoskovec et al.] (referred to as "data probabilities" in the manuscript).
We generate $n=1000$ observations with $p=3$ correlated exposures, $q=5$ independent covariates plus an intercept, and $K=6$ outcome categories. To mimic the simulation study in [Hoskovec et al.], we generate exposure data from a multivariate normal distribution with mean 0 and covariance matrix `covX` given by
covX = matrix(c(1, -0.13, 0.02,
                -0.13, 1, -0.86,
                0.02, -0.86, 1), nrow = 3, ncol = 3)
Outcome data are simulated from the stick-breaking representation of the multinomial distribution. In this example, we simulate data for the scenario with data probabilities, partially missing outcomes, and a signal. We set the missing data level to 50$\%$. The following code simulates the data for our example:
n = 1000
K = 6
p = 3
q = 5
dat = sim_dat(n = n, miss_prob = 0.5, K = K, p = p, q = q, covX = covX,
              allmiss = FALSE, null_scenario = FALSE, equal_probs = FALSE)
The object `dat` is a list with elements: `y`, the simulated outcome data with missing values; `ycomplete`, the complete outcome data; `x`, the exposure data; `w`, the covariate data; `beta_true`, the true exposure regression coefficients; and `gamma_true`, the true covariate regression coefficients.
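Before fitting the model, it can be helpful to confirm the dimensions of the simulated data and the realized level of missingness. The following is a minimal sketch, assuming missing outcome entries in `dat$y` are coded as `NA`:

dim(dat$y)                          # n x K outcome matrix with missing values
dim(dat$x)                          # n x p exposure matrix
dim(dat$w)                          # n x q covariate matrix
mean(apply(is.na(dat$y), 1, any))   # fraction of observations with any missing outcome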
Next we demonstrate how to fit the Bayesian multinomial regression model using the `pgmultinom` function. The function `pgmultinom` takes the following parameters: `niter` is the total number of MCMC iterations, `priors` is a list of priors (see below for details), `y` is the categorical outcome data (with any missing values), `x` is the exposure data matrix, `w` is the (optional) covariate data matrix, and `intercept` is a logical dictating whether an intercept term should be estimated. The model requires exposure data `x`, whereas covariate data `w` are optional. We include the 3 exposures simulated above and all pairwise interactions in `x`. We also include our simulated independent covariate data in `w`. We fit the model by
niter = 1000
# all pairwise interactions among the exposures
ints <- combn(1:ncol(dat$x), 2)
z <- apply(ints, 2, FUN = function(a) { dat$x[, a[1]] * dat$x[, a[2]] })
x = cbind(dat$x, z)
fit = pgmultinom(niter = niter, priors = NULL, y = dat$y, x = x, w = dat$w,
                 intercept = TRUE)
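Before post-processing, it is worth checking that the chain mixes well. The following minimal sketch (not part of the package) plots the trace of the first column of `fit$beta.vec`, which, as in the plotting code below, has one row per MCMC iteration and one column per category-specific exposure coefficient:

# trace plot of one exposure regression coefficient
plot(fit$beta.vec[, 1], type = "l",
     xlab = "MCMC iteration", ylab = "coefficient value")
abline(v = niter/2, lty = 2)   # burn-in cutoff used below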
Interpretation of the regression coefficients in the stick-breaking representation of the multinomial distribution is challenging. We present a visualization approach and base inference on the exposure-response function. First, we plot the posterior distribution of the estimated exposure regression coefficients. The following code uses the R packages ${\tt ggplot2}$ and ${\tt cowplot}$. Install if needed.
library(ggplot2)
library(cowplot)
nburn = niter/2
beta_quantiles = t(apply(fit$beta.vec[(nburn+1):niter, ], 2, FUN = function(b){
  return(c(quantile(b, 0.025), mean(b), quantile(b, 0.975)))
}))
namesX = c("exposure1", "exposure2", "exposure3", "exp1exp2", "exp1exp3", "exp2exp3")
namesY = c("cat1", "cat2", "cat3", "cat4", "cat5")
mix_df = cbind(data.frame(exp(beta_quantiles)), rep(namesX, K-1))
colnames(mix_df) = c("lwr", "mean", "upr", "exposure")
plot_list = list()
for(p in 1:length(namesX)){
  data_plot = mix_df[which(mix_df$exposure == namesX[p]), ]
  limits = c(min(data_plot$lwr), max(data_plot$upr))
  # make a single exposure plot
  mix_plot = ggplot() +
    geom_point(data = data_plot, mapping = aes(x = 1:(K-1), y = mean), size = 3) +
    geom_errorbar(data = data_plot, mapping = aes(x = 1:(K-1), ymin = lwr, ymax = upr), width = 0.4) +
    geom_hline(yintercept = 1, colour = "red") +
    scale_x_discrete(name = "", breaks = factor(1:(K-1)), limits = factor(1:(K-1)), label = namesY) +
    scale_y_continuous(name = "exp(beta)", limits = limits) +
    theme(text = element_text(size = 12)) +
    ggtitle(namesX[p])
  # save plots in a list
  plot_list[[p]] = mix_plot
}
# plot all exposures
plot_grid(
  plot_list[[1]] + theme(legend.position = "none"),
  plot_list[[2]] + theme(legend.position = "none"),
  plot_list[[3]] + theme(legend.position = "none"),
  plot_list[[4]] + theme(legend.position = "none"),
  plot_list[[5]] + theme(legend.position = "none"),
  plot_list[[6]] + theme(legend.position = "none"),
  align = 'h', labels = "", hjust = -1, nrow = 2
)
Next we show how to visualize the odds ratio for each category relative to a reference category as a function of each exposure. For more details on this inferential approach, see Section 3 in [Hoskovec et al.].
The matrix `x.star` includes a sequence of values for a primary exposure, with the secondary exposures set to a fixed percentile. The matrix `x.base` includes the primary exposure set to its mean value and the secondary exposures set to the same fixed percentile. Interactions between exposures are then calculated. The following is an example of how to make `x.star` and `x.base`. In this example, exposure 1 is the primary exposure. The matrix `x.star` includes a sequence of 100 values for exposure 1 ranging from -1 to 1. The secondary exposures, exposures 2 and 3, are set to their 50th percentiles. The matrix `x.base` includes the primary exposure set to its mean and the secondary exposures set to their 50th percentiles. The following code makes `x.star` and `x.base` for this example; it can be adapted as needed.
len = 100
x.star = matrix(0, nrow = len, ncol = ncol(x))
x.star[,1] = seq(-1, 1, length.out = len)   # sequence of exposure 1 values
x.star[,2] = quantile(x[,2], 0.50)          # set exposure 2 to a fixed percentile
x.star[,3] = quantile(x[,3], 0.50)          # set exposure 3 to a fixed percentile
# calculate the interactions
x.star[,4] = x.star[,1]*x.star[,2]
x.star[,5] = x.star[,1]*x.star[,3]
x.star[,6] = x.star[,2]*x.star[,3]
# make x.base
x.base = matrix(0, nrow = 1, ncol = ncol(x))
x.base[,1] = mean(x[,1])                    # set exposure 1 to its mean
x.base[,2] = quantile(x[,2], 0.50)          # set exposure 2 to a fixed percentile
x.base[,3] = quantile(x[,3], 0.50)          # set exposure 3 to a fixed percentile
# calculate the interactions
x.base[,4] = x.base[,1]*x.base[,2]
x.base[,5] = x.base[,1]*x.base[,3]
x.base[,6] = x.base[,2]*x.base[,3]
Next, we use the function `get_odds_ratio`, provided in the package, to create the posterior distribution of the odds ratio using `x.star` and `x.base`. The function `get_odds_ratio` takes the following parameters: `x.star` and `x.base` are the matrices described above, `beta_posterior` is the list of posterior estimates of the exposure regression coefficients (beta) from the model fit, `niter` is the total number of MCMC iterations, `nburn` is the number of burn-in iterations, `K` is the number of outcome categories, and `refK` is the selected reference category. As described in Section 3 in [Hoskovec et al.], this method allows any category to be selected as the reference category. We select category 2 as the reference category in the example below.
# calculate OR(x.star, x.base)
refK = 2
odds_ratio = get_odds_ratio(x.star = x.star, x.base = x.base, beta_posterior = fit$beta,
                            niter = 1000, nburn = 500, K = 6, refK = refK)
The object `odds_ratio` is a list of the posterior mean, 2.5th percentile, and 97.5th percentile of the odds ratio as a function of `x.star` and `x.base` for each of the $K-1$ categories relative to the reference category. We provide code below to plot the posterior distribution of the estimated odds ratio as a function of exposure 1, relative to the mean of exposure 1, for each outcome category relative to the reference category.
# make data frame of odds ratios
or_df = data.frame(xvals = x.star[,1], matrix(unlist(odds_ratio), nrow = len))
# set column names
or_names = "xvals"
for(k in 1:(K-1)){
  k_names = c("lwr", "mean", "upr")
  or_names = c(or_names, k_names)
}
colnames(or_df) = or_names
# make K-1 plots of odds ratio for each category
plot_or = list()
for(k in 1:(K-1)){
  # get data for category k
  col_nums = 1 + (3*k-2):(3*k)
  df_k = or_df[, c(1, col_nums)]
  # plot OR for category k_num
  k_num = ((1:K)[-refK])[k]
  plotk = ggplot(data = df_k) +
    ggtitle(paste0("cat", k_num)) +
    geom_line(aes(x = xvals, y = mean), size = 1.5) +
    geom_ribbon(mapping = aes(x = xvals, ymin = lwr, ymax = upr),
                fill = "black", alpha = .4, show.legend = FALSE) +
    geom_line(aes(x = xvals, y = 1), linetype = "dotted", size = 1.5) +
    scale_x_continuous(name = "exposure1") +
    scale_y_continuous(name = "estimated OR relative to mean") +
    theme(text = element_text(size = 12), axis.title.y = element_text(size = 8))
  plot_or[[k]] = plotk
}
# K-1 OR plots in a panel
plot_grid(
  plot_or[[1]] + theme(legend.position = "none"),
  plot_or[[2]] + theme(legend.position = "none"),
  plot_or[[3]] + theme(legend.position = "none"),
  plot_or[[4]] + theme(legend.position = "none"),
  plot_or[[5]] + theme(legend.position = "none"),
  align = 'h', labels = "", hjust = -1, nrow = 2
)
The user can specify different prior distributions for the exposure and covariate regression coefficients. The available prior parameters are:
|control parameter | model parameter | default value | description |
|------------------|-----------------|---------------|------------------|
|betamean |$\mathbf{b}_k$ | $\mathbf{0}$ |mean parameter for multivariate Normal prior on $\boldsymbol{\beta}_k$, $k=1, \ldots, K-1$ |
|betavar |$\mathbf{B}_k$ | $\mathbf{I}$ |variance parameter for multivariate Normal prior on $\boldsymbol{\beta}_k$, $k=1, \ldots, K-1$ |
|gammamean |$\mathbf{g}_k$ | $\mathbf{0}$ |mean parameter for multivariate Normal prior on $\boldsymbol{\gamma}_k$, $k=1, \ldots, K-1$ |
|gammavar |$\mathbf{G}_k$ | $\mathbf{I}$ |variance parameter for multivariate Normal prior on $\boldsymbol{\gamma}_k$, $k=1, \ldots, K-1$ |
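As an illustration, the following sketch builds a user-specified priors list using the control parameter names from the table above. The exact list structure expected by `pgmultinom` (for example, whether one mean vector and variance matrix are shared across all $K-1$ categories, and how the intercept column is handled) is an assumption here; consult the package documentation for the definitive format.

# hypothetical priors list; element names assumed to match the control parameters above
p_x <- ncol(x)        # number of exposure terms (3 exposures + 3 pairwise interactions)
p_w <- ncol(dat$w)    # number of covariates
priors <- list(betamean  = rep(0, p_x),
               betavar   = 4 * diag(p_x),   # more diffuse than the default identity
               gammamean = rep(0, p_w),
               gammavar  = 4 * diag(p_w))
fit2 <- pgmultinom(niter = niter, priors = priors, y = dat$y, x = x, w = dat$w,
                   intercept = TRUE)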
The package includes the function `sim_study` to reproduce the simulation study in [Hoskovec et al.]. See Section 4 in [Hoskovec et al.] for more details on the simulation study. Here we demonstrate how to use `sim_study`.
The function `sim_study` takes the following parameters: `simnum` is the simulation number (which is used to set a seed), `niter` is the total number of MCMC iterations to run the model, `nburn` is the number of burn-in iterations, `n` is the sample size, and `miss_prob` is the proportion of observations with missing outcome data. The three remaining parameters determine the simulation design as described above. To reiterate:

1) `allmiss` is logical; if TRUE, observations with missing outcome data are missing values for all K categories (fully missing); if FALSE, they are missing values for between 2 and K-1 categories (partially missing).
2) `null_scenario` is logical; if TRUE, the regression coefficients are all set to 0 (null); if FALSE, the regression coefficients are all simulated from independent standard normal distributions (signal).
3) `equal_probs` is logical; if TRUE, category-specific intercepts are set so each category has roughly equal amounts of data (referred to as "equal probabilities" in the manuscript); if FALSE, intercepts are set to mimic the outcome category probabilities from the data analysis in [Hoskovec et al.] (referred to as "data probabilities" in the manuscript).

Other parameters for the model fit are set to those described in Section 3 in [Hoskovec et al.].
Below we demonstrate running the simulation study for the scenario with data probabilities, partially missing outcomes, and a signal in the regression coefficients. We specify a missing data level of 50$\%$. The function fits the `pgmultinom` model as described above and imputes data for missing outcomes. It also fits the model to the subset of complete cases.
sim_example = sim_study(simnum = 1, niter = 1000, nburn = 500, n = 1000, miss_prob = 0.5,
                        allmiss = FALSE, null_scenario = FALSE, equal_probs = FALSE)
The object `sim_example` is a list containing two elements: `results` contains the results of the proposed method with imputation, and `complete_case_results` contains the results of the complete case analysis.
The object `sim_example$results` is a data frame containing the root mean squared error (RMSE), bias, 95$\%$ credible interval width (ci_width), and coverage of the 95$\%$ credible interval (cov) for the estimated exposure (beta) and covariate (gamma) regression coefficients, as well as precision and recall for the 6 outcome categories.
round(sim_example$results,2)
The object `sim_example$complete_case_results` is a data frame containing the root mean squared error (RMSE), bias, 95$\%$ credible interval width (ci_width), and coverage of the 95$\%$ credible interval (cov) for the estimated exposure (beta) and covariate (gamma) regression coefficients in the complete case analysis.
round(sim_example$complete_case_results,2)
In this example, the proposed method yields estimation gains over the complete case analysis, as evidenced by smaller RMSE and narrower credible intervals for the exposure and covariate regression coefficients.
For more information, questions, or comments, contact the package maintainer Lauren Hoskovec at lvheck@rams.colostate.edu.