knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

\newcommand{\greekbf}[1]{\boldsymbol{\mathrm{#1}}} \newcommand{\diff}{\mathop{}\!\mathrm{d}}

Introduction

Cumulative probit models provide a way to parametrically model ordinal data, such as data on party identification and skeletal age categories -- for example, see Chapter 8 of Jackman 2009 [Bayesian Analysis for the Social Sciences] and Konigsberg 2015 [Multivariate cumulative probit for age estimation using ordinal categorical data]. To our knowledge, however, no method exists for analyzing mixed sets of response variables that can be both ordinal and continuous, though Gueorguieva and Agresti 2001 [A Correlated Probit Model for Joint Modeling of Clustered Binary and Continuous Responses] describe a two-variable model involving a single continuous variable and a single binary variable. This vignette describes a mixed cumulative probit model implemented in the R package 'yada'. By mixed we mean that response variables can be either continuous or ordinal. The package supports, in principle, any number of variables and any number of categories for each ordinal variable. We allow only one independent variable and utilize a specific parametric form for the dependence of the latent dependent variables on the independent variable, but generalizations that relax these assumptions would be straightforward.

Application

Although this model and the associated R code should have wide application, we had a specific application in mind when we developed it: forensic juvenile age estimation. The independent variable is $x$, age, and the dependent variables can be a mix of continuous skeletal traits (for example, femur length) and ordinal skeletal traits (for example, epiphyseal fusion). Ultimately, our goal is to predict posterior age given skeletal traits, which requires a specification of the prior probability $p(x|\alpha)$, where $\alpha$ is a hyperparameter that parameterizes the prior, and a model for the likelihood, $p(\mathbf{y}|x,\greekbf{\theta})$. We adopt a fully Bayesian approach in which we sample the posterior distribution $p(\greekbf{\theta}|D,\greekbf{\alpha})$ for $\greekbf{\theta}$ using an adaptive Metropolis algorithm, where $D$ stands for the observed data.

Preliminaries

Let $x$ be a (scalar) independent variable and let $\mathbf{v}^*$ and $\mathbf{w}$ be, respectively, a vector of latent ordinal response variables indexed by $j=1,2,\cdots,J$ and a vector of directly observed continuous variables indexed by $k=1,2,\cdots,K$. We assume the overall response vector $\mathbf{y}^* = [\mathbf{v}^*,\mathbf{w}]^T$ is distributed as

$$\mathbf{y}^*_n \sim \mathbf{f}(x_n) + \mathcal{N}(0,\Sigma)$$

where $n=1,2,\cdots,N$ indexes observations, $\mathbf{f} = [\mathbf{g},\mathbf{h}]^T$ is a mean vector that depends on $x$, $\mathbf{g}(x)$ is the mean vector for ordinal response variables, $\mathbf{h}(x)$ is the mean vector for continuous response variables, and $\Sigma$ is the covariance matrix for the normal draw. For identifiability, the diagonal values of $\Sigma$ are $1$. For elements of $\mathbf{g}$, we adopt the parametric form

$$g_j(x) = \alpha_j \, x^{1 - \rho_j}$$

and for elements of $\mathbf{h}$ we adopt the parametric form

$$h_k(x) = a_k \, x^{1 - r_k} + b_k$$
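
These mean functions translate directly into R. The following is a minimal sketch; the function names are illustrative and are not part of yada's API:

# Ordinal mean: g_j(x) = alpha_j * x^(1 - rho_j) (no intercept term)
g_mean <- function(x, alpha_j, rho_j) alpha_j * x^(1 - rho_j)

# Continuous mean: h_k(x) = a_k * x^(1 - r_k) + b_k
h_mean <- function(x, a_k, r_k, b_k) a_k * x^(1 - r_k) + b_k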

The intercept term is omitted for ordinal variables to ensure identifiability. For each ordinal variable $j$, only the category ($m_j=0,1,\cdots,M_j$) is directly observed, and we model the mapping between the latent vector $\mathbf{v}^*$ and observed ordinal categories $\mathbf{v}$ via the threshold parameters $\greekbf{\tau}^{(j)}$ [$j=1,2,\cdots,J$].

$$v_{j,n} = 0 \Longleftrightarrow v^*_{j,n} \leq \tau^{(j)}_1$$ $$v_{j,n} = m \Longleftrightarrow \tau^{(j)}_m < v^*_{j,n} \leq \tau^{(j)}_{m+1}$$ $$v_{j,n} = M \Longleftrightarrow v^*_{j,n} > \tau^{(j)}_M$$

where here (and elsewhere) we drop the indices on $m_j$ and $M_j$ when they are clear from context. Finally, as we expand on below, we assume that the observations $x_n$ are independently drawn from a probability distribution parameterized by $\greekbf{\theta}_x$.
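
In R, the threshold mapping above can be expressed with findInterval, which counts the number of thresholds at or below each latent value. A minimal sketch (illustrative, not yada's API; ties with a threshold occur with probability zero for continuous latent values):

# Map latent values ystar to ordinal categories 0,...,M given a vector
# of M thresholds tau
latent_to_ordinal <- function(ystar, tau) findInterval(ystar, tau)

latent_to_ordinal(c(1.0, 3.0, 5.0), tau=c(2.5, 4)) # returns 0 1 2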

Bayesian inference

The full set of parameters to be estimated for the model described in the preceding section is $\greekbf{\theta} = [\greekbf{\theta}_x,\greekbf{\theta}_y]$, where $\greekbf{\theta}_y = [\greekbf{\alpha},\greekbf{\rho},\mathbf{a},\mathbf{r},\mathbf{b},\greekbf{\tau},\mathbf{s}]^T$, $\greekbf{\tau} = [\greekbf{\tau}^{(1)},\cdots,\greekbf{\tau}^{(J)}]$ is the complete vector of threshold parameters, and $\mathbf{s} = [\Sigma_{12},\Sigma_{13},\cdots,\Sigma_{1,J+K},\Sigma_{23},\cdots]^T$ is the vector of ${J+K \choose 2}$ unique off-diagonal terms in the covariance matrix (unwrapped by row). We assume that both the prior and the likelihood are conditionally separable. That is:

$$p(\greekbf{\theta}|\greekbf{\alpha}) = p(\greekbf{\theta}_x|\greekbf{\alpha}_x) \, p(\greekbf{\theta}_y|\greekbf{\alpha}_y)$$

and

$$p(\mathbf{x},\mathbf{Y}|\greekbf{\theta}) = p(\mathbf{Y}|\greekbf{\theta}_y,\mathbf{x}) \, p(\mathbf{x}|\greekbf{\theta}_x)$$

where $\greekbf{\alpha} = [\greekbf{\alpha}_x,\greekbf{\alpha}_y]^T$ is a hyperparameter that parameterizes the prior distribution. Given this assumption of conditional separability and using $D = \{\mathbf{x},\mathbf{Y}\}$ for all the observed data, posterior inference on $\greekbf{\theta}$ simplifies to

$$p(\greekbf{\theta}|D,\greekbf{\alpha}) = \frac{\left[ p(\greekbf{\theta}_y|\greekbf{\alpha}_y) \displaystyle \prod_{n=1}^{N} p(\mathbf{y}_n|x_n,\greekbf{\theta}_y) \right] \left[p(\greekbf{\theta}_x|\greekbf{\alpha}_x) \displaystyle \prod_{n=1}^{N} p(x_n|\greekbf{\theta}_x)\right]}{p(D|\greekbf{\alpha})}$$

Similarly, Bayesian inference for new observations simplifies to

$$p(x'|\mathbf{y}',D,\greekbf{\alpha}) = \frac{\left[\int_{\Theta_y} p(\mathbf{y}'|\greekbf{\theta}_y,x') \, p(\greekbf{\theta}_y|D,\greekbf{\alpha}) \, \diff\greekbf{\theta}_y\right] \left[\int_{\Theta_x} p(x'|\greekbf{\theta}_x) \, p(\greekbf{\theta}_x|D,\greekbf{\alpha}) \, \diff\greekbf{\theta}_x\right]}{p(\mathbf{y}'|D,\greekbf{\alpha})}$$

where $\mathbf{y}'$ is a new set of observed outcome variables and $x'$ is the corresponding independent variable. That is, given a set of forensic observations ($\mathbf{y}'$), predict age ($x'$). In summary, there are two problems to be solved. First, posterior inference on $\greekbf{\theta}$ must be done given a set of training data. We do so by sampling from the posterior of $\greekbf{\theta}$ using an adaptive Metropolis algorithm. Given our assumption of conditional separability, $\greekbf{\theta}_y$ can be sampled independently of $\greekbf{\theta}_x$. Second, the unknown $x'$ (age) must be predicted from a new set of observations $\mathbf{y}'$ (forensic data). Since missing data are common in many applications, we allow for missing data both in training and in prediction for new observations.
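
For intuition, the prediction integrals can be approximated by averaging over posterior samples and evaluating on a grid of candidate ages; a conceptually similar computation underlies the posterior plots at the end of this vignette. The sketch below assumes, as in the simulations later on, that $x$ is exponentially distributed with rate $\greekbf{\theta}_x$; lik_y is a hypothetical stand-in for the observation likelihood $p(\mathbf{y}'|\greekbf{\theta}_y,x)$:

# Grid approximation of p(x'|y',D,alpha) via Monte Carlo over posterior
# samples. theta_y_samples is a list of posterior draws of theta_y,
# theta_x_samples a vector of posterior draws of the exponential rate.
posterior_x <- function(y_new, x_grid, theta_y_samples, theta_x_samples, lik_y) {
  dens <- sapply(x_grid, function(x) {
    term_y <- mean(sapply(theta_y_samples, function(th) lik_y(y_new, th, x)))
    term_x <- mean(dexp(x, rate=theta_x_samples))
    term_y * term_x
  })
  dens / sum(dens) # normalize over the grid
}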

Likelihood of $\greekbf{\theta}_y$

Sampling from the posterior distribution $p(\greekbf{\theta}_y|D,\greekbf{\alpha})$ requires calculation of the likelihood

$$L^{(y)} = \displaystyle \prod_{n=1}^{N} p(\mathbf{y}_n|x_n,\greekbf{\theta}_y) = \displaystyle \prod_{n=1}^{N} l^{(y)}_n$$

and the prior $p(\greekbf{\theta}_y|\greekbf{\alpha}_y)$. In turn, calculation of the individual likelihood $l^{(y)}_n$ for each observation $n$ requires calculation of a multivariate Gaussian integral with (possibly infinite) rectangular limits and a dimension less than or equal to $J+K$. To specify this integral precisely, let $\mathbf{c}_1$ be a vector of variables known to lie between the limits $\greekbf{\tau}^{(c_1-)}$ and $\greekbf{\tau}^{(c_1+)}$ and let $\mathbf{c}_2$ be a vector of known variables for which

$$[\mathbf{c}_1,\mathbf{c}_2]^T \sim \mathcal{N}(\greekbf{\mu},\Sigma)$$

where the mean has components $\greekbf{\mu} = [\greekbf{\mu}_1,\greekbf{\mu}_2]^T$ and $\Sigma$ has the corresponding blocks $\Sigma_{11}$, $\Sigma_{22}$, and $\Sigma_{12} = \Sigma_{21}^T$. Conditioning on the known vector $\mathbf{c}_2$ allows a decomposition of the probability density of the vector $\mathbf{c}=[\mathbf{c}_1,\mathbf{c}_2]^T$,

$$\phi(\mathbf{c};\greekbf{\mu},\Sigma) = \phi(\mathbf{c}_2;\greekbf{\mu}_2,\Sigma_{22}) \, \phi(\mathbf{c}_1;\bar{\greekbf{\mu}}_1,\bar{\Sigma}_{11})$$

where $\phi$ is the probability density function of the multivariate Gaussian distribution and

$$\bar{\greekbf{\mu}}_1 = \greekbf{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{c}_2 - \greekbf{\mu}_2)$$

and

$$\bar{\Sigma}_{11} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{12}^T$$

In general, $\mathbf{c}_1$ corresponds to ordinal variables, with the limits set by the observed category, and $\mathbf{c}_2$ corresponds to the continuous variables. However, a missing variable is simply assumed to lie on the interval $-\infty$ to $\infty$, and if all continuous variables are missing for an observation then $\mathbf{c}_1$ has length $J+K$ and $\mathbf{c}_2$ has length $0$. The likelihood of the observation $[\mathbf{c}_1,\mathbf{c}_2]^T$ is

$$ l(\mathbf{c}_1,\mathbf{c}_2) = \phi(\mathbf{c}_2;\greekbf{\mu}_2,\Sigma_{22}) \int_{\greekbf{\tau}^{(c)}} \phi(\mathbf{c}_1;\bar{\greekbf{\mu}}_1,\bar{\Sigma}_{11}) \, \diff\mathbf{c}_1 $$

where ${\greekbf{\tau}^{(c)}}$ is shorthand for the integration limits.
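
A sketch of this calculation using the mvtnorm package follows; yada's internal implementation may differ. Here i1 indexes the interval-censored entries (ordinal or missing) with limits lower1 and upper1, and i2 indexes the directly observed continuous entries:

library(mvtnorm) # provides dmvnorm and pmvnorm

obs_lik <- function(c2, lower1, upper1, mu, Sigma, i1, i2) {
  if (length(i2) == 0) { # all variables interval-censored (e.g., all missing)
    return(pmvnorm(lower1, upper1, mean=mu[i1], sigma=Sigma[i1,i1,drop=FALSE])[1])
  }
  # Condition the censored block on the observed continuous values c2
  S12 <- Sigma[i1, i2, drop=FALSE]
  S22inv <- solve(Sigma[i2, i2, drop=FALSE])
  mu_bar <- as.vector(mu[i1] + S12 %*% S22inv %*% (c2 - mu[i2]))
  Sigma_bar <- Sigma[i1, i1, drop=FALSE] - S12 %*% S22inv %*% t(S12)
  # phi(c2; mu_2, Sigma_22) times the rectangular Gaussian integral over c1
  dmvnorm(c2, mean=mu[i2], sigma=Sigma[i2,i2,drop=FALSE]) *
    pmvnorm(lower1, upper1, mean=mu_bar, sigma=Sigma_bar)[1]
}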

Sampling $\greekbf{\theta}_y$

To sample from the posterior of $\greekbf{\theta}_y$, we utilize the adaptive Metropolis algorithm described in Haario et al. (2001) [An adaptive Metropolis algorithm]. The algorithm adapts the proposal distribution based on the full history of samples, following a burn-in period during which the proposal distribution is held fixed. The posterior probability is proportional to

$$\eta(\mathbf{y};\greekbf{\theta}_y,\greekbf{\alpha}_y) = L^{(y)}(\mathbf{y};\greekbf{\theta}_y) \, p(\greekbf{\theta}_y|\greekbf{\alpha}_y)$$

We adopt a conservative, flat prior for $\greekbf{\theta}_y$ so that $p(\greekbf{\theta}_y|\greekbf{\alpha}_y)$ can be treated as constant. The candidate distribution is a normal draw centered on the sample from step $t-1$, with covariance equal to

$$C_t = \left\{ \begin{array}{ll} C_0, & t \leq t_0 \\ s_d \, \mathrm{cov}(\greekbf{\theta}_y^{(0)},\cdots,\greekbf{\theta}_y^{(t-1)}) + s_d \, \epsilon \, I_d, & t > t_0 \end{array} \right.$$

where $s_d$ depends on the dimension $d$ of $\greekbf{\theta}_y$ and $I_d$ is the identity matrix of dimension $d$. The term with the parameter $\epsilon$ ensures that $C_t$ is never singular. Following Gelman et al. (1996) [Efficient Metropolis jumping rules] we use $s_d = (2.4)^2 / d$. Efficient recursive algorithms exist to update the covariance of the proposal distribution as each new sample arrives, as sketched below.
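
A minimal sketch of such a recursive (Welford-style) rank-one update, with illustrative variable names (yada's implementation may differ):

# Update the running mean mu and the running sum of squared deviations M
# after observing the t-th sample theta_t, avoiding a full recompute.
update_moments <- function(mu, M, theta_t, t) {
  mu_new <- mu + (theta_t - mu) / t # updated running mean
  M_new  <- M + tcrossprod(theta_t - mu, theta_t - mu_new)
  cov_new <- if (t > 1) M_new / (t - 1) else NULL
  # After burn-in, the proposal covariance is s_d*cov_new + s_d*eps*I_d
  list(mu=mu_new, M=M_new, cov=cov_new)
}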

Fitting single variables

yada provides tools for working with observations of single ordinal or continuous variables. simGenCrra creates simulated data with mean $g(x) = a \, x^{1-r} + b$ and noise added via a draw from a normal distribution with standard deviation $1$. Create a vector of exponentially distributed observations $x$ and call the simGenCrra function (setting the random number seed for reproducible results) with $a=4$, $r=.75$, and $b=-2$:

library(yada)
set.seed(971484) # Seed from random.org between 1 and 1,000,000
x <- rexp(100)
paramTrue <- c(4,.75,-2)
y <- simGenCrra(x,paramTrue)

Fit these data, then plot the resulting fit along with the data and the true (simulated) curve:

plot(x,y,xlab='x',ylab='y')
xv <- seq(0,max(x),len=500)
param <- fitGenCrra(x,y)
lines(xv,genCrra(xv,paramTrue),col='green',lwd=2)
lines(xv,genCrra(xv,param),col='red',lwd=2)

Ordinal data can be simulated with simGenCrraOrd. By default, only the ordinal categories are returned, but both the ordinal categories and the latent underlying continuous trait can be returned by setting the latent input to TRUE. Simulate data using $\alpha=4$, $\rho=.75$, $\tau_1=2.5$, and $\tau_2=4$:

paramTrue <- c(4,.75,2.5,4)
ordData <- simGenCrraOrd(x,paramTrue,latent=T)
print(names(ordData))
print(table(ordData$y))

Fit these data, then plot the resulting fit along with the latent data and the true (simulated) curve:

plot(x,ordData$ystar,xlab='x',ylab='ystar')
param <- fitGenCrraOrd(x,ordData$y)
lines(xv,genCrraOrd(xv,paramTrue),col='green',lwd=2)
lines(xv,genCrraOrd(xv,param),col='red',lwd=2)
# Optionally add the true (green) and fitted (red) thresholds:
#lines(c(0,max(x)),c(1,1)*paramTrue[3],col='green',lwd=2,lty=2)
#lines(c(0,max(x)),c(1,1)*paramTrue[4],col='green',lwd=2,lty=2)
#lines(c(0,max(x)),c(1,1)*param[3],col='red',lwd=2,lty=2)
#lines(c(0,max(x)),c(1,1)*param[4],col='red',lwd=2,lty=2)

Fitting ordinal data is more challenging since only the ordinal categories are observed. With sufficient data, however, the parameters can be accurately estimated. Repeat the preceding steps for $10000$ observations:

x2 <- rexp(10000)
xv2 <- seq(0,max(x2),len=500)
ordData2 <- simGenCrraOrd(x2,paramTrue,latent=T)
plot(x2,ordData2$ystar,xlab='x',ylab='ystar')
param2 <- fitGenCrraOrd(x2,ordData2$y)
lines(xv2,genCrraOrd(xv2,paramTrue),col='green',lwd=2)
lines(xv2,genCrraOrd(xv2,param2),col='red',lwd=2)

Bayesian inference on simulated data

We now demonstrate how to implement the Bayesian inference described above on simulated data. First, define the model by specifying the number of ordinal and continuous variables (and the number of categories per ordinal variable) in a hyperparameter variable, hp, and setting the true values of $\greekbf{\theta}_y$:

N <- 1000 # number of observations / individuals
hp <- list(paramModel='genCrra') # The parametric model to use for the mean
hp$varRange <- matrix(c(0,0,-Inf,-Inf,4,4,Inf,Inf),ncol=2) # Lower/upper bounds for each variable (ordinal: 0 to 4; continuous: unbounded)
hp$J <- 2 # number of ordinal variables
hp$K <- 2 # number of continuous variables
hp$M <- c(4,4) # Number of categories for each ordinal variable

theta_y0_list <- list(paramModel='genCrra')
theta_y0_list$alpha <- c(2,3)   # alpha_j for the two ordinal variables
theta_y0_list$rho <- c(.4,.7)   # rho_j for the two ordinal variables
theta_y0_list$a <- c(3,1.75)    # a_k for the two continuous variables
theta_y0_list$r <- c(-.25,.6)   # r_k for the two continuous variables
theta_y0_list$b <- c(2,-4)      # b_k for the two continuous variables
theta_y0_list$tau <- list()     # Threshold vectors, one per ordinal variable
theta_y0_list$tau[[1]] <- c(1,2,3.5,4.5)
theta_y0_list$tau[[2]] <- c(0,2,3,5.5)
theta_y0_list$Sigma <- diag(4)  # Covariance matrix with unit diagonal
theta_y0_list$Sigma[1,4] <- .2  # Correlation between ord1 and cont2
theta_y0_list$Sigma[4,1] <- theta_y0_list$Sigma[1,4]
theta_y0_list$Sigma[3,4] <- -.3 # Correlation between cont1 and cont2
theta_y0_list$Sigma[4,3] <- theta_y0_list$Sigma[3,4]

$\greekbf{\theta}_y$ can be represented as either a list, which is more readable for humans, or a vector. Conversion between representations is done with theta_yList2Vect and theta_yVect2List.

theta_y0 <- theta_yList2Vect(theta_y0_list)
print(theta_y0_list)
print(theta_y0)
print(theta_yList2Vect(theta_yVect2List(theta_y0,hp)))

Next, create simulated data with $1000$ observations and a missing data proportion of $0.05$ by calling simMixedCumProbit:

theta_x <- 1/3 # Rate of the exponential distribution from which x is drawn
sim <- simMixedCumProbit(theta_y0_list,N,theta_x,hp,0.05)
print(names(sim))
print(dim(sim$Y))
print(sum(is.na(sim$Y)) / prod(dim(sim$Y)))

In the variable sim, x is the vector of independent observations, Y is the matrix of actual observations, Ystar is the matrix of latent observations, and Y0 is the matrix of actual observations with no missing data. Next, sample $\greekbf{\theta}_y$ with the adaptive Metropolis algorithm (parallelized via doParallel) and bundle the samples for analysis, passing the true parameter values as the starting point:

library(doParallel)
registerDoParallel(detectCores()-2)
samp <- sample_theta_y(sim$x,sim$Y,hp,100,2000,verbose=F,start=theta_y0_list)
bundle <- bundle_theta_y_sample(samp$theta_yList,samp$logLikVect,sim$x,sim$Y,hp,
                                varNames=c('ord1','ord2','cont1','cont2'),
                                known=list(theta_y=theta_y0_list))
print(names(bundle))

Now plot alpha samples:

theta_y_plot_sample_variable(bundle,'alpha')

Now plot a samples:

theta_y_plot_sample_variable(bundle,'a')

Now plot tau for the first ordinal variable:

theta_y_plot_tau(bundle,1)

Now plot all the correlations:

theta_y_plot_corr(bundle)

Now repeat the sampling without setting start:

samp2 <- sample_theta_y(sim$x,sim$Y,hp,100,10000,verbose=F)
bundle2 <- bundle_theta_y_sample(samp2$theta_yList,samp2$logLikVect,sim$x,sim$Y,hp,
                                 varNames=c('ord1','ord2','cont1','cont2'),
                                 known=list(theta_y=theta_y0_list))
print(names(bundle2))

Now plot alpha samples:

theta_y_plot_sample_variable(bundle2,'alpha')

Now plot a samples:

theta_y_plot_sample_variable(bundle2,'a')

Finally, simulate six new test observations and plot the posterior of $x$ for each, using (roughly) the maximum likelihood sample for $\greekbf{\theta}_y$ and a point estimate for $\greekbf{\theta}_x$:

simTest <- simMixedCumProbit(theta_y0_list,6,theta_x,hp,0.05)
xv3 <- seq(0,10,len=1000)
thBest <- samp2$theta_yList[[which.max(samp2$logLikVect)]] # Roughly maximum likelihood
theta_x <- 1/mean(sim$x) # Point estimate of the exponential rate from the training data
print(simTest$x)
print(simTest$Y)
par(mfrow=c(3,2))
for(n in 1:length(simTest$x)) {
  plot_mixed_cum_probit_posterior(thBest,theta_x,xv3,simTest$Y[,n],hp,simTest$x[n])
}

