library(magrittr) library(dplyr) library(tibble) library(ggplot2) library(rogali) knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
Let $\mathcal{J}$ be a set of municipalities, and $\mathcal{I}$ a set of individuals. There are $\mathcal{T}$ time periods. The outcome of interest is $Y_{ijt} \in {0,1,2}$, with:
Define $Z_j \in {0,1}$ the type of municipality $j$. If $Z_j = 0$, municipality $j$ is not corrupt. Otherwise, $j$ is corrupt. Furthermore, define $T_{jt} \in {0,1}$ be municipality $j$'s treatment status at time $t$, with $T_{jt} = 1$ if municipality $j$ has been audited at some period $t' \leq t$, and $T_{jt} = 0$ otherwise. In other words, $T_{jt} = 0$ until $j$'s first audit, and $T_{jt} = 1$ from its first audit. Note that a municipality's type is observed if and only if it has been audited at least once. That is: $$ Z_j \text{ is observed} \iff \max_{t \in \mathcal{T}} T_{jt} = 1 $$ Finally, define $X_i$ a vector of time-invariant employee characteristics and $V_j$ a vector of time-invariant municipality characteristics.
Suppose that $Z_j$ follows the following binomial: $$ \Pr(Z_j = 1 | U_j) = \text{logit} \left(V_j' \gamma \right) $$
Suppose that $Y_{ijt}$ follows the following multinomial: $$ \Pr(Y_{ijt} = k | Z_j, X_i) = \begin{cases} \frac{\exp(\beta_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \beta_{k3} X'i)} {1 + \sum{k \in {1,2}} \exp(\beta_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \gamma X'i)} \text{, if } k \in {1, 2} \ \frac{1} {1 + \sum{k \in {1,2}} \exp(\alpha_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \beta_{k3} X'_i)} \text{, otherwise} \end{cases} $$
In other words, we use a multinomial with year fixed-effects $\beta_{kt}$ (which is equivalent to a discrete-time proportional hazard survival model) and an interaction term between treatment status $T_{jt}$ and the latent type $Z_j$.
We estimate the model using a Gibbs sampler that uses the data-augmentation scheme developped by Polson, Scott and Windle (JASA, 2013). In what follows, let $N(\mu, \sigma^2)$ be a multivariate normal distribution with mean $\mu$ and variance $\sigma^2$, and $PG(a, b)$ be a Polya-Gamma distribution with parameters $a$ and $b$. In this section, we use a pooled notation for the multinomial. Let $Y_{xjt} \equiv (n_{0xjt}, n_{1xjt}, n_{2xjt})$ be a vector counting the number of individuals sharing covariates $x$ in municipality $j$ on year $t$ with outcomes $Y_{ijt} = 0, 1, 2$ respectively, and $n_{xjt} \equiv n_{0xjt} + n_{1xjt} + n_{2xjt}$ be the total number of individuals in this cell. The key to this sampler is to use a specific data-augmentation strategy. Denote $\omega_{\beta kxjt}$ a latent variable for the individuals with outcome $k$ in $y_{xjt}$, and $\omega_{\gamma j}$ a latent variable for municipality $j$.
Parameters have normal priors: $$ \gamma \sim N(\mu_\gamma, \sigma^2_\gamma) \ \beta_k \sim N(\mu_{\beta_k}, \sigma^2_{\beta_k}) $$
Our sampler is the following: $$ \omega_{\gamma j} | \omega_{\beta kxjt}, \gamma, \beta, Z, Y \sim PG(1, U_j' \gamma) \ \gamma | \omega_{\gamma}, \omega_{\beta}, \beta, Z, Y \sim N(m_\gamma, V_\gamma) \ \omega_{\beta kxjt} | \omega_{\gamma}, \gamma, \beta, Z, Y \sim PG(n_{xjt}, \eta_{kxjt}) \ \beta_k | \omega_{\gamma}, \omega_{\beta k}, \gamma, Z, Y \sim N(m_{\beta k}, V_{\beta k}) Z_j | \omega_{\gamma}, \omega_{\beta k}, \gamma, \beta, Y \sim \text{Binom}(\pi_j) $$
with
$$
V_\gamma = (U' \Omega_\gamma U + (\sigma^2_\gamma)^{-1})^{-1} \
m_\gamma = V_\gamma (U' \kappa_\gamma + (\sigma^2_\gamma)^{-1} \mu_\gamma) \
\kappa_\gamma = (Z_1 - 1/2, ..., Z_J - 1/2) \
V_{\beta k} = (X' \Omega_{\beta k} X + (\sigma^2_{\beta_k})^{-1})^{-1} \
m_{\beta k} = V_{\beta k} (X' (\kappa_{\beta k} + \Omega_{\beta k} C_k) + (\sigma^2_{\beta_k})^{-1} \mu_{\beta_k}) \
\pi_j = \frac{\Pr(Z_j = 1 | \gamma) \Pr(Y_j | Z_j = 1, \beta)}
{\Pr(Z_j = 0 | \gamma) \Pr(Y_j | Z_j = 0, \beta) + \Pr(Z_j = 1 | \gamma) \Pr(Y_j | Z_j = 1, \beta)}
$$
Note that $\Omega_v$ is a diagonal matrix with diagonal elements $\omega_{vw}$, and $$ \eta_{kxjt} = X_{xjt}' \beta_k - C_{kxjt} \text{ with } C_{kxjt} = \log \sum_{l \neq k} \exp X_{xjt}' \beta_l $$
df <- expand.grid(idMun = 1:100, year = 1:5) %>% mutate( state = ceiling(idMun/10), state = as.factor(state), treat = as.integer((idMun %% 10) <= 5 & year == (idMun %% 10)), year = as.factor(year) ) %>% arrange(idMun, year) dfMun <- df %>% select(idMun, state) %>% distinct() matMun <- model.matrix(~ state, dfMun) gamma <- c(seq(0.2, 1, .1)-.5, .5) names(gamma) <- paste0("gamma-", colnames(matMun)) Z <- plogis(matMun %*% gamma) Z <- sapply(Z, rbinom, n = 1, size = 1) dfMun$Z <- Z trueZ <- Z df <- df %>% inner_join(dfMun, by = c("idMun" = "idMun", "state" = "state")) mat <- model.matrix(~ state + year + treat*Z, df) beta1 <- c(seq(0.1, 1, .1)-.5, -.2, -.1, .1, .2, 0, -.5, 1) beta2 <- beta1 - .3 names(beta1) <- names(beta2) <- colnames(mat) names(beta1) <- paste0("beta1-", names(beta1)) names(beta2) <- paste0("beta2-", names(beta2)) mb1 <- exp(mat %*% beta1) mb2 <- exp(mat %*% beta2) pr1 <- mb1 / (1 + mb1 + mb2) pr2 <- mb2 / (1 + mb1 + mb2) pr0 <- 1 - (pr1 + pr2) pr <- cbind(pr0, pr1, pr2) rm(mb1, mb2, pr0, pr1, pr2) outcomes <- t(apply(pr, 1, rmultinom, n = 1, size = 1000) ) outcomes <- as.data.frame(outcomes) colnames(outcomes) <- c("outcome_0", "outcome_1", "outcome_2") df <- cbind(outcomes, df) dfs <- list( df = df %>% group_by(idMun) %>% mutate(isTreat = max(treat), Z = ifelse(isTreat == 0, NA, Z)) %>% select(-isTreat) %>% ungroup() ) dfs$dfMun <- dfMun %>% select(-Z) %>% inner_join(dfs$df %>% select(idMun, Z) %>% distinct(), by = c("idMun" = "idMun")) dfs$df <- dfs$df %>% rename(j = idMun) dfs$dfMun <- dfs$dfMun %>% rename(j = idMun) %>% as_tibble() df <- dfs$df dfMun <- dfs$dfMun trueParams <- c(gamma, beta1, beta2) rm(dfs, mat, matMun, outcomes, pr, beta1, beta2, gamma, Z)
# true parameter values trueParams # true types tibble(idMun = dfMun$j, Z = dfMun$Z, trueZ = trueZ) # individual-level data.frame df # municipality-level data.frame dfMun
We may estimate one or several chains. Note that the package supports a parallel implementation using the furrr
package.
mod1 <- gibbs( formula = cbind(outcome_0, outcome_1, outcome_2) ~ state + year + treat*Z, data = df, formulaMun = Z ~ state, dataMun = dfMun, munCol = j, zCol = Z, yearCol = year, # index columns for municipalities, type (z), and year nSamples = 1000, # number of samples per chain nChains = 1, parallel = F ) mod2 <- gibbs( formula = cbind(outcome_0, outcome_1, outcome_2) ~ state + year + treat*Z, data = df, formulaMun = Z ~ state, dataMun = dfMun, munCol = j, zCol = Z, yearCol = year, nSamples = 1000, nChains = 2, parallel = F )
The usual R
methods are implemented, with a few caveats:
dplyr
gBurn
and gThin
.summary(mod1) summary(mod2, starts_with("gamma-"), burn = 500)
pl <- plot(mod2, what = "coef", data = T) pl <- bind_rows( pl %>% mutate(type = "estimate"), enframe(trueParams) %>% rename(predictor = name, estimate = value) %>% mutate(type = "true") ) coefPlot <- function(df) { ggplot(df, aes( x = predictor, y = estimate, color = type, ymin = `2.5%`, ymax = `97.5%` )) + geom_point() + geom_linerange() + geom_linerange(aes(ymin = `5%`, ymax = `95%`), lwd = 1) + coord_flip() + labs(title = "Estimates vs. true values") } coefPlot(pl %>% filter(grepl("gamma-", predictor))) coefPlot(pl %>% filter(grepl("beta", predictor)))
posterior_probs(mod2) confusion_table(mod2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.