library(magrittr)
library(dplyr)
library(tibble)
library(ggplot2)
library(rogali)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Theory

Notation

Let $\mathcal{J}$ be a set of municipalities, and $\mathcal{I}$ a set of individuals. There are $\mathcal{T}$ time periods. The outcome of interest is $Y_{ijt} \in {0,1,2}$, with:

Define $Z_j \in {0,1}$ the type of municipality $j$. If $Z_j = 0$, municipality $j$ is not corrupt. Otherwise, $j$ is corrupt. Furthermore, define $T_{jt} \in {0,1}$ be municipality $j$'s treatment status at time $t$, with $T_{jt} = 1$ if municipality $j$ has been audited at some period $t' \leq t$, and $T_{jt} = 0$ otherwise. In other words, $T_{jt} = 0$ until $j$'s first audit, and $T_{jt} = 1$ from its first audit. Note that a municipality's type is observed if and only if it has been audited at least once. That is: $$ Z_j \text{ is observed} \iff \max_{t \in \mathcal{T}} T_{jt} = 1 $$ Finally, define $X_i$ a vector of time-invariant employee characteristics and $V_j$ a vector of time-invariant municipality characteristics.

Model

Suppose that $Z_j$ follows the following binomial: $$ \Pr(Z_j = 1 | U_j) = \text{logit} \left(V_j' \gamma \right) $$

Suppose that $Y_{ijt}$ follows the following multinomial: $$ \Pr(Y_{ijt} = k | Z_j, X_i) = \begin{cases} \frac{\exp(\beta_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \beta_{k3} X'i)} {1 + \sum{k \in {1,2}} \exp(\beta_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \gamma X'i)} \text{, if } k \in {1, 2} \ \frac{1} {1 + \sum{k \in {1,2}} \exp(\alpha_{kt} + \beta_{k0} Z_j + \beta_{k1} T_{jt} + \beta_{k2} Z_j T_{jt} + \beta_{k3} X'_i)} \text{, otherwise} \end{cases} $$

In other words, we use a multinomial with year fixed-effects $\beta_{kt}$ (which is equivalent to a discrete-time proportional hazard survival model) and an interaction term between treatment status $T_{jt}$ and the latent type $Z_j$.

Estimation

We estimate the model using a Gibbs sampler that uses the data-augmentation scheme developped by Polson, Scott and Windle (JASA, 2013). In what follows, let $N(\mu, \sigma^2)$ be a multivariate normal distribution with mean $\mu$ and variance $\sigma^2$, and $PG(a, b)$ be a Polya-Gamma distribution with parameters $a$ and $b$. In this section, we use a pooled notation for the multinomial. Let $Y_{xjt} \equiv (n_{0xjt}, n_{1xjt}, n_{2xjt})$ be a vector counting the number of individuals sharing covariates $x$ in municipality $j$ on year $t$ with outcomes $Y_{ijt} = 0, 1, 2$ respectively, and $n_{xjt} \equiv n_{0xjt} + n_{1xjt} + n_{2xjt}$ be the total number of individuals in this cell. The key to this sampler is to use a specific data-augmentation strategy. Denote $\omega_{\beta kxjt}$ a latent variable for the individuals with outcome $k$ in $y_{xjt}$, and $\omega_{\gamma j}$ a latent variable for municipality $j$.

Parameters have normal priors: $$ \gamma \sim N(\mu_\gamma, \sigma^2_\gamma) \ \beta_k \sim N(\mu_{\beta_k}, \sigma^2_{\beta_k}) $$

Our sampler is the following: $$ \omega_{\gamma j} | \omega_{\beta kxjt}, \gamma, \beta, Z, Y \sim PG(1, U_j' \gamma) \ \gamma | \omega_{\gamma}, \omega_{\beta}, \beta, Z, Y \sim N(m_\gamma, V_\gamma) \ \omega_{\beta kxjt} | \omega_{\gamma}, \gamma, \beta, Z, Y \sim PG(n_{xjt}, \eta_{kxjt}) \ \beta_k | \omega_{\gamma}, \omega_{\beta k}, \gamma, Z, Y \sim N(m_{\beta k}, V_{\beta k}) Z_j | \omega_{\gamma}, \omega_{\beta k}, \gamma, \beta, Y \sim \text{Binom}(\pi_j) $$

with
$$ V_\gamma = (U' \Omega_\gamma U + (\sigma^2_\gamma)^{-1})^{-1} \ m_\gamma = V_\gamma (U' \kappa_\gamma + (\sigma^2_\gamma)^{-1} \mu_\gamma) \ \kappa_\gamma = (Z_1 - 1/2, ..., Z_J - 1/2) \ V_{\beta k} = (X' \Omega_{\beta k} X + (\sigma^2_{\beta_k})^{-1})^{-1} \ m_{\beta k} = V_{\beta k} (X' (\kappa_{\beta k} + \Omega_{\beta k} C_k) + (\sigma^2_{\beta_k})^{-1} \mu_{\beta_k}) \ \pi_j = \frac{\Pr(Z_j = 1 | \gamma) \Pr(Y_j | Z_j = 1, \beta)} {\Pr(Z_j = 0 | \gamma) \Pr(Y_j | Z_j = 0, \beta) + \Pr(Z_j = 1 | \gamma) \Pr(Y_j | Z_j = 1, \beta)} $$

Note that $\Omega_v$ is a diagonal matrix with diagonal elements $\omega_{vw}$, and $$ \eta_{kxjt} = X_{xjt}' \beta_k - C_{kxjt} \text{ with } C_{kxjt} = \log \sum_{l \neq k} \exp X_{xjt}' \beta_l $$

Application

Simulating data

df <- 
  expand.grid(idMun = 1:100, year = 1:5) %>% 
  mutate(
    state = ceiling(idMun/10), 
    state = as.factor(state), 
    treat = as.integer((idMun %% 10) <= 5 & year == (idMun %% 10)), 
    year = as.factor(year)
  ) %>% arrange(idMun, year)

dfMun <- df %>% 
  select(idMun, state) %>% 
  distinct()

matMun <- model.matrix(~ state, dfMun)
gamma <- c(seq(0.2, 1, .1)-.5, .5)
names(gamma) <- paste0("gamma-", colnames(matMun))
Z <- plogis(matMun %*% gamma)
Z <- sapply(Z, rbinom, n = 1, size = 1)
dfMun$Z <- Z
trueZ <- Z

df <- df %>% inner_join(dfMun, by = c("idMun" = "idMun", "state" = "state")) 

mat <- model.matrix(~ state + year + treat*Z, df)
beta1 <- c(seq(0.1, 1, .1)-.5, -.2, -.1, .1, .2, 0, -.5, 1)  
beta2 <- beta1 - .3
names(beta1) <- names(beta2) <- colnames(mat)
names(beta1) <- paste0("beta1-", names(beta1))
names(beta2) <- paste0("beta2-", names(beta2))

mb1 <- exp(mat %*% beta1)
mb2 <- exp(mat %*% beta2)
pr1 <- mb1 / (1 + mb1 + mb2)
pr2 <- mb2 / (1 + mb1 + mb2)
pr0 <- 1 - (pr1 + pr2)
pr <- cbind(pr0, pr1, pr2)
rm(mb1, mb2, pr0, pr1, pr2)
outcomes <- t(apply(pr, 1, rmultinom, n = 1, size = 1000) )
outcomes <- as.data.frame(outcomes)
colnames(outcomes) <- c("outcome_0", "outcome_1", "outcome_2")
df <- cbind(outcomes, df)

dfs <- list(
  df = df %>% 
    group_by(idMun) %>% 
    mutate(isTreat = max(treat), Z = ifelse(isTreat == 0, NA, Z)) %>% 
    select(-isTreat) %>% 
    ungroup()
)
dfs$dfMun <- dfMun %>% 
  select(-Z) %>% 
  inner_join(dfs$df %>% select(idMun, Z) %>% distinct(), by = c("idMun" = "idMun"))
dfs$df <- dfs$df %>% rename(j = idMun)
dfs$dfMun <- dfs$dfMun %>% rename(j = idMun) %>% as_tibble()
df <- dfs$df
dfMun <- dfs$dfMun
trueParams <- c(gamma, beta1, beta2)
rm(dfs, mat, matMun, outcomes, pr, beta1, beta2, gamma, Z)
# true parameter values
trueParams
# true types
tibble(idMun = dfMun$j, Z = dfMun$Z, trueZ = trueZ)
# individual-level data.frame
df
# municipality-level data.frame
dfMun

Estimation

We may estimate one or several chains. Note that the package supports a parallel implementation using the furrr package.

mod1 <- gibbs(
  formula = cbind(outcome_0, outcome_1, outcome_2) ~ state + year + treat*Z, data = df, 
  formulaMun = Z ~ state, dataMun = dfMun, 
  munCol = j, zCol = Z, yearCol = year, # index columns for municipalities, type (z), and year
  nSamples = 1000, # number of samples per chain
  nChains = 1, parallel = F
)

mod2 <- gibbs(
  formula = cbind(outcome_0, outcome_1, outcome_2) ~ state + year + treat*Z, data = df, 
  formulaMun = Z ~ state, dataMun = dfMun, 
  munCol = j, zCol = Z, yearCol = year, 
  nSamples = 1000, nChains = 2, parallel = F
)

The usual R methods are implemented, with a few caveats:

summary(mod1)

summary(mod2, starts_with("gamma-"), burn = 500)
pl <- plot(mod2, what = "coef", data = T)
pl <- bind_rows(
  pl %>% mutate(type = "estimate"), 
  enframe(trueParams) %>% 
    rename(predictor = name, estimate = value) %>% 
    mutate(type = "true")
)
coefPlot <- function(df) {
  ggplot(df, aes(
    x = predictor, y = estimate, color = type, ymin = `2.5%`, ymax = `97.5%`
  )) + 
    geom_point() + 
    geom_linerange() + 
    geom_linerange(aes(ymin = `5%`, ymax = `95%`), lwd = 1) + 
    coord_flip() + 
    labs(title = "Estimates vs. true values")
}

coefPlot(pl %>% filter(grepl("gamma-", predictor)))
coefPlot(pl %>% filter(grepl("beta", predictor)))
posterior_probs(mod2)
confusion_table(mod2)


rferrali/rogali documentation built on May 26, 2019, 7 p.m.