Basic unit-level models
In mcmcsae: Markov Chain Monte Carlo Small Area Estimation

options(width=100)  # width of output

The basic unit-level model [@Battese1988; @rao2015small] is given by $$ y_j = \beta' x_j + v_{i[j]} + \epsilon_j\,,\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) \qquad \epsilon_j \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma^2) $$ where $j$ runs from 1 to $n$, the number of unit-level observations, $\beta$ is a vector of regression coefficients for given covariates $x_j$, and $v_i$ are random area intercepts.

We use the api dataset included in packages survey.

library(survey)
data(api)
apipop$cname <- as.factor(apipop$cname)
apisrs$cname <- factor(apisrs$cname, levels=levels(apipop$cname))

The apipop data frame contains the complete population whereas apisrs is a simple random sample from it. The variable cname is the county name, and we will be interested in estimation at the county level. Not all counties in the population are sampled. In order to be able to make predictions for out-of-sample areas we make sure that the levels of the sample's cname variable match those of its population counterpart.

The basic unit-level model with county random area effects is fit as follows

library(mcmcsae)
mod <- api00 ~ 
  reg(~ ell + meals + stype + hsg + col.grad + grad.sch, name="beta") +
  gen(factor = ~ iid(cname), name="v")
sampler <- create_sampler(mod, data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)

We wish to estimate the area population means $$ \theta_i = \frac{1}{N_i}\sum_{j \in U_i} y_j\,, $$ where $U_i$ is the set of units in area $i$ of size $N_i$. The MCMC output in variable sim can be used to obtain draws from the posterior distribution for $\theta_i$. The $r$th draw can be expressed as $$ \theta_{i;r} = \frac{1}{N_i} \left(n_i \bar{y}i + \beta_r'(t{x;i} - n_i \bar{x}i) + (N_i - n_i)v{i;r} + \sum_{j \in U_i\setminus s_i} \epsilon_{j;r} \right)\,, $$ where $\bar{y}i$ is the sample mean of $y$ in area $i$ and $t{x;i}$ is a vector of population totals for area $i$.

N <- table(apipop$cname)
samplesums <- tapply(apisrs$api00, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
m <- match(apisrs$cds, apipop$cds)  # population units in the sample
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
theta <- c(tapply(apipop$api00, apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)

Binomial Unit-Level Model

A model with binomial likelihood can also be fit. We now model the target variable sch.wide, a binary variable indicating whether a school-wide growth target has been met. We use the same mean model structure as above for the linear model, but now using a logistic link function, $$ y_j \stackrel{\mathrm{iid}}{\sim} {\cal Be}(p_j)\,,\ \mathrm{logit}(p_j) = \beta' x_j + v_{i[j]}\,,\ v_i \stackrel{\mathrm{iid}}{\sim} {\cal N}(0, \sigma_v^2) $$

apisrs$target.met <- as.numeric(apisrs$sch.wide == "Yes")
sampler <- create_sampler(update(mod, target.met ~ .), family="binomial", data=apisrs)
sim <- MCMCsim(sampler, store.all=TRUE, verbose=FALSE)
summary(sim)

To predict the population fractions of schools that meet the growth target by county,

samplesums <- tapply(apisrs$target.met, apisrs$cname, sum)
samplesums[is.na(samplesums)] <- 0  # substitute 0 for out-of-sample areas
res <- predict(sim, newdata=apipop, labels=names(N),
         fun=function(x, p) (samplesums + tapply(x[-m], apipop$cname[-m], sum ))/N,
         show.progress=FALSE)
(summ <- summary(res))
theta <- c(tapply(apipop$sch.wide == "Yes", apipop$cname, mean))  # true population quantities
plot_coef(summ, list(est=theta), n.se=2, est.names=c("mcmcsae", "true"), maxrows=30)