Accounting for Weights in the Model

## Setup
options(scipen=1, digits=2)

In some cases the response variable may be an average of responses from within the same subject or group. In this case, the variances for two observations having the same covariates will not be identical if the size of the groups vary. Instead, it will be inversely proportional to the size of the group. Mathematically, if the observed response is the group average $\bar{Y}i=\sum{j=1}^{n_i} Y_{ij}/n_i$ and $Y_{ij} \sim N(\mu_i,\phi_i)$ where the dispersion parameter, $\phi_i$, may vary by group and depend on covariates then $$\bar{Y}_i \sim N(\mu_i,\phi_i/n_i).$$ This situation can be accommodated in dalmatian by specifying a column of weights within the data frame that provides the group size. Here is an example with simulated data that demonstrates this feature and shows the importance of properly accounting for weights.

Simulated data

Data for this example are contained in the data frame weights.df saved within the file weights-data-1.RData.

## Load library

## Load simulated data

The three columns in the data frame record the number of responses per group (n), the value of the covariate (x), and the mean response (y). The data were generated from the model $$Y_{ij} \sim N(x_i,\exp(1+1x_i))$$ with $i=1,\ldots,100$ indexing the groups and $j$ the observations within groups. In the data, the number of observations per group ranges from r min(weights_data_1$n) to r max(weights_data_1$n).

Model 1: No Weights

First we run the model with no weights.

## Mean model
mymean=list(fixed=list(name="alpha",formula=~ x,

## Variance model
                       formula=~ x,

## Set working directory
## By default uses a system temp directory. You probably want to change this.
workingDir <- tempdir()

## Define list of arguments for jags.model()
jm.args = list(file=file.path(workingDir,"weights_1_jags.R"),n.adapt=1000)

## Define list of arguments for coda.samples()
cs.args = list(n.iter = 5000, n.thin = 20)

## Run the model using dalmatian
results1 <- dalmatian(
  df = weights_data_1,
  mean.model = mymean,
  dispersion.model = mydisp,
  jags.model.args = jm.args,
  coda.samples.args = cs.args,
  response = "y",
  overwrite = TRUE,
  debug = FALSE)

## Numerical summary statistics
summary1 <- summary(results1)

## Graphical summaries
caterpillar1 <- caterpillar(results1, show = TRUE)
## Extract results
mean1 <- summary1$dispFixed[1,"Mean"]
lower1 <- summary1$dispFixed[1,"Lower 95%"]
upper1 <- summary1$dispFixed[1,"Upper 95%"]

From the summaries we can see that that the intercept in the dispersion model is being underestimated. The true value is 1, but the posterior mean is r mean1 with 95% HPD interval (r lower1,r upper1).

Model 2: Weights

We now re-run the model including the weights.

## Specify column containing weights
mydisp$weights = "n"

## Run model
jm.args = list(file=file.path(workingDir,"weights_2_jags.R"),n.adapt=1000)

results2 = dalmatian(df=weights_data_1,
                     overwrite = TRUE,

## Numerical summary statistics
summary2 <- summary(results2)

## Graphical summaries
caterpillar2 <- caterpillar(results2, show = TRUE)
## Extract results
mean2 <- summary2$dispFixed[1,"Mean"]
lower2 <- summary2$dispFixed[1,"Lower 95%"]
upper2 <- summary2$dispFixed[1,"Upper 95%"]

The new output shows that the estimate of the intercept for the variance model, r mean2, is now very close to the truth and the 95% credible interval, (r lower2,r upper2) easily covers the true value of 1.

Try the dalmatian package in your browser

Any scripts or data that you put into this service are public.

dalmatian documentation built on Sept. 14, 2020, 1:07 a.m.