This document describes Marginal Maximum Likelihood (MML) estimation for student test data in the DE
package. In these models, failing to account for the measurement variance can bias the regression or variance estimates.
The student test data are assumed to have been generated by an Item Response Theory (IRT) model in which the probability that student $i$ answers an item correctly (or earns a higher score) increases with the student's ability ($\theta_i$) and decreases with the item's difficulty--see the appendix for the likelihood functions for various response models.
This package regards a regression model of the form (one case in Cohen and Jiang, 1999) \begin{align} \bm{\theta} = \bm{X\beta} + \bm{\epsilon} \end{align} where $\bm{\theta}$ is a vector of test scores, $\bm{X}$ is a matrix of covariates with unknown parameters $\bm{\beta}$, and $\bm{\epsilon}$ is the vector of residuals. The residuals are assumed to be normally distributed, without covariance across observations (students), with shared variance of unknown level $\sigma^2$, so that \begin{align} \bm{\epsilon} \sim N(0,\sigma^2 \bm{I}) \end{align}
The next section describes the estimation of $\bm{\beta}$ and $\sigma^2$. The final section describes five methods for variance estimation available in MML estimation, including the traditional (consistent) method, two heteroskedasticity-robust methods, and two methods appropriate to a two-stage survey sample such as the National Assessment of Educational Progress (NAEP).
Student test data\footnote{Note that these methods apply equivalently to survey construct data that is scored in the same way.} consist of a series of items on which a student receives a score. The matrix $\bm{R}$ has one row per student $i$ and one column per item $j$, so that $R_{ij}$ is student $i$'s score on item $j$ and takes on integer values from 0 to the maximum score on the item. There are many possible models for the $\bm{R}$ matrix data; those are not covered in this document, which assumes that the item parameters have been estimated with a consistent estimator and are treated as being estimated without error.
In an MML model for test data for $N$ individuals, conditional on a set of parameters for a set of $K$ test items, the likelihood of a regression equation is \begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) = \prod_{i=1}^N \left[ \int_{-\infty}^{\infty} \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(\theta_i - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \, \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|\theta_i,\bm{P}_j) \, d\theta_i \right]^{\bm{w}_i} \end{align} where $\mathcal{L}$ is the likelihood\footnote{When survey weights are applied the likelihoods in this document are all pseudo-likelihoods.} of the regression parameters $\bm{\beta}$ with full sample weights $\bm{w}_i$ conditional on item score matrix $\bm{R}$, student covariate matrix $\bm{X}$, and item parameter data $\bm{P}$; $\sigma^2$ is the variance of the regression residual; $\theta_i$ is the $i$th student's latent ability measure, which is being integrated out; and ${\rm Pr}(\bm{R}_{ij}|\theta_i, \bm{P}_j)$ is the probability of individual $i$'s score on test item $j$, conditional on the student's ability and item parameters $\bm{P}_j$--see the appendix for example forms of ${\rm Pr}(\bm{R}_{ij}|\theta_i, \bm{P}_j)$. In the special case where $\bm{X}$ is a vector of all ones, the value of $\bm{\beta}$ is the scalar (weighted) mean.
The integral is evaluated using the trapezoid rule\footnote{The trapezoid rule is generally $o(h^2)$; for students where the likelihood is essentially zero at the edges it is $o(h^4)$ because the function is effectively periodic and the $o(h^2)$ terms cancel.} at quadrature points $t_q$ with spacing $\delta$ so that \begin{align} \mathcal{L} (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \prod_{i=1}^N \left[ \sum_{q=1}^Q \delta \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|t_q, \bm{P}_j)\right]^{\bm{w}_i} \end{align} where $\delta$ is the distance between any two adjacent, uniformly spaced quadrature points, so that $\delta = t_{q+1} - t_{q}$ for any $q$ that is at least one and less than $Q$. The range of the $t_q$ and the value of $Q$ parameterize the quadrature and its accuracy and should be varied to assure convergence. The advantage of the trapezoid rule is that the fixed quadrature points allow the values of the item probabilities to be calculated once per student.
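As a concrete illustration of the quadrature, the Python sketch below (assuming NumPy; all function names are hypothetical, not part of the package) evaluates a single student's bracketed term on a fixed grid, using the 3PL response model from the appendix for the item probabilities.

```python
import numpy as np

def item_prob(theta, a, d, g, D=1.7):
    """3PL probability of a correct response at ability theta (see appendix)."""
    return g + (1 - g) / (1 + np.exp(-D * a * (theta - d)))

def marginal_likelihood_i(x_i, beta, sigma, responses, items, t, delta):
    """Trapezoid-rule approximation of one student's marginal likelihood.

    t is a uniform grid of quadrature points with spacing delta;
    responses holds 0/1 item scores; items holds (a, d, g) tuples.
    """
    mu = x_i @ beta
    normal = np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    item_lik = np.ones_like(t)
    for r, (a, d, g) in zip(responses, items):
        p = item_prob(t, a, d, g)
        item_lik *= p if r == 1 else (1 - p)
    # Because the t_q are fixed, item_lik can be computed once per student
    # and cached; only the normal kernel depends on beta and sigma.
    return delta * np.sum(normal * item_lik)
```

With a grid wide enough that the integrand vanishes at the edges, the no-item case recovers the normal density's unit integral, which is a quick check on the grid choice.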
The variance formulas use the log-likelihood, given by \begin{align} \ell (\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X}, \bm{P}) &= \sum_{i=1}^N \bm{w}_i \, {\rm log} \left[\delta \sum_{q=1}^Q \frac{1}{\sigma \sqrt{2\pi}} \exp \frac{-(t_q - \bm{X}_i \bm{\beta})^2}{2\sigma^2} \prod_{j=1}^K {\rm Pr}(\bm{R}_{ij}|t_q, \bm{P}_j) \right] \end{align} Note that $\delta$ can be removed for optimization; its presence adds ${\rm log}(\delta) \sum \bm{w}_i$ to the log-likelihood.
There are several ways to estimate the variance of the parameters $\bm{\beta}$.\footnote{Strictly speaking $\sigma^2$ is also a parameter, but we are rarely interested in the variance of the variance. Nevertheless, the package generates an estimate of $\sigma^2$ along with the coefficients themselves. For notational simplicity, all formulas ignore this.}
The inverse Hessian matrix is a consistent estimator of the variance when the estimator of $\bm{\beta}$ is consistent (Greene, 2003, p. 520):
\begin{align}
{\rm Var}(\bm{\beta}) = -\bm{H}(\bm{\beta})^{-1} = - \left[\frac{\partial^2 \ell(\bm{\beta}, \sigma|\bm{w}, \bm{R}, \bm{X})}{\partial \bm{\beta}^2} \right]^{-1}
\end{align}
This is the variance returned when the variance method is set to consistent
or left as the default.
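As a numerical sketch of this estimator (assuming NumPy; names are hypothetical), the fragment below forms a central-difference Hessian of the weighted normal log-likelihood, the no-measurement-error special case of the likelihood above, and inverts its negative. For an intercept-only model the result matches the analytic $\sigma^2/\sum_i w_i$.

```python
import numpy as np

def loglik(beta, y, X, w, sigma):
    """Weighted normal log-likelihood (the no-measurement-error special case)."""
    resid = y - X @ beta
    return np.sum(w * (-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2)))

def numerical_hessian(f, beta, eps=1e-5):
    """Central-difference Hessian of f at beta."""
    k = len(beta)
    H = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            bpp = beta.copy(); bpp[i] += eps; bpp[j] += eps
            bpm = beta.copy(); bpm[i] += eps; bpm[j] -= eps
            bmp = beta.copy(); bmp[i] -= eps; bmp[j] += eps
            bmm = beta.copy(); bmm[i] -= eps; bmm[j] -= eps
            H[i, j] = (f(bpp) - f(bpm) - f(bmp) + f(bmm)) / (4 * eps**2)
    return H

def consistent_variance(y, X, w, sigma, beta_hat):
    """Var(beta) = -H(beta)^{-1}, the consistent estimator above."""
    H = numerical_hessian(lambda b: loglik(b, y, X, w, sigma), beta_hat)
    return -np.linalg.inv(H)
```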
A class of variance estimators typically called ``sandwich'' or ``robust'' variance estimators allows for heteroskedasticity in the residuals and is of the form \begin{align} {\rm Var}(\bm{\beta}) = \bm{H}(\bm{\beta})^{-1} \bm{V} \bm{H}(\bm{\beta})^{-1} \label{eq:sandwich} \end{align} where $\bm{V}$ is an estimate of the variance of the summed score function (Binder, 1983).
For a convenience sample we provide two robust estimators. First, the so-called robust
(Huber, or Huber-White) variance estimator (Huber, 1967; White, 1980) uses
\begin{align}
\bm{V} &= \sum_{i=1}^N \left[ \frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \bm{\beta}} \right] \left[ \frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \bm{\beta}} \right]^{'}
\end{align}
Second, for the cluster robust
case the partial derivatives are summed within the cluster so that
\begin{align}
\bm{V} &= \sum_{c=1}^{n^\prime} \left[ \frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \bm{\beta}} \right] \left[ \frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \bm{\beta}} \right]^{'}
\end{align}
where there are $n^\prime$ clusters, indexed by $c$, and the partial derivatives are summed within each cluster, of which there are $n_c$ members:
\begin{align}
\frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_c, \bm{R}_c, \bm{X}_c)}{\partial \bm{\beta}} &= \sum_{i=1}^{n_c} \frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \bm{\beta}}
\end{align}
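The Huber-White and cluster-robust forms differ only in whether per-observation gradients are summed within clusters before the outer products are accumulated. A minimal NumPy sketch (hypothetical names; the gradients and Hessian are taken as already computed):

```python
import numpy as np

def sandwich_variance(grads, H, clusters=None):
    """Sandwich estimator H^{-1} V H^{-1} from per-observation scores.

    grads: (N, k) gradients of each observation's log-likelihood term;
    H: (k, k) Hessian of the log-likelihood at the estimate;
    clusters: optional length-N labels; when given, gradients are summed
    within each cluster first (the cluster-robust case).
    """
    if clusters is None:
        clusters = np.arange(grads.shape[0])  # each unit its own cluster (Huber-White)
    clusters = np.asarray(clusters)
    k = grads.shape[1]
    V = np.zeros((k, k))
    for c in np.unique(clusters):
        s = grads[clusters == c].sum(axis=0)  # summed score within cluster c
        V += np.outer(s, s)
    Hinv = np.linalg.inv(H)
    return Hinv @ V @ Hinv
```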
We also provide two survey sampling variance estimation techniques. The first uses replicate weights, either from the jackknife (including Fay's method) or from balanced repeated replication.
In this approach the typical method of estimating sampling variance still works, and the sampling covariance matrix can be calculated as
\begin{align}
{\rm Var}(\bm{\beta}) &= \sum_{j=1}^J \left(\bm{\beta}_j - \bm{\beta}_0 \right) \left(\bm{\beta}_j - \bm{\beta}_0 \right)^{'}
\end{align}
where there are $J$ replicate weights and the result of applying direct estimation under the set of weights $j$ is $\bm{\beta}_j$,
while $\bm{\beta}_0$ is the estimate of $\bm{\beta}$ under the full sample weights. This method is used when replicate
variance estimation is requested.
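The replicate formula above reduces to a sum of outer products of deviations from the full-sample estimate. A minimal sketch (assuming NumPy; names hypothetical), implementing only the unscaled formula shown above; some replicate schemes, such as Fay's method, additionally scale this sum by a constant:

```python
import numpy as np

def replicate_variance(beta_reps, beta_full):
    """Replicate-weight covariance: sum_j (beta_j - beta_0)(beta_j - beta_0)'.

    beta_reps: (J, k) estimates, one row per set of replicate weights;
    beta_full: length-k estimate under the full sample weights.
    """
    dev = np.asarray(beta_reps) - np.asarray(beta_full)
    return dev.T @ dev  # (k, k) sum of outer products over the J replicates
```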
The second survey sampling method is called the Taylor series
method and uses the same formula as eq. \ref{eq:sandwich}, but $\bm{V}$ is the estimate of the variance of the score vector (Binder, 1983). Our implementation assumes a two-stage design with $n_a$ Primary Sampling Units (PSUs) in stratum $a$; $\bm{V}$ is summed across the $A$ strata according to
\begin{align}
\bm{V} &= \sum_{a=1}^A \bm{V}_a
\end{align}
where $\bm{V}_a$ is a variance estimate for stratum $a$ and is defined by
\begin{align}
\bm{V}_a &= \frac{n_a}{n_a -1} \sum_{p=1}^{n_a} \left( \bm{s}_p - \bar{\bm{s}}_a \right)\left( \bm{s}_p - \bar{\bm{s}}_a \right)' \label{eq:Va}
\end{align}
where $\bm{s}_p$ is the sum of the weighted (or pseudo-) score vector over all units in PSU $p$ in stratum $a$ and $\bar{\bm{s}}_a$ is the (unweighted) mean of the $\bm{s}_p$ terms in stratum $a$, so that
\begin{align}
\bm{s}_p &= \sum_{i \in {\rm PSU} \ p}\frac{\partial \ell(\bm{\beta}, \sigma|\bm{w}_i, \bm{R}_i, \bm{X}_i)}{\partial \bm{\beta}} & \bar{\bm{s}}_a &= \frac{1}{n_a} \sum_{p \in {\rm stratum} \ a} \bm{s}_p
\end{align}
When a stratum has only one PSU, $\bm{V}_a$ is undefined. The best approach is for the analyst to adjust the strata and PSU identifiers, in a manner consistent with the sampling approach, so that there are no singleton strata. Two simpler, but less defensible, options are available. First, the strata with single PSUs can be dropped from the variance estimation.
The second option is for the singleton strata to use the overall mean of $\bm{s}_p$ in place of $\bar{\bm{s}}_a$, so that \begin{align} \bar{\bm{s}} &= \frac{1}{n^\prime} \sum_p \bm{s}_p \end{align} where the sum is across all PSUs and $n^\prime$ is the number of PSUs across all strata. Then, for each singleton stratum, eq. \ref{eq:Va} becomes \begin{align} \bm{V}_a &= \left( \bm{s}_p - \bar{\bm{s}} \right)\left( \bm{s}_p - \bar{\bm{s}} \right)' \label{eq:Va1} \end{align} where the value 1 is used in place of $\frac{n_a}{n_a-1}$ and so does not appear in eq. \ref{eq:Va1}.
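The stratum-by-stratum accumulation of $\bm{V}$ can be sketched as follows (assuming NumPy; names hypothetical). The per-observation score contributions are taken as given; singleton strata are either dropped or compared against the overall PSU mean, the two fallbacks just described:

```python
import numpy as np

def taylor_series_V(scores, stratum, psu, drop_singletons=True):
    """V = sum_a [n_a/(n_a - 1)] sum_p (s_p - sbar_a)(s_p - sbar_a)'.

    scores: (N, k) per-observation weighted score contributions;
    stratum, psu: length-N identifier arrays.
    """
    stratum, psu = np.asarray(stratum), np.asarray(psu)
    k = scores.shape[1]
    # per-PSU summed score vectors s_p, grouped by stratum
    psu_sums = {a: np.array([scores[(stratum == a) & (psu == p)].sum(axis=0)
                             for p in np.unique(psu[stratum == a])])
                for a in np.unique(stratum)}
    overall_mean = np.vstack(list(psu_sums.values())).mean(axis=0)
    V = np.zeros((k, k))
    for a, s in psu_sums.items():
        if len(s) == 1:  # singleton stratum: V_a undefined
            if not drop_singletons:
                dev = s[0] - overall_mean  # compare against overall mean instead
                V += np.outer(dev, dev)
            continue
        dev = s - s.mean(axis=0)
        V += len(s) / (len(s) - 1) * dev.T @ dev
    return V
```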
Binder, D. A. (1983). On the variances of asymptotically normal estimators from complex surveys. International Statistical Review, 51(3), 279--292.
Cohen, J. D., & Jiang, T. (1999). Comparison of partially measured latent traits across nominal subgroups. Journal of the American Statistical Association, 94(448), 1035--1044.
Greene, W. H. (2003). Econometric Analysis. Upper Saddle River, NJ: Prentice Hall.
Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. I, pp. 221--233). Berkeley, CA: University of California Press.
Lord, F. M., & Novick, M. R. (1968). Statistical Theories of Mental Test Scores. Reading, MA: Addison-Wesley.
McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). London: Chapman & Hall/CRC.
NAEP (2008). The Generalized Partial Credit Model. NAEP Technical Documentation Website. Retrieved December 24, 2018, from https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_models_gen.aspx
White, H. (1980). A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica, 48(4), 817--838.
For all cases scored as either correct or incorrect, we use the three parameter logit (3PL) model. \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= g_j + \frac{1-g_j}{1+\exp\left[ -D \, a_j \, (\theta_i - d_j)\right]} \label{eq:3PL} \end{align} where $g_j$ is the guessing parameter, $a_j$ is the discrimination factor, $d_j$ is the item difficulty, and $D$ is a constant, usually set to 1.7, to map the $\theta_i$ and $d_j$ terms to a probit-like space; this term is applied by tradition.
When a two parameter logit (2PL) is used eq. \ref{eq:3PL} is modified to omit $g_j$ (effectively setting it to zero) \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ -D \, a_j \, (\theta_i - d_j)\right]} \label{eq:2PL} \end{align}
When a Rasch model is used eq. \ref{eq:2PL} is further modified to set all $a_j$ to a single $a$, while $D$ is set to one. \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[ - a \, (\theta_i - d_j)\right]} \label{eq:Rasch} \end{align}
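The three dichotomous response functions nest cleanly, which the following sketch makes explicit (assuming NumPy; function names are hypothetical):

```python
import numpy as np

def prob_3pl(theta, a, d, g, D=1.7):
    """3PL probability of a correct response (the form above)."""
    return g + (1 - g) / (1 + np.exp(-D * a * (theta - d)))

def prob_2pl(theta, a, d, D=1.7):
    """2PL: the 3PL with the guessing parameter fixed at zero."""
    return prob_3pl(theta, a, d, g=0.0, D=D)

def prob_rasch(theta, d, a=1.0):
    """Rasch: a single shared discrimination a and D fixed at one."""
    return prob_3pl(theta, a, d, g=0.0, D=1.0)
```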
The Graded Response Model (GRM) has a response probability that generalizes an ordered logit (McCullagh and Nelder, 1989) \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{1}{1+\exp\left[-D\, a_j \, (\theta_i - d_{R_{ij},j})\right]} - \frac{1}{1+\exp\left[-D\, a_j \, (\theta_i - d_{1+R_{ij},j})\right]} \label{eq:grm} \end{align} Here the parameters $\bm{P}_j$ are the cut points $d_{cj}$, where $d_{0j}=-\infty$ and $d_{C+1,j}=\infty$. In the first term on the right side of eq. \ref{eq:grm}, the subscript $R_{ij}$ on $d_{R_{ij},j}$ indicates that it is the cut point associated with person $i$'s response level on item $j$, while the last subscript ($j$) indicates that it is the $d$ term for item $j$. The second term uses the next cut point above it.
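A short sketch of the GRM probability (assuming NumPy; hypothetical names), with the infinite boundary cut points appended so the difference of the two logistic terms telescopes to a proper distribution over response levels:

```python
import numpy as np

def prob_grm(theta, a, cuts, r, D=1.7):
    """GRM probability of score r: F at the cut below r minus F at the cut above.

    cuts: increasing interior cut points d_1..d_C; the boundary cases
    d_0 = -inf and d_{C+1} = +inf are appended here. r runs from 0 to C.
    """
    d = np.concatenate(([-np.inf], np.asarray(cuts, dtype=float), [np.inf]))
    def F(cut):
        return 1.0 / (1.0 + np.exp(-D * a * (theta - cut)))
    return F(d[r]) - F(d[r + 1])
```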
The Generalized Partial Credit Model (GPCM) has a response probability that generalizes a multinomial logit (McCullagh and Nelder, 1989) \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[ \sum_{c=0}^{R_{ij}} D a_j (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C} \exp \left[ \sum_{c=0}^{r} D a_j (\theta_i - d_{cj}) \right]} \end{align} where $c$ indexes cut points, of which there are $C$, and $j$ indexes the item.
There is an indeterminacy in the GPCM equation: the $c=0$ term appears in the numerator and in every term of the denominator, so shifting $d_{0j}$ rescales both by the same factor and leaves the probabilities unchanged. The indeterminacy can be resolved in several ways; we describe three. First, it can be resolved by fixing $d_{0j}$ at a constant, such as $d_{0j}=1$.
Second, the indeterminacy can be resolved by setting $d_{0j}=\theta_i$, so that the $c=0$ term drops out, which leads to \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[ \sum_{c=1}^{R_{ij}} D a_j (\theta_i - d_{cj}) \right]}{1+\sum_{r=1}^{C} \exp \left[ \sum_{c=1}^{r} D a_j (\theta_i - d_{cj}) \right]} \end{align} and the numerator is one when $\bm{R}_{ij}$ is zero.
Third, an additional mean difficulty ($b_j$) can be estimated, and the $d_{cj}$ values are then given by \begin{align} d_{0j} &= b_j & d_{cj} &= b_j + \delta_{jc} \, ; \, 1 \leq c \leq C \end{align} where the $\delta_{jc}$ values (not to be confused with the quadrature spacing $\delta$) are estimated so that $0=\sum_{c=1}^{C} \delta_{jc}$. This last method is the one used by NAEP (NAEP, 2008).
When a Partial Credit Model (PCM) is used, the value of $D$ is set to one while the discrimination is again shared across all items, so that \begin{align} \Pr(\bm{R}_{ij}|\theta_i, \bm{P}_j) &= \frac{\exp \left[ \sum_{c=0}^{R_{ij}} a (\theta_i - d_{cj}) \right]}{\sum_{r=0}^{C} \exp \left[ \sum_{c=0}^{r} a (\theta_i - d_{cj}) \right]} \end{align}
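The GPCM and PCM probabilities can be sketched directly from the cumulative sums in the equations above (assuming NumPy; hypothetical names; $d_0$ is simply taken as given, under one of the identification conventions described):

```python
import numpy as np

def prob_gpcm(theta, a, d, r, D=1.7):
    """GPCM probability of score r, for r = 0..C.

    d: cut points d_0..d_C (d_0 fixed by an identification convention).
    """
    z = np.cumsum(D * a * (theta - np.asarray(d, dtype=float)))  # z[r] = sum_{c<=r}
    z -= z.max()                  # log-sum-exp stabilization before exponentiating
    num = np.exp(z)
    return num[r] / num.sum()

def prob_pcm(theta, d, r, a=1.0):
    """PCM: the GPCM with D = 1 and a shared discrimination a."""
    return prob_gpcm(theta, a, d, r, D=1.0)
```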