mml: Marginal Maximum Likelihood Estimation of Linear Models
In American-Institutes-for-Research/DE: Direct Estimation

Implements a survey-weighted marginal maximum likelihood estimation, a type of regression where the outcome is a latent trait (such as student ability). Instead of using an estimate of student ability, the likelihood function marginalizes over it. Includes a variety of variance estimation strategies.

mml(formula, stuItems, stuDat, paramTab, Q = 30, polyModel = c("GPCM",
  "GRM"), regType = c("regression", "popMean"), weightvar = NULL,
  control = list(), idVar = c(), missingCode = 8,
  missingValue = "c", multiCore = FALSE, bobyqaControl = list())

`formula`	a formula object in the style of `lm`
`stuItems`	a list where each element is named with a student ID and contains a `data.frame`; see Details for the format
`stuDat`	a `data.frame` with a single row per student. Predictors in the formula must be in `stuDat`.
`paramTab`	a `data.frame` with columns shown in Details
`Q`	the number of integration points
`polyModel`	polytomous response model; one of `GPCM` for the Generalized Partial Credit Model or `GRM` for the Graded Response Model
`regType`	one of `regression` or `popMean` where the latter estimates a population level mean
`weightvar`	a variable name on `stuDat` that is the full sample weight
`control`	a list with four elements that control the fitting process. See Details.
`idVar`	a variable name on `stuDat` that is the identifier. Every ID from `stuDat` must appear on `stuItems` and vice versa.
`missingCode`	the value a score is set to that indicates the item is missing. An item scored as `NA` will be ignored. The `missingCode` argument allows the user to recode scores to `missingValue`. This argument applies exclusively to binomial items.
`missingValue`	the value to set items scored as `missingCode`. When set to a number, that value is used for all items. When set to `"c"`, the guessing parameter is used.
`multiCore`	allows the `foreach` package to be used. You should have already called `registerDoParallel`.
`bobyqaControl`	a list that gets passed to `bobyqa`
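
As an illustration of how `missingCode` and `missingValue` interact for dichotomous items, the sketch below recodes a vector of scores. The helper `recodeMissing` and its `guess` argument are hypothetical names introduced here; they are not part of the package, and the package's internal handling may differ.

```r
# Hypothetical helper (not part of the package): recode scores equal to
# missingCode. `guess` holds the guessing parameter of each scored item.
recodeMissing <- function(score, guess, missingCode = 8, missingValue = "c") {
  # NA scores are left alone; only scores equal to missingCode are recoded
  isMissing <- !is.na(score) & score == missingCode
  if (identical(missingValue, "c")) {
    # use each item's own guessing parameter
    score[isMissing] <- guess[isMissing]
  } else {
    # use the same numeric value for every item
    score[isMissing] <- as.numeric(missingValue)
  }
  score
}

score <- c(1, 0, 8, NA, 8)
guess <- c(0.15, 0.17, 0.22, 0.08, 0.06)
recodeMissing(score, guess)                    # 1, 0, 0.22, NA, 0.06
recodeMissing(score, guess, missingValue = 0)  # 1, 0, 0, NA, 0
```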

The mml function models a latent outcome conditioning on student item response data, student covariate data, and item parameter information; these three parts are passed in three separate arguments. Student item response data go into stuItems, whereas student covariates, weights, and sampling information go into stuDat. The paramTab argument contains item parameter information for each item, the result of a separate item parameter scaling. For the National Assessment of Educational Progress (NAEP), item parameters can be found online, for example, at https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt.aspx. The model for dichotomous response data is by default the three-parameter logistic (3PL) model, unless the item parameter information provided by the user suggests otherwise. For example, if the scaling used a two-parameter logistic (2PL) model, then the guessing parameter can simply be set to zero. For polytomous response data, the model is dictated by the polyModel argument.

Student data are broken up into two parts. The item response data go into stuItems, and the student covariates for the formula go into stuDat. Information about items, such as item difficulties, is in paramTab. All dichotomous items are assumed to be 3PL, though by setting the guessing parameter to zero, the user can use a 2PL, one-parameter logistic (1PL), or Rasch model. The model for polytomous response data is dictated by the polyModel argument.

The marginal maximum likelihood then integrates, over student ability, the product of two terms: the likelihood of the student's item responses given ability, and the density of ability implied by the linear model, which predicts each student's ability from the formula provided plus a residual standard error term. This integration runs from the minimum node to the maximum node set in the control argument (described later in this section) using Q quadrature points.
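
In symbols, one common way to write the weighted log-likelihood described above is sketched below. The notation is an assumption of this sketch, not taken from the package: z_i is student i's vector of scored responses, x_i the student's row of the design matrix, beta the regression coefficients, sigma the residual standard error, w_i the full sample weight, phi the standard normal density, and theta_q the Q quadrature nodes, assumed here to be equally spaced with spacing Delta.

```latex
\ell(\beta, \sigma)
  = \sum_{i} w_i \,\log \int \Pr(z_i \mid \theta)\,
      \frac{1}{\sigma}\,\phi\!\left(\frac{\theta - x_i^{\top}\beta}{\sigma}\right) d\theta
  \;\approx\; \sum_{i} w_i \,\log \sum_{q=1}^{Q} \Pr(z_i \mid \theta_q)\,
      \frac{1}{\sigma}\,\phi\!\left(\frac{\theta_q - x_i^{\top}\beta}{\sigma}\right) \Delta
```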

The stuItems argument holds the scored student data. It is a list where each element is named with a student ID and contains a data.frame with at least two columns. The first required column is named key and gives the item name as it appears in paramTab; the second column is named score and gives the score for that item. For binomial items, the score is 0 or 1. For GPCM items, the scores start at zero as well. For GRM, the scores start at 1.
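
A minimal sketch of this layout, using made-up student IDs and item names rather than real assessment data:

```r
# Toy scored responses for two students; IDs and item names are invented.
stuItems <- list(
  "1001" = data.frame(key   = c("item1", "item2", "item3"),
                      score = c(1, 0, 8)),   # 8 is the default missingCode
  "1002" = data.frame(key   = c("item1", "item2", "item3"),
                      score = c(0, 1, NA))   # NA scores are ignored
)

names(stuItems)       # the student IDs
stuItems[["1001"]]    # one data.frame per student, with key and score columns
```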

The paramTab argument is a data.frame with a column named ItemID that agrees with the key column in the stuItems argument, and, for a 3PL item, columns P0, P1, and P2 for the “a”, “d”, and “g” parameters, respectively; see the vignette for details of the 3PL model. For a GPCM model, P0 is the “a” parameter, and the other columns are the “d” parameters; see the vignette for details of the GPCM model.
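
To make the roles of the three parameters concrete, the sketch below evaluates a 3PL response probability. It assumes the commonly used form g + (1 - g) / (1 + exp(-D a (theta - d))); the vignette remains the authoritative definition, and the function name `p3pl` is introduced here for illustration only.

```r
# Assumed 3PL form; see the package vignette for the authoritative definition.
# theta: ability, a: discrimination (P0), d: difficulty (P1), g: guessing (P2),
# D: scale parameter (1.7 is the documented default in the control argument).
p3pl <- function(theta, a, d, g, D = 1.7) {
  g + (1 - g) / (1 + exp(-D * a * (theta - d)))
}

# when theta equals the difficulty, the probability is halfway between g and 1
p3pl(theta = 1, a = 1.05, d = 1, g = 0.22)   # 0.61
# setting the guessing parameter to zero gives a 2PL item
p3pl(theta = 1, a = 1.05, d = 1, g = 0)      # 0.5
```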

The control argument is a list with optional elements: D, the scale parameter, which defaults to 1.7; startVal, the starting values for the coefficients; and min.node and max.node, which set the range of nodes for all students and default to -4 and 4, respectively. The quadrature points then span the range from min.node to max.node with a total of Q nodes.
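
The sketch below writes the documented defaults out as an explicit control list and builds a grid of nodes from them. Equal spacing of the nodes is an assumption of this sketch; the text above only states the range and the number of nodes.

```r
# control list with the documented defaults spelled out (startVal omitted)
control <- list(D = 1.7, min.node = -4, max.node = 4)
Q <- 30

# Assumption: nodes are equally spaced between min.node and max.node
nodes <- seq(control$min.node, control$max.node, length.out = Q)
length(nodes)   # 30
range(nodes)    # -4  4

# rectangle-rule check: a standard normal density should integrate
# to roughly 1 over this grid
delta <- nodes[2] - nodes[1]
sum(dnorm(nodes) * delta)
```

A wider range or a larger Q gives a finer approximation of the integral at the cost of more likelihood evaluations per student.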

An object of class mml.means. This is a list with elements:

`call`	the call used to generate this `mml.means` object
`coefficients`	the marginal maximum likelihood regression coefficients, including the estimated residual standard error
`LogLik`	the log-likelihood of the fit model
`X`	the design matrix of the marginal maximum likelihood regression
`Convergence`	a convergence note from the `bobyqa` optimizer
`location`	used for scaling the estimates
`scale`	used for scaling the estimates
`lnlf`	the likelihood function
`rr1`	the density function of each individual, conditional only on item responses in `stuItems`
`stuDat`	the `stuDat` argument
`weightvar`	the weight variable
`nodes`	the nodes the likelihood was evaluated on
`iterations`	the number of iterations required to reach convergence
`obs`	the number of observations used

Harold Doran, Paul Bailey, Claire Kelley, and Sun-joo Lee

## Not run: 
# get NAEP Primer data
require(EdSurvey)

# data
sdf <- readNAEP(system.file("extdata/data", "M36NT2PM.dat", package = "NAEPprimer"))
cols <- c("m066401", "m093701", "m086001", "m051901", "m067801", "m046501",
          "origwt", "repgrp1", "jkunit", "dsex")
data <- getData(sdf, varnames=cols, addAttributes=TRUE,
                omittedLevels=FALSE, defaultConditions=FALSE,
                returnJKreplicates=FALSE)

# 3PL items only:
# P0 is the discrimination parameter (a),
# P1 is the item difficulty (d),
# P2 is the guessing parameter (g) 
# polytomous responses could use P3-P10 for more difficulties
paramTab <- structure(list(ItemID = c("m066401", "m093701", "m086001",
                                      "m051901", "m067801", "m046501"),
                           P0 = c(0.68, 1.22, 1.05, 1.6, 0.86, 1.03),
                           P1 = c(-0.33, 1.81, 1, 0.61, -1.61, -0.14),
                           P2 = c(0.15, 0.17, 0.22, 0.08, 0.06, 0.37),
                           P3 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P4 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P5 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P6 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P7 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P8 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P9 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           P10 = c(NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_),
                           ScorePoints = c(1, 1, 1, 1, 1, 1),
                           MODEL = c("3pl", "3pl", "3pl", "3pl", "3pl", "3pl")),
                      row.names = c(1L, 3L, 4L, 5L, 9L, 13L),
                      class = "data.frame", location = 277.1563, scale = 37.7297)
# scores an item as correct if it contains an asterisk and as skipped if it
# is "Omitted", "Not Reached", or "Multiple". The value NA is left as NA.
# this score function is intended to be simple, not to reflect typical NAEP scoring.
simpleScore <- function(col) {
  score0 <- 0+grepl("*", col, fixed=TRUE)
  score1 <- ifelse(col %in% c("Omitted", "Not Reached", "Multiple"), 8, score0)
  score2 <- ifelse(col %in% NA, NA, score1)
  return(score2)
}

# score each item in paramTab
for(name in paramTab$ItemID){
  # show score output vs input data
  print(table(data[,name], simpleScore(data[,name]), useNA="ifany"))
  # score item
  data[,name] <- simpleScore(data[,name])  
}

# make stuItems 
data$id <- 1:nrow(data)
# first make a long data.frame of the item score data
stuItems <- reshape(data=data, varying=c(paramTab$ItemID), idvar=c("id"),
                    direction="long", v.names="score", times=paramTab$ItemID,
                    timevar="key")[,c("id", "key", "score")]
# then break it up into a single data.frame per student
stuItems <- split(stuItems, stuItems$id)

# stuDat is the student covariates, weights, and sampling information
# used for variance estimation
stuDat <- data[, c('origwt', 'repgrp1', 'jkunit', 'dsex', 'id')]

### MML call 
mml1 <- mml(~dsex, stuItems=stuItems, 
            stuDat=stuDat, paramTab=paramTab, 
            regType = 'regression', Q=34, idVar="id", weightvar = "origwt")

# summary, assumes the sample was drawn IID
summary(mml1)
# summary, accounts for correlation between students in the same schools
summary(mml1, varType="Taylor", stratavar="repgrp1", psuvar="jkunit")

## End(Not run)