#' Marginal Maximum Likelihood Estimation of Linear Models
#' @description
#' Implements survey-weighted marginal maximum likelihood estimation, a type of
#' regression where the outcome is a latent trait (such as student ability).
#' Instead of using a point estimate of the latent trait, the likelihood function
#' marginalizes over student ability. Includes a variety of variance estimation strategies.
#'
#' @param formula a formula object in the style of \code{lm}
#' @param stuItems a list where each element is named with a student ID and contains
#' a \code{data.frame}; see Details for the format
#' @param stuDat a \code{data.frame} with a single row per student. Predictors in
#' the formula must be in \code{stuDat}.
#' @param paramTab a \code{data.frame} with columns shown in Details
#' @param Q the number of integration points
#' @param polyModel polytomous response model;
#' one of \code{GPCM} for the Generalized Partial Credit Model
#' or \code{GRM} for the Graded Response Model
#' @param regType one of \code{regression} or \code{popMean} where the latter
#' estimates a population level mean
#' @param weightvar a variable name on \code{stuDat} that is the full sample weight
#' @param control a list with four elements that control the fitting process. See Details.
#' @param idVar a variable name on \code{stuDat} that is the identifier. Every
#' ID from \code{stuDat} must appear on \code{stuItems} and vice versa.
#' @param missingCode the score value that indicates an item response is missing.
#' An item scored as \code{NA} is ignored, whereas an item scored as
#' \code{missingCode} is recoded to \code{missingValue}.
#' This argument applies exclusively to binomial items.
#' @param missingValue the value assigned to items scored as \code{missingCode}.
#' When set to a number, that value is used for all items.
#' When set to \dQuote{\code{c}}, the item's guessing parameter
#' is used instead.
#' @param multiCore when \code{TRUE}, uses the \code{foreach} package to parallelize
#' the likelihood evaluations. You should have already called
#' \ifelse{latex}{the \code{registerDoParallel} function in the \code{doParallel} package}{\code{\link[doParallel]{registerDoParallel}}}.
#' @param bobyqaControl a list that gets passed to \ifelse{latex}{the \code{bobyqa} optimizer in \code{minqa}}{\code{\link[minqa]{bobyqa}}}
#' @details
#'
#' The \code{mml} function models a latent outcome conditional on student
#' item response data, student covariate data, and item parameter information;
#' these three parts are broken up into three arguments.
#' Student item response data go into \code{stuItems}, whereas student
#' covariates, weights, and sampling information go into \code{stuDat}.
#' The \code{paramTab}
#' contains item parameter information for each item---the result of a
#' separate item parameter scaling. In the case of
#' the National Assessment of Educational Progress (NAEP),
#' they can be found online, for example, at
#' \href{https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt.aspx}{https://nces.ed.gov/nationsreportcard/tdw/analysis/scaling_irt.aspx}.
#' The model for dichotomous response data is, by default, the three Parameter Logit
#' (3PL) model, unless the item parameter information provided by users suggests
#' otherwise. For example, if the scaling used a two Parameter Logit (2PL) model,
#' then the guessing parameter can simply be set to zero. For polytomous
#' response data, the model is dictated by the \code{polyModel} argument.
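#'
#' For instance, a minimal sketch of treating all items as 2PL (assuming the
#' guessing parameter is stored in the \code{P2} column described below) is:
#' \preformatted{
#' # hypothetical: zero out the guessing column so the 3PL reduces to a 2PL
#' paramTab$P2 <- 0
#' }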
#'
#' Student data are broken up into two parts. The item response data go
#' into \code{stuItems}, and the student covariates for the formula go into
#' \code{stuDat}. Information about items, such as item difficulties, is in
#' \code{paramTab}. All dichotomous items are assumed to be
#' 3PL, though by setting the guessing parameter to zero, the user
#' can fit a 2PL, a one Parameter Logit (1PL), or a Rasch model.
#' The model for polytomous response data is dictated by the \code{polyModel}
#' argument.
#'
#' The marginal maximum likelihood then integrates, over student ability, the
#' product of the likelihood of each student's item responses from the
#' assessment data and the normal density implied by the linear model in
#' \code{formula}, which includes a residual standard error term. This
#' integration runs from the minimum node to the maximum node set in the
#' \code{control} argument (described later in this section) with \code{Q}
#' quadrature points.
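#'
#' Conceptually, for student \eqn{i} with item responses \eqn{x_i} and covariate
#' row \eqn{X_i}, the likelihood contribution being approximated can be sketched as
#' \deqn{\ell_i(\beta, \sigma) = \int P(x_i \mid \theta) \phi(\theta; X_i\beta, \sigma^2) d\theta}{l_i(beta, sigma) = integral P(x_i | theta) phi(theta; X_i beta, sigma^2) dtheta}
#' where \eqn{\phi} is the normal density; the integral is replaced by a sum of the
#' integrand evaluated at the \code{Q} equally spaced quadrature nodes.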
#'
#' The \code{stuItems} argument has the scored student data. It is a list where
#' each element is named with a student ID and contains
#' a \code{data.frame} that includes the \code{idVar} column and at least
#' two more columns. The first required column is named
#' \code{key} and gives the item name as it appears in \code{paramTab};
#' the second is named
#' \code{score} and gives the score for that item. For binomial
#' items, the \code{score} is 0 or 1. For \code{GPCM} items, the scores
#' start at zero as well. For \code{GRM}, the scores start at 1.
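#'
#' For example, a hypothetical \code{stuItems} list for two students with IDs
#' \code{A1} and \code{A2} (with \code{sid} as the \code{idVar} column and
#' \code{item1} and \code{item2} as placeholder item names) might be:
#' \preformatted{
#' stuItems <- list(
#'   A1 = data.frame(sid = "A1", key = c("item1", "item2"), score = c(0, 1)),
#'   A2 = data.frame(sid = "A2", key = c("item1", "item2"), score = c(1, 1))
#' )
#' }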
#'
#' The \code{paramTab} argument is a \code{data.frame} with a column named
#' \code{ItemID} that agrees with
#' the \code{key} column in the \code{stuItems} argument,
#' and, for a 3PL item, columns \code{P0},
#' \code{P1}, and \code{P2} for the \dQuote{a}, \dQuote{d}, and
#' \dQuote{g} parameters, respectively; see the vignette for details of
#' the 3PL model.
#' For a \code{GPCM} model, \code{P0} is the \dQuote{a} parameter, and the other
#' columns are the \dQuote{d} parameters; see the vignette for details
#' of the GPCM model.
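#'
#' A minimal, hypothetical \code{paramTab} for the two 3PL items above might be:
#' \preformatted{
#' paramTab <- data.frame(ItemID = c("item1", "item2"),
#'                        P0 = c(1.1, 0.9),    # a (discrimination)
#'                        P1 = c(0.2, -0.5),   # d (difficulty)
#'                        P2 = c(0.20, 0.25))  # g (guessing); 0 for a 2PL item
#' }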
#'
#' The \code{control} argument is a list with optional elements: \code{D}, the
#' scale parameter, which defaults to 1.7; \code{startVal}, the starting
#' values for the coefficients; and \code{min.node} and \code{max.node}, which
#' set the range of nodes for all students and default to
#' -4 and 4, respectively. The quadrature points then range
#' from \code{min.node} to \code{max.node} with a total of \code{Q} nodes.
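#'
#' Putting the pieces together, a structural sketch of a call (using the
#' hypothetical \code{stuItems} and \code{paramTab} objects above, with
#' \code{sid} as the ID variable and \code{wgt} as a full sample weight) is:
#' \preformatted{
#' stuDat <- data.frame(sid = c("A1", "A2"), wgt = c(1.5, 0.8))
#' fit <- mml(~ 1, stuItems = stuItems, stuDat = stuDat,
#'            paramTab = paramTab, idVar = "sid", weightvar = "wgt",
#'            control = list(min.node = -5, max.node = 5))
#' }
#' In practice the formula would include covariates present on \code{stuDat},
#' and the data would contain many more students.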
#'
#' @return object of class \code{mml.means}.
#' This is a list with elements:
#' \item{call}{the call used to generate this \code{mml.means} object}
#' \item{coefficients}{the marginal maximum likelihood regression coefficients, including the estimated residual standard error}
#' \item{LogLik}{the log-likelihood of the fit model}
#' \item{X}{the design matrix of the marginal maximum likelihood regression}
#' \item{Convergence}{a convergence note from the \code{bobyqa} optimizer}
#' \item{location}{used for scaling the estimates}
#' \item{scale}{used for scaling the estimates}
#' \item{lnlf}{the likelihood function}
#' \item{rr1}{the density function of each individual, conditional only on item responses in \code{stuItems}}
#' \item{stuDat}{the \code{stuDat} argument}
#' \item{weightvar}{the weight variable}
#' \item{nodes}{the nodes the likelihood was evaluated on}
#' \item{iterations}{the number of iterations required to reach convergence}
#' \item{obs}{the number of observations used}
#'
#' @example man/examples/de.R
#' @author Harold Doran, Paul Bailey, Claire Kelley, and Sun-joo Lee
#' @export
#' @importFrom minqa bobyqa
#' @importFrom stats as.formula model.matrix coef dbinom sd
#' @importFrom utils head
mml <- function(formula, stuItems, stuDat, paramTab, Q=30, polyModel=c('GPCM', 'GRM'),
regType=c('regression', 'popMean'),
weightvar=NULL,
control=list(),
idVar=c(),
missingCode=8, # the NAEP missing code
missingValue="c",
multiCore=FALSE,
bobyqaControl=list()) {
call <- match.call()
polyModel <- match.arg(polyModel)
regType <- match.arg(regType)
polyModel <- tolower(polyModel)
if(length(idVar) != 1 || !idVar %in% names(stuDat)) {
stop(paste0(dQuote("idVar"), " must be a variable on stuDat to confirm the ", dQuote("stuDat"), " rows agree with the ", dQuote("stuItems"), " list."))
}
if(!inherits(stuDat, "data.frame")) {
stop(paste0("Argument ", dQuote("stuDat"), " must be a data frame."))
} else {
if(length(class(stuDat))>1) {
# not all data.frames are the same, recast
stuDat <- as.data.frame(stuDat)
}
}
if(!inherits(paramTab, "data.frame")) {
stop(paste0("Argument ", dQuote("paramTab"), " must be a data frame."))
} else {
if(length(class(paramTab))>1) {
# not all data.frames are the same, recast
paramTab <- as.data.frame(paramTab)
}
}
if(any(!names(stuItems) %in% stuDat[[idVar]])) {
missing <- names(stuItems)[!names(stuItems) %in% stuDat[[idVar]]]
stop(paste0("The ", dQuote("stuItems"), " argument must be a list with names that correspond to every " , dQuote("idVar"), " in ", dQuote("stuDat"), ". some missing IDs ", pasteItems(dQuote(head(missing,5))), "."))
}
if(any(!stuDat[[idVar]] %in% names(stuItems))) {
missing <- stuDat[[idVar]][!stuDat[[idVar]] %in% names(stuItems)]
stop(paste0("The ", dQuote("stuDat"), " argument must be a data.frame with a column ", dQuote("idVar"), " that correspond to every name of ", dQuote("stuItems"), ". some missing IDs ", pasteItems(dQuote(head(missing,5))), "." ))
}
# make sure stuItems and stuDat are in the same order
stuItems <- stuItems[stuDat[[idVar]]]
# make sure stuItems (now stu) has only data.frames in each element
stu <- lapply(stuItems, function(x) {
return(as.data.frame(x)[c(idVar, 'key', 'score')])
})
# subset to students who have at least one valid score
stuValid <- unlist(lapply(stu, function(x) {
return(!all(is.na(x$score)))
}))
stu <- stu[stuValid]
stuDat <- stuDat[stuValid,]
# now form X
if (regType == 'regression') {
if(missing(formula)) {
formula <- ~ 1
} # end if(missing(formula))
X <- model.matrix(formula, stuDat)
K <- ncol(X) # number of fixed parameters to estimate
nms <- c(colnames(X), 'Population SD')
startVal <- c(rep(0, K), 1)
if(nrow(X) != nrow(stuDat)) {
stop("Missing not allowed in independent variables.")
}
}
if (regType == 'popMean') {
startVal <- c(0, 1)
nms <- c('Population Mean', 'Population SD')
}
# setup controls, including startVal
con <- list(D = 1.7, startVal = startVal, min.node = -4, max.node = 4)
con[names(control)] <- control
### These are the exported functions passed to the dopar function
nodes <- seq(from = con$min.node, to = con$max.node, length.out = Q)
### This portion of the code computes all the likelihood evaluations
### and does so outside of the function that is maximized
### This saves a lot of overhead by using fixed quadrature points
if(multiCore) {
rr1 <- calcRR1_dopar(stu, Q, polyModel, paramTab, missingCode, missingValue, con$D, nodes)
} else {
rr1 <- calcRR1(stu, Q, polyModel, paramTab, missingCode, missingValue, con$D, nodes)
}
if (regType == 'popMean') {
regType <- "regression"
X <- matrix(1, nrow=ncol(rr1), ncol=1)
}
if (regType == 'regression') {
fn2 <- fn.regression(X_=X, i=NULL, wv=weightvar, rr1=rr1, nodes=nodes, stuDat=stuDat)
opt <- bobyqa(con$startVal, fn2, control=bobyqaControl)
opt$par[length(opt$par)] <- abs(opt$par[length(opt$par)])
iter <- opt$feval
convergence <- opt$msg
names(opt$par) <- c(colnames(X), "s")
}
  # get location and scale from paramTab, defaulting to the identity transformation
  location <- 0
  scale <- 1
  if(!is.null(attr(paramTab, "location"))) {
    location <- attr(paramTab, "location")
    if(!is.null(attr(paramTab, "scale"))) {
      scale <- attr(paramTab, "scale")
    }
  }
if(is.null(weightvar)) {
obs <- sum(apply(rr1,2,sum)>0)
} else {
obs <- sum(stuDat[,weightvar]>0 & apply(rr1,2,sum)>0) # number with positive weight
}
coefficients <- opt$par
names(coefficients) <- nms
structure(list("call" = call,
"coefficients" = coefficients,
"LogLik" = -1/2*opt$fval,
"X" = X,
"Convergence" = convergence,
"location" = location,
"scale" = scale,
"lnlf"= fn2,
"rr1"= rr1,
"stuDat" = stuDat,
"weightvar" = weightvar,
"nodes" = nodes,
"iterations" = iter,
"obs" = obs),
class = "mml.means")
}