library(wjake) library(showtext) set_theme(base_family = "Open Sans", plot_margin = ggplot2::margin(10, 10, 10, 10)) font_add_google("Open Sans") showtext_auto() showtext_opts(dpi = 192) knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7.2916667, fig.align = "center", out.width = "90%" ) options(mc.cores = 4, tidyverse.quiet = TRUE)
In this article, we will walk you through the steps for estimating diagnostic classification models (DCMs; also known as cognitive diagnostic models [CDMs]) using measr. We start with the data to analyze, understand how to specify a DCM to estimate, and learn how to customize the model estimation process (e.g., prior distributions).
To use the code in this article, you will need to install and load the measr package.
library(measr)
To demonstrate the model estimation functionality of measr, we'll examine a simulated data set. This data set contains 2,000 respondents and 20 items that measure a total of 4 attributes, but no item measures more than 2 attributes. The data was generated from the loglinear cognitive diagnostic model (LCDM), which is a general model that subsumes many other DCM subtypes [@lcdm]. By using a simulated data set, we can compare the parameter estimates from measr to the true data generating parameters.
library(tidyverse) sim_data <- read_rds("data/simulated-data.rds") sim_data$data sim_data$q_matrix
In measr, DCMs are specified and estimated using the measr_dcm()
function.
We'll start by estimating a loglinear cognitive diagnostic model (LCDM).
The LCDM is a general DCM that subsumes many other DCM subtypes [@lcdm].
First, we specify the data (data
) and the Q-matrix (qmatrix
) that should be used to estimate the model.
Note that these are the only required arguments to the measr_dcm()
function.
If no other arguments are provided, sensible defaults (described below) will take care of the rest of the specification and estimation.
Next, we can specify which columns, if any, in our data
and qmatrix
contain respondent identifiers and item identifiers in, respectively.
If one or more of these variables are not present in the data, these arguments can be omitted, and measr will assign identifiers based on the row number (i.e., row 1 in qmatrix
becomes item 1).
We can then specify the type of DCM we want to estimate.
The current options are "lcdm"
(the default) "dina"
, and "dino"
(see Estimating Other DCM Sub-Types below).
You also have the option to choose which estimation engine to use, via the "backend"
argument.
The default backend is backend = "rstan"
, which will use the rstan package to estimate the model.
Alternatively, you can use the cmdstanr package to estimate the model by specifying backend = "cmdstanr"
.
The cmdstanr package works by using a local installation of Stan to estimate the models, rather than the version that is pre-compiled in rstan.
Once a backend has been chosen, we can supply additional arguments to those specific estimating functions.
In the example below, I specify 500 warm-up iterations per chain, 500 post-warm-up iterations per chain, and 4 cores to run the chains in parallel.
The full set up options available for rstan and cmdstanr can be found by looking at the help pages for rstan::sampling()
and cmdstanr::`model-method-sample`
, respectively.
Finally, because estimating these models can be time intensive, you can specify a file
.
If a file is specified, an R object of the fitted model will be automatically saved to the specified file.
If the specified file
already exists, then the fitted model will be read back into R, eliminating the need to re-estimate the model.
lcdm <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix, resp_id = "resp_id", item_id = "item_id", type = "lcdm", method = "mcmc", backend = "cmdstanr", iter_warmup = 1000, iter_sampling = 500, chains = 4, parallel_chains = 4, file = "fits/sim-lcdm")
Now that we've estimated a model, let's compare our parameter estimates to the true values used to generate the data.
We can start be looking at our estimates using measr_extract()
.
This function extracts different aspects of a model estimated with measr.
Here, the estimate
column reports estimated value for each parameter and a measure of the associated error (i.e., the standard deviation of the posterior distribution).
For example, item A1 measures two attributes and therefore has four parameters:
att2
and att4
).item_parameters <- measr_extract(lcdm, what = "item_param") item_parameters
We can compare these estimates to those that were used to generate the data. In the figure below, most parameters fall on or very close to the dashed line, which represents perfect agreement, indicating that the estimated model is accurately estimating the parameter values.
#| fig.asp: 0.618 #| fig.alt: > #| Figure shows a strong correlation between item parameters, with only a few #| discrepancies off of the line of perfect agreement. library(tidyverse) library(glue) param_compare <- item_parameters %>% left_join(sim_data$true_items, by = c("item_id", "class", "attributes", "coef"), relationship = "one-to-one") %>% mutate(measr_est = map_dbl(estimate, mean), type = case_when(str_detect(coef, "_0") ~ "Intercept", str_detect(coef, "_1") ~ "Main Effect", str_detect(coef, "_2") ~ "Interaction"), type = factor(type, levels = c("Intercept", "Main Effect", "Interaction"))) example_item <- filter(param_compare, item_id == "A15") example_measr <- deframe(select(example_item, attributes, measr_est)) example_true <- deframe(select(example_item, attributes, value)) msr_colors <- c("#8ECAE6", "#023047", "#D7263D") param_compare %>% ggplot(aes(x = measr_est, y = value)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed") + geom_point(aes(color = type, shape = type), size = 3) + scale_color_manual(values = msr_colors) + labs(x = "measr", y = "True Value", color = "Parameter Type", shape = "Parameter Type")
We can also examine the structural parameters, which represent the overall proportion of respondents in each class. Again, we see relatively strong agreement between the estimates from our model and the true generating values.
#| fig.asp: 0.618 #| fig.alt: > #| Figure shows a strong correlation between item parameters, with only a few #| discrepancies off of the line of perfect agreement. true_strc <- create_profiles(4) %>% rowwise() %>% mutate(label = glue("[{att1},{att2},{att3},{att4}]")) %>% ungroup() %>% select(label) %>% mutate(value = sim_data$true_strc) strc_compare <- measr_extract(lcdm, "strc_param") %>% left_join(true_strc, join_by(class == label), relationship = "one-to-one") %>% mutate(measr_est = map_dbl(estimate, mean)) strc_compare %>% ggplot(aes(x = measr_est, y = value)) + geom_abline(intercept = 0, slope = 1, linetype = "dashed") + geom_point(size = 3, color = msr_colors[2]) + expand_limits(x = c(0, 0.1), y = c(0, 0.1)) + labs(x = "measr", y = "True Value")
In the code to estimate the LCDM above, we did not specify any prior distributions in the call to measr_dcm()
.
By default, measr uses the following prior distributions for the LCDM:
default_dcm_priors(type = "lcdm")
As you can see, main effect parameters get a lognormal(0, 1)
prior by default.
Different prior distributions can be specified with the prior()
function.
For example, we can specify a normal(0, 10)
prior for the main effects with:
prior(normal(0, 10), class = "maineffect")
By default, the prior is applied to all parameters in the class (i.e., all main effects).
However, we can also apply a prior to a specific parameter.
For example, here we specify a χ^2^ distribution with 2 degrees of freedom as the default prior for main effects, and an exponential distribution with a rate of 2 for the main effect of attribute 1 on just item 7.
To see all parameters (including class
and coef
) that can be specified, we can use get_parameters()
.
c(prior(chi_square(2), class = "maineffect"), prior(exponential(2), class = "maineffect", coef = "l7_11")) get_parameters(sim_data$q_matrix, item_id = "item_id", type = "lcdm")
Any distribution that is supported by the Stan language can be used as a prior.
A list of all distributions is available in the Stan documentation, and are linked to from the ?prior()
help page.
Priors can be defined before estimating the function, or created at the same time the model is estimated.
For example, both of the following are equivalent.
Here we set the prior for main effects to be a truncated normal distribution with a lower bound of 0.
This is done because the main effects in the LCDM are constrained to be positive to ensure monotonicity in the model.
Additionally note that I've set method = "optim"
.
This means that we will estimate the model using Stan's optimizer, rather than using full Markov Chain Monte Carlo.
Note that the prior still influences the model when using method = "optim"
, just as they do when using method = "mcmc"
(the default).
new_prior <- prior(normal(0, 15), class = "maineffect", lb = 0) new_lcdm <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix, resp_id = "resp_id", item_id = "item_id", type = "lcdm", method = "optim", backend = "cmdstanr", prior = new_prior, file = "fits/sim-lcdm-optim") new_lcdm <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix, resp_id = "resp_id", item_id = "item_id", type = "lcdm", method = "optim", backend = "cmdstanr", prior = c(prior(normal(0, 15), class = "maineffect", lb = 0)), file = "fits/sim-lcdm-optim")
The priors used to estimate the model are saved in the returned model object, so we can always go back and see which priors were used if we are unsure.
We can see that for the new_lcdm
model, our specified normal prior was used for the main effects, but the default priors were still applied to the parameters for which we did not explicitly state a prior distribution.
measr_extract(new_lcdm, "prior")
Although a primary motivation for measr is to provide researchers with software that makes the LCDM readily accessible, a few other popular DCM subtypes are also supported.
For example, we can estimate the deterministic inputs, noisy "and" gate [DINA, @junker2001] or the deterministic inputs, noisy "or" gate [DINO, @dino] models by specifying a different type
in the measr_dcm()
function.
Future development work will continue to add functionality for more DCM subtypes. If there is a specific subtype you are interested in, or would like to see supported, please open an issue on the GitHub repository.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.