library(wjake)
library(showtext)
library(glue)

set_theme(base_family = "Open Sans",
          plot_margin = ggplot2::margin(10, 10, 10, 10))

font_add_google("Open Sans")
showtext_auto()
showtext_opts(dpi = 192)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7.2916667,
  fig.align = "center",
  out.width = "90%"
)

options(mc.cores = 4, tidyverse.quiet = TRUE)
In this article, we describe the different options for evaluating diagnostic classification models (DCMs; also known as cognitive diagnostic models [CDMs]) using measr. We start with the data to be analyzed, estimate our DCMs, and then evaluate different aspects of the models, such as model fit and reliability.
To use the code in this article, you will need to install and load the measr package.
library(measr)
To demonstrate the model fit functionality of measr, we'll use the same simulated data set that was used to illustrate model estimation. This data set contains 2,000 respondents and 20 items that measure a total of 4 attributes, with no item measuring more than 2 attributes. The data was generated from the loglinear cognitive diagnostic model (LCDM), which is a general model that subsumes many other DCM subtypes [@lcdm]. We'll fit both an LCDM and a deterministic-input, noisy "and" gate (DINA) model [@dina] to the data set so that we can compare their fit indices. Because the LCDM was used to generate our simulated data, we expect the estimated LCDM to perform well. On the other hand, the DINA model places heavy constraints on the LCDM, and therefore we expect worse performance from the DINA model. For details on model estimation, see Estimating diagnostic classification models.
library(tidyverse)

sim_data <- read_rds("data/simulated-data.rds")

lcdm <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix,
                  resp_id = "resp_id", item_id = "item_id",
                  type = "lcdm", method = "mcmc", backend = "cmdstanr",
                  iter_warmup = 1000, iter_sampling = 500,
                  chains = 4, parallel_chains = 4,
                  file = "fits/sim-lcdm")

dina <- measr_dcm(data = sim_data$data, qmatrix = sim_data$q_matrix,
                  resp_id = "resp_id", item_id = "item_id",
                  type = "dina", method = "mcmc", backend = "cmdstanr",
                  iter_warmup = 1000, iter_sampling = 500,
                  chains = 4, parallel_chains = 4,
                  file = "fits/sim-dina")
There are three major types of model evaluation supported by measr:

- Absolute model fit
- Relative model fit
- Reliability

Each of these is discussed in turn.
Absolute model fit measures how well the estimated model fits the observed data.
One of the more popular methods for evaluating absolute model fit for DCMs is the M2 statistic [@hansen2016; @liu2016].
We can calculate the M2 statistic with the `fit_m2()` function.
fit_m2(lcdm)
This function returns a data frame with the M2 statistic (`m2`), along with the degrees of freedom and p-value associated with the M2 (`df` and `pval`, respectively).
A p-value less than .05 would typically indicate poor model fit.
As expected in this example, the estimated LCDM shows adequate model fit to our data (p > .05).
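If you want to apply this criterion programmatically, you can store the results and compare the p-value to .05 directly. This is a minimal sketch that relies only on the `m2`, `df`, and `pval` columns described above.

# Quick programmatic check of absolute fit (sketch; columns as described above)
m2_res <- fit_m2(lcdm)
m2_res$pval > 0.05  # TRUE suggests adequate model-data fit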
The `fit_m2()` function also returns information on the RMSEA and SRMSR fit statistics.
The M2, RMSEA, and SRMSR are all considered limited-information fit indices.
For example, the M2 is based on univariate item statistics and the bivariate relationships between items.
Thus, it cannot capture aspects of model fit from higher-order relationships (e.g., triplets of items).
Because we used fully Bayesian estimation for our models, we can use the posterior distributions to provide another evaluation of model fit that incorporates more information. This method is known as a posterior predictive model check (PPMC). The general process for PPMCs is as follows:

1. Estimate the model and retain the posterior draws of the model parameters.
2. For each posterior draw, use the parameter values to generate a replicated data set.
3. Calculate a summary of each replicated data set, creating a posterior distribution of that summary.
4. Compare the value of the summary calculated from the observed data to the posterior distribution of the summary.
If the observed value falls within the posterior distribution of the summary, this is evidence that our estimated model is consistent with the observed data. On the other hand, a discrepancy between the observed value and the posterior distribution indicates an inconsistency between the model and the data.
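To make the comparison in the final step concrete, here is a purely illustrative sketch; `rep_stat` and `obs_stat` are hypothetical objects, not measr output.

# Hypothetical PPMC comparison: `rep_stat` holds the summary from each
# replicated data set; `obs_stat` is the same summary from the observed data.
rep_stat <- rnorm(2000, mean = 35, sd = 4)  # stand-in for posterior replicates
obs_stat <- 36
mean(rep_stat > obs_stat)  # posterior predictive p-value (ppp)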
The PPMCs can be used to evaluate both model- and item-level fit. At the model level, fit is evaluated by the expected raw score distribution, as described by @park2015 and @thompson2019. At the item level, we can evaluate fit by examining the expected conditional probabilities of members of each class providing a correct response [@sinharay2007; @thompson2019], as well as the expected odds ratio between each pair of items [@park2015; @sinharay2006].
With measr, PPMCs can be calculated with `fit_ppmc()`.
Using `fit_ppmc()`, we can specify which PPMCs to calculate. For example, here we estimate just the model-level raw score check by setting `item_fit = NULL`.
fit_ppmc(lcdm, model_fit = "raw_score", item_fit = NULL)
`fit_ppmc()` returns a list, where each element is a different PPMC. Here, we only specified one PPMC, so we only get the `raw_score` element back.
The `obs_chisq` is the raw score χ^2^ value from our observed data. We then see the mean of the posterior distribution for the χ^2^, quantiles of the posterior distribution, and the posterior predictive p-value (`ppp`).
The ppp is the proportion of posterior draws that are greater than the observed value.
Values close to 0 indicate poor fit.
Values close to 1 may indicate overfitting.
Because the LCDM was used to generate this data, it's not surprising that the ppp is approaching 1 for our estimated model, as the model is perfectly capturing the data generating process.
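Because `fit_ppmc()` returns a list, it can be convenient to store the results and pull out individual elements (a small sketch using the `raw_score` element described above):

# Store the PPMC results and extract the raw score check
ppmc_res <- fit_ppmc(lcdm, model_fit = "raw_score", item_fit = NULL)
ppmc_res$raw_score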
We can also specify the posterior quantiles that are returned. For example, we can calculate the item-level odds ratios and request quantiles that will result in a 90% credible interval.
fit_ppmc(lcdm, model_fit = NULL, item_fit = "odds_ratio", probs = c(0.05, 0.95))
We see similar output for the item-level indices: the observed odds ratio for each item pair (`obs_or`), the mean of the posterior distribution for each item pair (`ppmc_mean`), the quantiles of the posterior that we specified, and the `ppp`.
As with the raw score distribution, here the ppp represents the proportion of posterior draws where the odds ratio is greater than the observed value.
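For reference, the observed odds ratio for an item pair comes from the standard 2 × 2 cross-tabulation of scores on the two items. The function below only illustrates that formula and is not measr's internal implementation; `x` and `y` are hypothetical vectors of dichotomous (0/1) item scores.

# Illustrative odds ratio for one item pair (x and y are 0/1 score vectors)
item_odds_ratio <- function(x, y) {
  n11 <- sum(x == 1 & y == 1)  # both items answered correctly
  n00 <- sum(x == 0 & y == 0)  # both items answered incorrectly
  n10 <- sum(x == 1 & y == 0)  # first correct, second incorrect
  n01 <- sum(x == 0 & y == 1)  # first incorrect, second correct
  (n11 * n00) / (n10 * n01)
}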
Relative fit measures are used to compare multiple competing models. In this case, we have estimated an LCDM and a DINA model and want to determine which model performs better. We should first note that "better" does not necessarily mean "good." Evidence that one model performs better than another is not evidence of "good" fit in an absolute sense. Only absolute model fit indices can provide evidence of adequate fit to the data. However, if multiple models show adequate absolute fit, relative fit indices can be used, along with our understanding of the constructs, to select a preferred model.
Currently, the Stan ecosystem supports two information criteria through the loo package that can be used as relative fit indices: leave-one-out (LOO) cross validation with Pareto-smoothed importance sampling [@loo-waic; @psis] and the widely applicable information criterion (WAIC) described by @waic.
The information criteria can be calculated using the associated functions from the loo package (i.e., `loo()` and `waic()`).
Here, we calculate the LOO for both the LCDM and DINA models.
lcdm_loo <- loo(lcdm)
lcdm_loo

dina_loo <- loo(dina)
dina_loo
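The WAIC can be calculated in the same way (shown here for the LCDM only; the call mirrors the `loo()` usage above):

# WAIC for the LCDM
lcdm_waic <- waic(lcdm)
lcdm_waic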
In isolation, this output is not very useful, as these indices are meant to facilitate model comparisons.
We can conduct model comparisons using `loo_compare()`, which is used for comparing both LOO and WAIC estimates.
In the output, the model in the first row is the preferred model.
In subsequent rows, the `elpd_diff` column reports the difference in the information criterion (in this case the LOO) between the model in that row and the preferred model. The `se_diff` column is the standard error of the difference between that model and the preferred model.
loo_compare(list(lcdm = lcdm_loo, dina = dina_loo))
loo_comp <- loo_compare(list(lcdm = lcdm_loo, dina = dina_loo))
@bengio2004 have recommended a cutoff of 2.5 standard errors for identifying the preferred model.
For example, because the absolute value of the `elpd_diff` is greater than `r glue("{round(loo_comp[2, 'se_diff'], 1)} × 2.5 = {round(loo_comp[2, 'se_diff'] * 2.5, 1)}")`, we would conclude that the LCDM fits significantly better than the DINA model.
This is our expected outcome in this case, given the data generation process and the results of the absolute model fit analysis.
If the difference were less than 2.5 standard errors, we would conclude that the models fit equally well.
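This decision rule can also be applied programmatically to the comparison table (a sketch using the `loo_comp` object created above and the `elpd_diff` and `se_diff` columns):

# Is the difference larger than 2.5 standard errors?
abs(loo_comp[2, "elpd_diff"]) > 2.5 * loo_comp[2, "se_diff"]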
We can also evaluate DCMs through their reliability.
That is, it's important to understand the accuracy and consistency of the classifications that are made by the model.
For models estimated with measr, estimates of reliability can be calculated using `reliability()`.
reliability(lcdm)
There are several types of reliability evidence that are provided in the output.
For all indices reported, values can range from 0 to 1, where 1 represents perfect accuracy or consistency.
The first type of reliability in the output is pattern-level reliability, as described by @cui2012.
This reliability describes the accuracy (`p_a`) and consistency (`p_c`) of the classification of respondents into an overall profile of proficiency on the assessed skills.
For example, in a 3-attribute assessment there are 8 possible profiles: [0,0,0], [1,0,0], [0,1,0], [0,0,1], [1,1,0], [1,0,1], [0,1,1], [1,1,1].
One option for reporting results from a DCM-based assessment is to select the profile that is most likely for each respondent.
These pattern-level reliability metrics provide a measure of the accuracy and consistency for this type of classification.
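For illustration, the set of possible profiles is just every combination of 0s and 1s across the attributes. The 8 profiles listed above for a 3-attribute assessment can be enumerated in base R (the attribute names here are placeholders):

# All 2^3 = 8 possible proficiency profiles for 3 attributes
expand.grid(att1 = 0:1, att2 = 0:1, att3 = 0:1)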
On the other hand, rather than reporting results based on the overall most likely profile, we can assign proficiency for each individual attribute and build up an overall profile from the attribute-level decisions.
This type of reporting may or may not result in the same profile as the overall most likely profile.
Because in this scenario classifications are made at the attribute level, we need to examine the classification accuracy and consistency for each individual attribute.
For models estimated with measr, this is referred to as maximum a posteriori (MAP) reliability, because classifications are based on the most likely category for each attribute for each respondent.
The MAP values reported in the output are those described by @johnson2018.
The `acc` and `consist` variables in the `map_reliability$accuracy` and `map_reliability$consistency` tables, respectively, are the classification accuracy and consistency metrics described by @johnson2018.
In their paper, they also compared these metrics to other measures of agreement (e.g., Cohen's κ, Goodman and Kruskal's λ), which are also included in the output.
@johnson2018 also provide recommendations for threshold values for each metric that represent poor, fair, good, very good, or excellent reliability.
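If you want to work with these values directly, the tables can be pulled from the stored output of `reliability()` (a small sketch; the element names follow the output described above):

# Attribute-level classification accuracy and consistency tables
rel <- reliability(lcdm)
rel$map_reliability$accuracy
rel$map_reliability$consistency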
Finally, rather than reporting results as classifications (e.g., proficient/not proficient), results can also be reported simply as the probability that each respondent is proficient on each attribute.
Thus, rather than reporting the accuracy and consistency of a classification, we should report the precision of the reported probability.
This is referred to as expected a posteriori (EAP) reliability, because the probability represents the expected value of each attribute for each respondent.
The EAP values reported in the output (`eap_reliability`) are those described by @johnson2020 and @templin2013.
@templin2013 describe a reliability index based on a restrictive assumption of parallel forms (`rho_tb`).
@johnson2020 described a more generalized parallel forms reliability (`rho_pf`), along with additional biserial (`rho_bs`) and informational (`rho_i`) reliability indices.
Because proficiency probabilities are provided at the attribute level, the EAP reliability estimates are also provided for each attribute measured by the assessment.
As with the classification reliability indices, @johnson2020 provide recommendations for thresholds representing poor, fair, good, very good, and excellent reliability for all four EAP reliability indices.
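Like the classification tables above, the attribute-level EAP estimates can be pulled from the stored output of `reliability()` (a sketch using the `eap_reliability` element named above):

# Attribute-level EAP reliability indices (rho_pf, rho_bs, rho_i, rho_tb)
rel <- reliability(lcdm)
rel$eap_reliability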
If you have followed along with running the code in this vignette, you will have noticed that some of the model evaluations take a significant amount of computation time.
This means that repeating the calculations (e.g., if you didn't assign the output to an object, or you opened a new R session) can be very time consuming. To make your analysis more efficient, measr offers several functions that add the model evaluation metrics described in this vignette directly to the model object.
If you specified a `file` when the model was estimated, the updated model object with the model evaluation components will automatically be resaved, ensuring that you don't have to rerun computationally intensive tasks.
There are three functions for adding model evaluation components to a model object, which correspond to the three types of evaluation described in this vignette:
- `add_fit()`: Adds absolute model fit indices (i.e., M2, PPMCs)
- `add_criterion()`: Adds relative model fit indices (i.e., LOO, WAIC)
- `add_reliability()`: Adds reliability metrics

All three functions have several arguments in common:
- `x`: The model to add evaluation components to.
- `save`: Whether to resave the model object if `file` was specified when estimating the model. The default is `TRUE`.
- `overwrite`: Whether to overwrite existing evaluations. For example, if you attempt to add reliability metrics with `add_reliability()`, but those metrics have already been added, should the reliability metrics be recalculated and overwrite the existing metrics? The default is `FALSE`.

Additionally, all three functions have a `...` argument for passing additional arguments along to the relevant functions.
For example, if we want to add PPMC absolute model fit indices, we can specify the types of model- and item-level fit indices to calculate, just as we did when using `fit_ppmc()`.
lcdm <- add_fit(lcdm, method = "ppmc", model_fit = "raw_score", item_fit = "odds_ratio")
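The other two functions work the same way. As a sketch, relative fit indices and reliability metrics might be added as shown below; the `criterion` argument name is an assumption based on similar interfaces, so check `?add_criterion` for the exact usage.

# Assumed interface: add the LOO and the reliability metrics to the model object
lcdm <- add_criterion(lcdm, criterion = "loo")
lcdm <- add_reliability(lcdm)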
Once components have been added to the model, a helper function, `measr_extract()`, can be used to pull out the relevant pieces of output.
For example, we can extract the PPMC raw score results.
measr_extract(lcdm, what = "ppmc_raw_score")
In addition, we can also extract elements of the model, such as the priors that were used during estimation, or estimated parameters like the base rate of membership in each class.
measr_extract(lcdm, what = "prior")

measr_extract(lcdm, what = "strc_param")
For a complete list of what can be extracted with `measr_extract()`, see `?measr_extract`.