dexterMST: dexter for Multi-Stage Tests
In dexterMST: CML and Bayesian Calibration of Multistage Tests

wzxhzdk:0

wzxhzdk:1

__dexterMST__ is a new R package acting as a companion to __dexter__ [@dexter] and adding facilities to manage and analyze data from **multi-stage tests (MST)** as they are found in educational measurement [@yan2014overview]. The package includes functions for importing and managing test data, assessing and improving the quality of data through basic test and item analysis, and fitting an IRT model; all adapted to the peculiarities of MST designs. Its main contribution is in the analysis of the data. It offers, in particular, the possibility to calibrate item parameters from MST using either Conditional Maximum Likelihood (CML) estimation [@ZwitserMaris2015] or a Gibbs sampler for Bayesian inference [@koopsmst]. It does not try to cover everything: it has, for instance, no facilities for automatic test assembly.

## What does it do?

MST is probably the earliest historical attempt to achieve _adaptivity_ in testing. In a traditional, non-adaptive test, the items that will be given to the examinee are completely known before testing has started, and no items are added until it is over. In adaptive testing, the items asked are, at least to some degree, contingent on the responses given, so the exact contents of the test only become known at the end. @bejar2014 gives a nice overview of early attempts at adaptive testing in the 1950s. Other names for adaptive testing used in those days were tailored testing or response-contingent testing. Note that MST can be done without any computers at all, and that computer-assisted testing does not necessarily have to be adaptive.

When computers became ubiquitous, full-scale computerized adaptive testing (CAT) emerged as a realistic option. In CAT, the subject's ability is typically reevaluated after each item and the next item is selected out of a pool, based on the interim ability estimate. In MST, adaptivity is not so fine-grained: items are selected for administration not separately but in bunches, usually called _modules_.
In the first stage of an MST, all respondents take a _routing test_. In subsequent stages, the modules they are given depend on their success in previous modules: test takers with high scores are given more difficult modules, and those with low scores are given easier ones (see, e.g., @zenisky2009, @hendrickson2007ncme, or @yan2014overview).

To get closer to actual work with MST, it is convenient to represent the test design with a _tree diagram_. A very simple example is shown below:

wzxhzdk:2

The tree is read from top to bottom. The root represents the first stage where all examinees take the first module (the routing test). In the second stage, examinees with a score lower than or equal to 5 on the routing test take module 2, whereas examinees with a score higher than 5 on the routing test take module 3. Every path (from the routing test to the last module) corresponds to a _booklet_. In this MST, there are two booklets: the first one, booklet M1-M2, should be relatively easy while the other, M1-M3, is more difficult. Note that the thickness of a path indicates how many respondents took it.

Unlike CAT, MST is not all about algorithms. Most concepts, steps, and procedures known from linear testing and familiar from __dexter__ are still in place, but in a modified form: some are a bit more complicated, others quite a bit, a few become meaningless. We next review the basic workflow, which is similar to __dexter__ with a few important additions, and we discuss some of the differences in more detail.

## How do I use it?

There are not too many workflows to manage and analyze test data. In __dexter__, the most common procedure is, basically:

1. Start a project
2. Declare the scoring rules for all items
3. Input data
4. Examine data, assess quality with classical statistics, make necessary adjustments
5. Estimate item parameters
6. DIF, profile analysis ...
7. Estimate and analyze proficiencies

In __dexterMST__, we follow more or less the same path except that we must, between steps 2 and 3, communicate the MST structure to the program: the modules, the booklets, and the routing rules.

### Create an MST project

The first step in __dexterMST__ is to create a new project (actually, an empty database). In this example, the project is created in memory, and it will be lost when you close R. To create a permanent project, simply replace the string ":memory:" by a file name.

wzxhzdk:3

### Supply the scoring rules

Just like in __dexter__, the first really important step is to supply the _scoring rules_: an exhaustive list of all items that will appear in the test, all admissible responses, and the score that will be assigned to each response when grading the test. These must be given as a data.frame with three columns: `item_id`, `response` and `item_score`. The first two are strings, and the scores are integers, with 0 always the lowest possible score. If you have scored data, you can simply make the column `response` match `item_score`, as in the following example:

wzxhzdk:4

wzxhzdk:5

### Define the test design

In the simpler case of multi-booklet linear tests, __dexter__ is able to infer the test design from the scoring rules and the test data. With MST, we have to work some more and provide information on the _modules_ and the _routing rules_.

First, the __modules__. Create another data.frame with columns `module_id`, `item_id` and `item_position`:

wzxhzdk:6

Note that the items have been sorted by difficulty, which is why module 2 contains the first items.

__Routing rules__ specify the rules to pass from one module to the next. We have supplied a function, `mst_rules`, which lets you define the routing rules using a simple syntax. The following example defines two booklets (remember, booklets are paths) called "easy" and "hard".
The "easy" booklet consists of the routing test, here called `Mod_1`, and module `Mod_2`; it is given to examinees who scored between 0 and 5 on the routing test. Booklet "hard" consists of the routing test and module `Mod_3`, and is given to examinees who scored between 6 and 10 on the routing test. The command language is a simple transcription of the tree diagram in the illustration above: read `--+` as an arrow from left to right and `[0:5]` as a score range, here from zero up to and including five.

wzxhzdk:7

Having defined the two crucial elements of the design, the modules and the routing rules, use function `create_mst_test` to combine them and give the test a name, in this case `ZwitserMaris`:

wzxhzdk:8

Currently, we support two possible types of routing, `all` and `last`, with `all` the default. The difference lies in whether routing is based on the score obtained on the last module a person took (`last`), or on the total score over all modules taken so far (`all`). We discuss this in detail below.

### Enter test data

With the test defined, you can enter data. This can be done in two ways: booklet per booklet in wide form, or all booklets at once. The former is illustrated below; the latter works if the data is in long format (also called normalized or tidy, see [@Wickham2017]).

wzxhzdk:9

To enter the data in wide format, we call function `add_booklet_mst` twice, once for each booklet:

wzxhzdk:10

### Inspect and analyze the data

Before we attempt to fit IRT models to the data, it is common practice to compute and examine carefully some simpler statistics derived from Classical Test Theory (CTT). IRT and CTT have largely overlapping ideas of what constitutes a good item or a good test, derived from common substantive foundations. If, for example, we find that the scores on an item correlate negatively with the total scores on the test, this is a sign that something is seriously amiss with the item. The presence of such problematic items will decrease the value of Cronbach's alpha, and so on.
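To make the two CTT diagnostics just mentioned concrete, here is a minimal base R sketch. It does not use __dexter__ or __dexterMST__; the data are simulated from a Rasch model with arbitrary parameters, the helper names `item_rest_cor` and `cronbach_alpha` are our own, and we correlate each item with the rest score (total minus the item) rather than with the total score itself.

```r
# Hypothetical data: 500 persons, 10 dichotomous items, simulated from a
# Rasch model on a linear (non-adaptive) test.
set.seed(1)
n_persons <- 500
n_items   <- 10
theta <- rnorm(n_persons)                          # person abilities
beta  <- seq(-1.5, 1.5, length.out = n_items)      # item difficulties
p <- plogis(outer(theta, beta, "-"))               # P(correct) per person and item
x <- matrix(rbinom(length(p), 1, p), n_persons, n_items)

# Correlation of each item with the sum of the other items
item_rest_cor <- function(x) {
  total <- rowSums(x)
  sapply(seq_len(ncol(x)), function(i) cor(x[, i], total - x[, i]))
}

# Cronbach's alpha from item variances and the variance of the total score
cronbach_alpha <- function(x) {
  k <- ncol(x)
  k / (k - 1) * (1 - sum(apply(x, 2, var)) / var(rowSums(x)))
}

# Create a problematic item by reversing the scoring of item 1
x_bad <- x
x_bad[, 1] <- 1 - x_bad[, 1]

rir_good   <- item_rest_cor(x)
rir_bad    <- item_rest_cor(x_bad)
alpha_good <- cronbach_alpha(x)
alpha_bad  <- cronbach_alpha(x_bad)
```

In this sketch the reversed item gets a negative item-rest correlation, and alpha for `x_bad` is lower than for `x`, which is the pattern described above.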
Unfortunately, CTT statistics are all badly influenced by the score range restrictions and dependencies inherent in MST designs. Therefore, their usefulness is severely limited, except perhaps in the first module of a test. The good news is that the __interaction model__ [@IM], which we advocated in __dexter__ as a model-driven alternative to CTT, can be adapted to MST designs, making it possible to retrieve the item-regression curves conditional on the routing design. This is best appreciated in graphical form:

wzxhzdk:11

The plots are similar to those in __dexter__ except that some scores are ruled out due to the design of the test. The interaction model can only be fitted on one booklet at a time, but this includes the rather complicated MST booklets. A more detailed discussion of the item-total regressions may be found in __dexter__'s vignettes or on [our blog](https://dexterities.netlify.app/2018/02/25/item-total-regressions-in-dexter/).

### Estimate the IRT model

Similar to __dexter__, __dexterMST__ supports the **Extended Nominal Response Model (ENORM)** as the basic IRT model. To the user, this looks and feels like the Rasch model when items are dichotomous, and like the partial credit model otherwise. Fitting the model is as easy as can be:

wzxhzdk:12

wzxhzdk:13

What happens under the hood is _not_ simple, so we discuss it at some more length in a separate section below.

### DIF etc.

__dexterMST__ does include, as we write, generalizations of the exploratory test for DIF known from __dexter__ [@BechgerMarisDIF] and of profile analysis [@VerhelstPA]. We feel that these are a bit beyond fundamentals, and limit ourselves to an example. Let us add an invented item property using the `add_item_properties_mst` function and use `profile_tables_mst` to calculate the expected score on each item domain given the booklet score.

wzxhzdk:14

The following plot shows these expected domain-scores for each of the two booklets.
wzxhzdk:15

The plot shows the expected domain scores as lines and the average domain-scores found in the data as dots. For comparison, the dashed lines are the expected domain scores calculated using __dexter__. These are not correct because they ignore the design.

### Ability estimation

__dexterMST__ re-exports a number of __dexter__ functions that can work with the parameters object returned from `fit_enorm_mst`, notably `ability`, `ability_tables`, and `plausible_values`. In the example below, we use maximum likelihood estimation (MLE) to produce a score transformation table:

wzxhzdk:16

wzxhzdk:17

More examples are given below.

### Subsetting: using predicates

__dexter__ implements a flexible and general infrastructure for subsetting data via the `predicate` argument, which is available in many of the package functions. Predicates can use item properties, person covariates, booklet and item IDs, and other variables to filter the data that will be processed by the function. We have tried very hard to preserve this mechanism in __dexterMST__. For example, the same analysis as above but without item `item21` is done as follows:

wzxhzdk:18

However, because of the intricate dependencies in MST designs, subsetting is not trivial. We have provided some explanations in a separate section of this document.

This concludes our brief tour of a typical workflow with __dexterMST__. The rest of this document will examine in more detail CML estimation in __dexterMST__, how to specify designs with more than two stages, and some intricacies with the use of predicates. We conclude with a brief overview of the main differences between __dexter__ and __dexterMST__.

## CML estimation with MST

One should be careful not to apply __dexter__'s estimation routines in MST without thinking. Ordinary CML, in particular, is known to give biased results under these circumstances [@EggenVerhelst11; @glas1988].
Recently, @ZwitserMaris2015 demonstrated that CML estimation is possible, provided that one takes the design into account. Furthermore, they argued that sensible models, that is, those that admit CML, will in general fit quite well to MST data. The same theory was used to adapt __dexter__'s Bayesian estimation method to MST [@koopsmst]. In the three years since, the results of @ZwitserMaris2015 have not been mentioned in any of the recent edited volumes on MST, and __dexterMST__ is, to the best of our knowledge, the first publicly available attempt at a practical implementation.

MST data are usually analyzed with _marginal maximum likelihood (MML)_, which is available in a number of R packages, such as `mirt` and, lately, `dexterMML`. MML estimation makes assumptions about the shape of the ability distribution (usually a normal distribution is assumed), and it can produce unbiased estimates if these assumptions are fulfilled. CML, on the other hand, does not need any such assumptions at all, so it can be expected to perform well in a broader class of situations.

That MML gives biased results if the ability distribution is misspecified has been shown quite convincingly by @ZwitserMaris2015. We reproduce their example here without the code but note that the data have been simulated with an ability distribution that is not normal but a mixture of two normals. Here is a density plot.

wzxhzdk:19

Note that a distribution that is not normal is not, in any way, abnormal. In education, skewed or multi-modal distributions like this do occur as a result of many kinds of selection procedures.

Below are the estimated item difficulty parameters plotted against the true parameters:

wzxhzdk:20

The results illustrate the well-known fact that both naive CML and MML estimates can be severely biased. The MML estimates are biased not because of the MST design but because the population distribution was misspecified.
Note that the colors indicate whether the items occurred in module 1, module 2 or module 3. It will be clear that the modules are appropriately, albeit not perfectly, ordered in difficulty. Judging from the p-values, the (simulated) respondents would have been comfortable with the test.

wzxhzdk:21

wzxhzdk:22

Note how the __dexter__ function `tia_tables` was used. To wit, we first got the response data using `get_responses_mst` and used these as input. This detour was necessary because the databases of __dexter__ and __dexterMST__ are not directly compatible.

In the same way, we can use __dexter__'s `plausible_values` function to show that the original distribution of person abilities can be approximated quite well by the distribution of the plausible values:

wzxhzdk:23

Note that the output from `fit_enorm_mst` can be used in __dexter__ functions without further tricks. For further reading, we refer the reader to the __dexter__ vignette on plausible values.

## Predicates and MST

Response-contingent testing, to use the charming term from the past, introduces many intricate constraints and dependencies in the test design. As a result, not only the complicated techniques, but even such apparently trivial operations as removing items from the analysis or changing a scoring rule can become something of a minefield. Things are never as easy as in a linear test!

In CML calibration, it is essential that we know the routing rules so, to remove some items from analysis, one needs to infer the MST design without these items. What happens internally in __dexterMST__ is that a new MST is defined for each possible score over the items that are left out, with the routing rules that follow from the ones originally specified. Consider, for example, the design corresponding to the analysis with `item21` left out.

wzxhzdk:24

As one can see, the tree is _split_ to accommodate the examinees who answered this item correctly, and those who did not.
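The arithmetic behind such a split can be sketched in a few lines of base R. This is an illustration only and not __dexterMST__'s internal code: `split_rule` is a hypothetical helper, and we assume that the removed item is dichotomous, belongs to the routing module, and that the routing module has a maximum score of 10, as in the example design.

```r
# Hypothetical helper, not part of dexterMST: given an original routing range
# [lo, hi] on a module, return the ranges that apply to the remaining items
# for each possible score on the item that is left out.
split_rule <- function(lo, hi, item_max, module_max) {
  s <- 0:item_max                     # possible scores on the removed item
  rest_max <- module_max - item_max   # maximum score on the remaining items
  data.frame(
    removed_item_score = s,
    lo = pmax(lo - s, 0),             # remaining score cannot go below 0
    hi = pmin(hi - s, rest_max)       # nor above what the remaining items allow
  )
}

# Original rules: [0:5] routes to the easy module, [6:10] to the hard one
easy <- split_rule(lo = 0, hi = 5,  item_max = 1, module_max = 10)
hard <- split_rule(lo = 6, hi = 10, item_max = 1, module_max = 10)
```

Under these assumptions, an examinee who answered the removed item incorrectly is routed to the easy module with a score of 0 to 5 on the remaining items, whereas one who answered it correctly needs 0 to 4; for the hard module the ranges become 6 to 9 and 5 to 9. Each original booklet therefore turns into two.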
While more complex predicates are allowed, it will be clear that these may involve complicated bookkeeping which takes time and might slow down the analysis.

Some limitations remain, unfortunately. At the time of writing, it is not possible to change the scoring rules after test administration, except for items in the last modules of the test. Predicates that remove complete modules from a test, e.g. `module_id != 'Mod_1'`, will cause an error and should be avoided.

## Beyond two stages: 'all' and 'last' routing

So far, we have considered the simplest MST design with just two _stages_. __dexterMST__ can handle much more complex designs involving many stages and modules. As an example, we show a diagram corresponding to one of the MST tests used in the 2018 edition of Cito's *Adaptive Central End Test (ACET)*:

wzxhzdk:25

![One ACET test](Plot_RRA1.png)

With more than two stages, two different methods of routing become possible:

__Last__: Use the score on the present module only.
__All__: Use the score on the present and the previous modules.
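The difference between the two routing types is easy to see in a small base R sketch. The helper `routing_score` is hypothetical and not a __dexterMST__ function, and the three module scores are invented for illustration.

```r
# Hypothetical helper: the score that is compared against the routing rule
# after each module, under 'all' or 'last' routing.
routing_score <- function(module_scores, routing = c("all", "last")) {
  routing <- match.arg(routing)
  if (routing == "all") {
    cumsum(module_scores)   # present module plus all previous modules
  } else {
    module_scores           # present module only
  }
}

# Invented scores of one examinee on three consecutive modules
scores <- c(Mod_1 = 7, Mod_2 = 3, Mod_3 = 5)

score_all  <- routing_score(scores, "all")
score_last <- routing_score(scores, "last")
```

Under 'all' routing the decision after the second module is based on a score of 10, under 'last' routing on a score of 3, so the same routing ranges would mean very different things under the two types.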
__dexterMST__ fully supports both types of routing. We do require that a routing type applies to a complete test and is specified when the test is created, for example `create_mst_test(..., routing='last')`. In this vignette we used 'all', which is the default value.

It is worth noting that the ACET project includes both MST and linear tests. The linear tests are simply entered as a single-module MST with a trivial routing rule, e.g.:

wzxhzdk:26

ACET is a large project, although even larger projects exist, such as the PIAAC study [@oecd2013technical]. In total, the project database contains the responses of 97225 students to 3622 items spread over 169 tests and including six distinct MSTs. Could __dexterMST__ analyse the data? Sure! On a Windows 64-bit laptop with a 2.9 GHz processor, calibration took about 1.5 minutes.

wzxhzdk:27

## dexter vs. dexterMST: A summary for dexter users

__dexterMST__ is a companion to __dexter__. It loads __dexter__ automatically, and many of __dexter__'s functions can be used immediately, notably those for ability estimation. When that is not the case, there will be some kind of warning. The new functions relevant for MST have `mst` in their names. In addition, we have tried to keep the general logic and workflow as similar to __dexter__ as possible. Thus, experienced __dexter__ users should find __dexterMST__ easy to understand. Some of the most important differences are listed below.

* It is no longer possible to infer the test design automatically from the scoring rules and the data; the user must specify the test design explicitly through the modules and the routing rules before the response data can be input.
* In __dexterMST__ there are restrictions on altering the scoring rules. Specifically, it is not possible to change scoring rules for any items unless they only appear in the last stage of a test. Attempts to circumvent this restriction by changing the scoring rules midway will lead to wrong calibration results.
* __dexter__ can work with or without a project database. Due to the extra dimensions of the data in MST designs, __dexterMST__ requires a project database.
* The databases created by __dexter__ and __dexterMST__ are not compatible or exchangeable, but the parameter object that is the output from `fit_enorm_mst` can be used in __dexter__ functions like `plausible_values` and `ability`, as we have illustrated earlier. For your convenience, a function `import_from_dexter` is available that will import items, scoring rules, persons, test designs and responses from a __dexter__ database into the __dexterMST__ database.