maat: Multiple Administrations Adaptive Testing

library(maat)
library(knitr)
library(kableExtra)

Introduction

The maat package performs adaptive testing based on an assessment framework that involves multiple tests administered throughout the year, multiple vertically scaled item pools, and multiple phases within each test administration [@choi_maat_2022]. It allows for transitioning from one item pool with associated constraints to another, as determined necessary according to a prespecified transition policy, to enhance the quality of measurement. Based on an item pool and test blueprint constraints, each phase or module is constructed as a computerized adaptive test (CAT) assembled dynamically using the shadow-test approach to CAT [@van_der_linden_model_1998]. The current version of maat supports an assessment design involving three tests/administrations (e.g., Fall, Winter, and Spring) with two phases within each administration (Phase 1 and Phase 2), so that an examinee takes six modules in total throughout the year. The assessment framework is adaptive at multiple levels:

Assessment Structure

include_graphics("assessment.svg")

An assessment under the multiple administrations adaptive testing design has the following structure.

Assumptions

Several assumptions are made to support the multiple administrations adaptive testing design.

Module Assembly

A module is a fixed-length adaptive test constructed under the shadow-test framework. This section describes how each module is assembled.

Content Balancing Algorithm

The shadow-test approach to CAT [@van_der_linden_model_1998; @choi_ensuring_2018] was developed to balance the need for sequential and simultaneous optimization in constrained CAT. The shadow-test approach uses the methodology of mixed-integer programming (MIP) to simultaneously select a full set of items conforming to all content and other constraints and yet optimal (e.g., most informative).

Given the item pool and test specifications/constraints, the shadow-test approach to CAT assembles a full-length test form, called a shadow test, using MIP for each item position upon updating interim $\theta$ estimates, $\theta_{k}$, $k = 1,2,...,n$, where $n$ is the test length. The optimal test assembly engine uses MIP to construct shadow tests optimized for the interim $\theta$ estimates and conforming to all specifications and requirements, encompassing content constraints, exposure control, overlap control, and other practical constraints (e.g., enemy items, item types, etc.). The cycle is repeated until the intended test length $n$ has been reached.

The methods by which the shadow-test approach formulates test assembly problems as constrained combinatorial optimization have been documented in @van_der_linden_linear_2005 and implemented in the TestDesign package [@choi_testdesign_2021]. Refer to @choi_ensuring_2018 for more information about how the shadow-test approach creates an adaptive test as a sequence of optimally constructed full-length tests.

Item Selection Criteria

A standard procedure for choosing a shadow test (for a given examinee at a particular item position) among potentially an astronomical number of alternatives is to compare the objective values provided by the alternatives. The common objective function in its simplest form is:

$$ \text{maximize} \sum_{i\,=\,1}^{I} I_{i}(\hat{\theta})x_{i} $$

where $I_{i}(\hat{\theta})$ is the Fisher information for Item $i$ at an estimated $\theta$ value. It is also possible to add a random component to the objective function to reduce the overexposure of highly informative items for some or all item positions within a test. For example, the progressive method [@revuelta_comparison_1998] can be incorporated into the objective function so that at the beginning of the test the objective function combines a random component with item information, and as the test progresses the random component is reduced proportionately.

Upon constructing/updating a shadow test, a single item is selected to be administered. Selecting an item from a shadow test is typically done by selecting the most informative item in the shadow test that has not been administered, as

$$ \text{arg} \max_{i\,\in\,R} I_{i}(\hat{\theta}), $$

where $R$ indicates the set of items in the current shadow test that have not been administered to the examinee. When the test is composed of item sets (e.g., reading passages), selecting a passage should precede selecting an item, and the passage can be chosen based on the average item information within each passage. Once a passage is selected, typically multiple items are selected before moving on to another passage.
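The two selection steps above can be sketched in base R. This is only an illustration: the 2PL information function, the item parameters, and the shadow-test membership vector are assumptions made for the example, not maat or TestDesign internals.

```r
# Fisher information of a 2PL item at theta (model assumed for illustration)
info_2pl <- function(theta, a, b) {
  p <- 1 / (1 + exp(-a * (theta - b)))
  a^2 * p * (1 - p)
}

# Hypothetical pool of five items
a <- c(1.2, 0.8, 1.5, 1.0, 0.9)    # discrimination
b <- c(-0.5, 0.0, 0.3, 1.0, -1.2)  # difficulty

theta_hat    <- 0.2
in_shadow    <- c(TRUE, TRUE, TRUE, FALSE, TRUE)     # x_i of the current shadow test
administered <- c(TRUE, FALSE, FALSE, FALSE, FALSE)  # items already given

info <- info_2pl(theta_hat, a, b)

# Objective value of this shadow test: sum of I_i(theta_hat) * x_i
objective_value <- sum(info * in_shadow)

# R: items in the shadow test that have not been administered
candidates <- which(in_shadow & !administered)

# arg max over R of I_i(theta_hat)
next_item <- candidates[which.max(info[candidates])]
```

With these made-up parameters, Item 3 is selected because it is the most informative unadministered item in the shadow test. In a real run the shadow test itself would come from the MIP solver rather than a hand-written logical vector.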

How Passages Are Selected

In the MIP optimizer, passages are selected not directly but as a result of attempting to satisfy constraints. Given an item pool that has $I$ items, a discrete assembly problem (i.e., not passage-based) uses $I$ decision variables that represent each item in the pool. In a passage-based assembly that has $S$ available passages in the pool, $S$ more decision variables are added to existing $I$ decision variables. The nested structure between items and passages is provided to the solver through the use of constraints.

Using the same information maximization criterion presented above, a shadow-test that satisfies the criterion and the constraints is assembled/re-assembled for the administration of each item. From the shadow-test, the passage to be administered to the examinee is determined using the following process.

First, if the examinee is currently not within a passage, the passage that has the largest mean information at the current $\hat{\theta}$ is selected as the passage to be administered. The mean information is calculated from the shadow test. For example, suppose that Passage 1 consists of Items 1, 2, 3, 4, 5, and only Items 1, 2, 3 were included in the shadow test. In this case, the mean information of Passage 1 is computed from Items 1, 2, 3. After selecting a passage with the highest mean information, the item within the passage that has the largest information at the current $\hat{\theta}$ is administered to the examinee. This marks the passage as the currently active passage.
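A minimal sketch of this passage-selection step, assuming a hypothetical shadow test stored as a data frame (the item numbers, passage labels, and information values are invented for the example):

```r
# Hypothetical shadow test: only items included in the shadow test are listed,
# together with their passage and their information at the current theta estimate
shadow <- data.frame(
  item    = c(1, 2, 3, 6, 7, 8),
  passage = c(1, 1, 1, 2, 2, 2),
  info    = c(0.40, 0.55, 0.30, 0.60, 0.35, 0.20)
)

# Mean information per passage, computed from shadow-test items only
mean_info <- tapply(shadow$info, shadow$passage, mean)

# Passage with the largest mean information
selected_passage <- as.numeric(names(which.max(mean_info)))

# Within that passage, the item with the largest information is administered
in_passage    <- shadow[shadow$passage == selected_passage, ]
selected_item <- in_passage$item[which.max(in_passage$info)]
```

Here Passage 1 is selected even though Item 6 in Passage 2 has the single highest information, because the comparison between passages uses the mean over each passage's shadow-test items. Item 2 is then administered as the most informative item within Passage 1.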

For the next shadow test, the assembly engine forces the selection of previously administered items and passages, as well as the currently active passage that contains the administered item. In this step, a different combination of items may be selected in the shadow test for the currently active passage. For example, suppose that Passage 1 consists of Items 1, 2, 3, 4, 5, and the constraint is to administer 3 items for each passage. If Items 1, 2, 3 were selected previously and Item 1 was administered, it is possible that Items 1, 3, 5 will be selected in the current shadow test. Given this combination, either Item 3 or Item 5 will be administered to the examinee, depending on which item has the larger information.

Exposure Control

The maximum-information item-selection criterion causes overexposure of a small proportion of items while underexposing the rest. The shadow-test approach mitigates the problem by adding random item eligibility constraints to the test-assembly model so that items with higher exposure rates have higher probabilities of being temporarily ineligible. The TestDesign package implements the conditional item eligibility control method recently improved and extended [@van_der_linden_improving_2019]. For each examinee the TestDesign engine determines which items to constrain as temporarily ineligible from the item pool. The engine can also monitor the probabilities of ineligibility for all items conditional on different theta levels such that the exposure rates for all items in the pool are tightly controlled within and across different theta segments (intervals) and bound below a maximum exposure rate set a priori (e.g., $r^{\max}=0.25$).

More specifically, for each new examinee, prior to the administration of the test, the item eligibility control method conducts Bernoulli experiments (by theta segment) for the items in the pool to determine their eligibility for administration with probabilities updated as a function of the actual exposure rates of the items. For any items determined to be ineligible additional constraints are included in the test assembly model as follows:

$$ \sum_{i\,\in\,V_j}{x_i} = 0 $$

where $x_i$ is the binary decision variable for the selection of item $i$; and $V_j$ denotes the set of items determined to be ineligible for Examinee $j$.

The conditional item eligibility method monitors and updates the probabilities within a predetermined set of theta segments, e.g., $\theta_1 \in (-\infty,-1.5), \theta_2 \in [-1.5,-0.5), \dots , \theta_G \in [1.5, \infty)$. The conditional item-eligibility probabilities are represented as a recurrence relation as follows:

$$ \text{Pr}\{E_i | \theta\} \leq \frac{r^{\max}} {\text{Pr}\{A_i | \theta\}} \text{Pr}\{E_i | \theta\}, $$

where $\text{Pr}\{E_i | \theta\}$ is the conditional eligibility probability for item $i$ given $\theta \in \theta_g$; and $\text{Pr}\{A_i | \theta\}$ is the conditional exposure probability (rate) for the item. Theoretically, $\text{Pr}\{A_i | \theta\}$ can be updated continuously as each examinee finishes the test. Let $l = 1,2,\dots$ denote the sequence of examinees taking the test. The conditional item-eligibility probabilities can then be updated continuously as:

$$ \text{Pr}^{l+1}\{E_{i}|\theta\} = \min \bigg\{ \frac{r^{\max}} {\text{Pr}^{l}\{A_{i}|\theta\}} \text{Pr}^{l}\{E_{i}|\theta\}, 1 \bigg\} $$
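The update rule can be sketched for a single theta segment as follows. The function name update_eligibility() is hypothetical, and the handling of items with a zero exposure rate (they stay fully eligible) is an assumption added to avoid division by zero; it is not taken from TestDesign.

```r
# Update eligibility probabilities for one theta segment.
# p_eligible: current Pr{E_i | theta}; p_exposure: current Pr{A_i | theta}
update_eligibility <- function(p_eligible, p_exposure, r_max = 0.25) {
  updated <- pmin(r_max / p_exposure * p_eligible, 1)
  # Assumption: items that have never been exposed remain fully eligible
  ifelse(p_exposure > 0, updated, 1)
}

p_E <- c(1.0, 1.0, 0.8)     # hypothetical eligibility probabilities
p_A <- c(0.50, 0.10, 0.00)  # hypothetical exposure rates

new_p_E <- update_eligibility(p_E, p_A)

# Bernoulli experiment for a new examinee: TRUE means the item is eligible
set.seed(1)
eligible <- runif(length(new_p_E)) < new_p_E
```

The overexposed first item (exposure rate 0.50 against a target of 0.25) has its eligibility probability halved, while the other two items are capped at 1. Items drawn as ineligible would then enter the constraint that sets the sum of their decision variables to zero.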

However, in the context of a large number of concurrent test instances, updating the exposure counts in real time after each instance can be difficult and perhaps not necessary. One complication with the conditional item eligibility control method is that, as the test progresses, examinees may move in and out of segments and can be subject to different sets of eligible items, because they typically visit more than one theta segment over the course of a test administration. @van_der_linden_improving_2019 elaborate on the issue and provide a workaround. Unconditional exposure control is much more straightforward to implement and can be preferred in many testing situations. The TestDesign package implements the conditional item eligibility control method based on configurable $\theta$ segments. Defining one big segment of $\theta$ simplifies the method to the unconditional case.

Overlap Control

Overlap control might be needed to prevent or reduce the amount of intra-individual overlap in test content across administrations. The item eligibility control method can be used to make all items previously seen by the examinee ineligible for the current administration by imposing constraints similarly as

$$ \sum_{i\,\in\,s_{j}}{x_{i}} = 0, $$

where $s_j$ denotes the set of items Examinee $j$ has seen prior to the current administration. Imposing these hard constraints can unduly limit the item pool and potentially affect the quality of measurement. To avoid infeasibility and degradation of measurement, we can impose soft constraints in the form of a modification to the maximum information objective function as

$$ \text{maximize} \sum_{i\,=\,1}^{I} I_{i}(\theta) x_{i} \, - \, M \sum_{i\,\in\,s_{j}}{x_{i}}, $$

where $M$ is a penalty for selecting an item from $s_j$, the subset of items previously administered to Examinee $j$. This modification to the objective function can effectively deter the selection of previously administered items unless absolutely necessary for feasibility of the model.

Although the same item eligibility constraints for inter-individual exposure control can be used to control intra-individual item overlap, the mechanism for identifying ineligible items for the intra-individual overlap control is quite different. It requires tracking the examinee records across test administrations, which may be days, weeks, or months apart. As the number of administrations increases, the ineligible item set ($s_j$) can grow quickly and adversely affect the quality of measurement progressively. To prevent the ineligible item set from growing quickly, $s_j$ may need to be defined based only on the immediately preceding test administration. Another possibility is to let the penalty $M$ be subject to exponential decay over test administrations:

$$ M\cdot e^{-\lambda t}, $$

where $\lambda$ is a disintegration (decay) constant; and $t$ is a time interval in some unit.
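As a small numeric illustration of the decaying penalty, the sketch below evaluates $M\cdot e^{-\lambda t}$ for a few time intervals. The function name penalty_decay(), the choice $\lambda = \log 2$, and the item information value are assumptions made for the example.

```r
# Penalty remaining after a time interval t: M * exp(-lambda * t)
penalty_decay <- function(M, lambda, t) {
  M * exp(-lambda * t)
}

M      <- 100     # initial penalty
lambda <- log(2)  # hypothetical constant: the penalty halves per unit of time
t      <- 0:3     # time elapsed since the item was seen, in administrations

penalty <- penalty_decay(M, lambda, t)  # 100, 50, 25, 12.5

# Objective coefficient of a previously seen item with information 0.6
# (hypothetical value): I_i(theta) - penalty
penalized_coef <- 0.6 - penalty
```

With this choice of $\lambda$, an item seen three administrations ago is penalized by 12.5 rather than 100, so it becomes progressively easier for the solver to reuse older items when the pool is tight.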

The maat package uses hard constraints to perform overlap control. Three options are available:

Stopping Rule

The stopping rule describes the criteria used to terminate a CAT. The stopping rule is based on the number of overall required points and the total number of items denoted in the constraint file.

Ability Estimation

The maat package supports expected a posteriori (EAP), maximum likelihood estimation (MLE) and maximum likelihood estimation with fence (MLEF) available in the TestDesign package for $\theta$ estimation. The estimation method must be specified in createShadowTestConfig().

The MLE and MLEF methods in TestDesign have extra fallback routines for performing CAT:

In a maat() simulation, two types of ability estimates are obtained after completing each module.

In each module (except for the very first), the initial estimate that is in place before administering the first item is the final routing estimate from the previous module. The initial estimates can be manually specified for each examinee and for each module by supplying a list to the initial_theta_list argument in maat(). The list must be accessible using initial_theta_list[[module_index]][[examinee_id]]. In the example assessment structure in this document, module_index ranges from 1 to 6. The value of examinee_id is a string that is used in the examinee_list object.
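A nested list with the required access pattern can be built in base R as sketched below. The examinee ID strings are hypothetical placeholders; in practice they would need to match the IDs used in the examinee_list object.

```r
# Hypothetical examinee IDs (placeholders, not produced by simExaminees())
examinee_ids <- c("examinee_1", "examinee_2", "examinee_3")
n_modules    <- 6  # 3 tests x 2 phases

# Outer list indexed by module, inner list named by examinee ID
initial_theta_list <- lapply(seq_len(n_modules), function(module_index) {
  thetas <- as.list(rep(0, length(examinee_ids)))  # start everyone at theta = 0
  names(thetas) <- examinee_ids
  thetas
})

# Override the starting value for one examinee in the first module
initial_theta_list[[1]][["examinee_2"]] <- -0.5

# Access pattern: initial_theta_list[[module_index]][[examinee_id]]
initial_theta_list[[1]][["examinee_2"]]
```

The outer index is numeric and the inner index is a character string, which is why the inner elements are named rather than positional.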

Routing Policy

Transitions between phases and between tests are governed by the rules described in this section. These so-called transition rules are generally based on theta estimates (and confidence intervals) and the cut scores defining the performance levels for each grade. There are also restrictions that override the general rules. Two routing rules are implemented in the maat package: Confidence Interval Approach and Difficulty Percentile Approach.

Cut Scores

The cut scores for achievement levels must be defined to be able to perform routing between grades. For example, if there are four achievement levels (e.g., Beginning, Developing, Proficient, and Advanced), then three cut scores are necessary for each grade.
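To make the relation between cut scores and achievement levels concrete, the sketch below classifies a few theta values against three cut scores. The cut scores are the grade 4 values from the example cut score list in this document; the theta values, and the convention that a theta exactly equal to a cut score falls into the higher level, are assumptions made for the example.

```r
# Three cut scores separate four achievement levels
cut_scores_G4 <- c(-1.07, -0.15, 0.88)
levels_G4     <- c("Beginning", "Developing", "Proficient", "Advanced")

theta_hat <- c(-2.0, -0.5, 0.3, 1.5)  # hypothetical theta estimates

# findInterval() counts how many cut scores are at or below each theta
achievement <- levels_G4[findInterval(theta_hat, cut_scores_G4) + 1]
```

Each of the four hypothetical examinees lands in a different level, from Beginning for the lowest theta to Advanced for the highest.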

Routing Structure

Routing is performed between modules and also between tests. For example, routing is performed between Fall Phase 1 and Fall Phase 2, and also between Fall Phase 2 and Winter Phase 1. Because an examinee takes 6 modules in total, routing is performed 5 times for the examinee throughout the entire assessment.

The routing structure is now described. Let $G$ denote the grade of record of an examinee.

Routing Structure Diagram

The following diagrams visually summarize the permissible routing paths between modules and tests. The paths highlighted in red are due to the restrictions described above.

Test 1 to 2

include_graphics("routing_T1T2.svg")

Test 2 to 3

include_graphics("routing_T2T3.svg")

Test 3

include_graphics("routing_T3.svg")

Confidence Interval Routing

The examinee is routed based on the performance in each phase. The performance is quantified not as a point estimate of $\theta$, but as a confidence interval. The confidence interval approach [@kingsbury_comparison_1983; @eggen_computerized_2000] can be used with MLE scoring [@yang_effects_2006] and can be easily extended to multiple cut scores [@thompson_practitioners_2007].

In the confidence interval approach, the lower and upper bounds of the routing theta are computed as

$$\hat{\theta_{L}} = \hat{\theta} - z_{\alpha} \cdot SE(\theta),$$ and

$$\hat{\theta_{U}} = \hat{\theta} + z_{\alpha} \cdot SE(\theta),$$

where $z_{\alpha}$ is the normal deviate corresponding to a $1 - \alpha$ confidence interval, $\hat{\theta}$ is the routing theta, and $\hat{\theta_{L}}$ and $\hat{\theta_{U}}$ are lower and upper boundary theta values.
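The bounds can be computed in base R as sketched below. The function name ci_bounds() is hypothetical, and the sketch assumes a two-sided interval, so that the normal deviate is obtained as qnorm(1 - alpha / 2); whether maat uses exactly this convention is not confirmed here.

```r
# Lower and upper bounds of the routing theta
ci_bounds <- function(theta_hat, se, alpha = 0.05) {
  z <- qnorm(1 - alpha / 2)  # assumption: two-sided 1 - alpha interval
  c(lower = theta_hat - z * se, upper = theta_hat + z * se)
}

# Hypothetical routing theta and standard error
bounds <- ci_bounds(theta_hat = 0.40, se = 0.30, alpha = 0.05)

# Example check against a hypothetical cut score of 0.88:
# the whole interval lies below the cut if the upper bound is below it
below_cut <- bounds[["upper"]] < 0.88
```

A wider interval (larger standard error or smaller alpha) makes it harder for the whole interval to fall on one side of a cut score, which makes routing decisions more conservative.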

Once boundary values are calculated, $\hat{\theta_{L}}$ and $\hat{\theta_{U}}$ are used to identify the achievement level of the examinee:

Difficulty Percentile Routing

In difficulty percentile routing, prespecified cut scores are ignored. Instead, cut scores are determined based on item difficulty parameters of the current item pool for the module.
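One way to picture this is to take percentiles of the difficulty parameters in the pool, as in the sketch below. The difficulty values are invented, the 5th and 95th percentiles mirror the transition_percentile_lower and transition_percentile_upper values used later in this document, and the use of quantile() with its default interpolation is an assumption rather than a description of maat internals.

```r
# Hypothetical item difficulty parameters of the current module's item pool
b <- c(-1.8, -1.2, -0.7, -0.3, 0.0, 0.2, 0.6, 1.1, 1.5, 2.1)

# Lower and upper cut scores as percentiles of the difficulty distribution
cuts <- quantile(b, probs = c(0.05, 0.95), names = FALSE)

# Locate a hypothetical routing theta relative to the two cut scores
theta_hat <- 1.9
region <- c("below", "within", "above")[findInterval(theta_hat, cuts) + 1]
```

Because the cut scores come from the pool itself, a harder pool shifts both cut scores upward, so the same routing theta can fall in a different region depending on the pool currently in use.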

Once cut scores are calculated, the routing theta $\hat{\theta}$ is used to identify the achievement level of the examinee as:

Using the package

This section explains how to use the maat package.

Create Assessment Structure

The first step is to define an assessment structure using the createAssessmentStructure() function. In what follows, we specify 3 tests with 2 phases in each test. Route limits are set to 1 below and 2 above to match the assessment structure diagram shown above. That is, for examinees in grade $G$, routing is limited to item pools between $G-1$ and $G+2$.

assessment_structure <- createAssessmentStructure(
  n_test  = 3,
  n_phase = 2,
  route_limit_below = 1,
  route_limit_above = 2
)

Create an examinee list

The next step is to create an examinee list using simExaminees(). An example is given below:

cor_v <- matrix(.8, 3, 3)
diag(cor_v) <- 1

set.seed(1)
examinee_list <- simExaminees(
  N = 10,
  mean_v = c(0, 0.5, 1.0),
  sd_v   = c(1, 1, 1),
  cor_v  = cor_v,
  assessment_structure = assessment_structure,
  initial_grade = "G4",
  initial_phase = "P1",
  initial_test  = "T1"
)

For each examinee, we simulate three true theta values, one for each test administration. In the example above, the true theta values are drawn from a multivariate normal distribution with means of $0$, $0.5$, and $1.0$ across the three administrations, specified by a variance-covariance matrix in which all covariances between thetas are set to $0.8$ and all variances are set to $1.0$.
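The relation between the standard deviations, the correlation matrix, and the resulting variance-covariance matrix can be checked in base R. The sketch below also draws values with a Cholesky factor to show what such a multivariate normal draw looks like; it is an illustration only and is not claimed to reproduce how simExaminees() generates its values.

```r
mean_v <- c(0, 0.5, 1.0)
sd_v   <- c(1, 1, 1)
cor_v  <- matrix(0.8, 3, 3)
diag(cor_v) <- 1

# Variance-covariance matrix: D %*% R %*% D, with D = diag(sd_v)
sigma <- diag(sd_v) %*% cor_v %*% diag(sd_v)

# Draw N examinees x 3 administrations from N(mean_v, sigma)
set.seed(1)
N <- 10
z <- matrix(rnorm(N * 3), N, 3)            # independent standard normals
U <- chol(sigma)                           # upper triangular, t(U) %*% U = sigma
true_theta <- sweep(z %*% U, 2, mean_v, "+")
```

Because all standard deviations are 1, the covariance matrix equals the correlation matrix, which is why the covariances are 0.8 and the variances are 1.0.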

Each argument of simExaminees() is defined as follows:

Load Module Specification Sheet

The next step is to load the module specification sheet using loadModules(). The maat package allows for using different item pools and constraints across different stages of testing. This requires a module specification sheet that contains which item pools and constraints are used for each grade, test, and phase. An example module specification sheet is displayed below:

fn <- system.file("extdata", "module_definition_MATH_normal_N500_flexible.csv", package = "maat")
d  <- read.csv(fn)
kable_styling(kable(d))

The sheet must have seven required columns; an eighth column is optional.

  1. Grade The grade level. This must be in the form of G?, where ? is a number.
  2. Test The test level. This must be in the form of T?, where ? is a number.
  3. Phase The phase level. This must be in the form of P?, where ? is a number.
  4. Module The module ID string.
  5. Constraints The file path of constraints data. This must be readable by loadConstraints() in the TestDesign package.
  6. ItemPool The file path of item pool data. This must be readable by loadItemPool() in the TestDesign package.
  7. ItemAttrib The file path of item attributes data. This must be readable by loadItemAttrib() in the TestDesign package.
  8. PassageAttrib (Optional) The file path of passage attributes data. This must be readable by loadStAttrib() in the TestDesign package.

Load the module specification sheet using loadModules().

fn <- system.file("extdata", "module_definition_MATH_normal_N500_flexible.csv", package = "maat")
module_list <- loadModules(
  fn = fn,
  base_path = system.file(package = "maat"),
  assessment_structure = assessment_structure,
  examinee_list = examinee_list
)

Load Cut Scores

Cut scores must be stored in a list object. For each grade, at least two cut scores must exist. When a grade has more than two cut scores, only the first and the last entries are used. An example is given below:

cut_scores <- list(
  G3 = c(-1.47, -0.55, 0.48),
  G4 = c(-1.07, -0.15, 0.88),
  G5 = c(-0.67,  0.25, 1.28),
  G6 = c(-0.27,  0.65, 1.68),
  G7 = c( 0.13,  1.05, 2.08),
  G8 = c( 0.53,  1.45, 2.48)
)
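Since only the first and the last entries are used when a grade has more than two cut scores, the values that actually take effect can be inspected with a short base R sketch (shown here for two of the grades; the helper logic is an illustration, not a maat function):

```r
# Two grades from the example above
cut_scores_example <- list(
  G3 = c(-1.47, -0.55, 0.48),
  G4 = c(-1.07, -0.15, 0.88)
)

# Keep only the first and the last cut score of each grade
effective_cuts <- lapply(cut_scores_example, function(x) x[c(1, length(x))])

effective_cuts$G4
```

For grade 4, this leaves -1.07 and 0.88 as the two effective cut scores, with the middle value -0.15 unused.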

Create Shared Config

The next step is to create a config object using createShadowTestConfig() in the TestDesign package to set various shadow test configuration options. For example, the final theta estimation method in final_theta$method can be set to EAP, MLE or MLEF.

The exclude policy in exclude_policy$method must be set to SOFT to use the Big M method discussed in the Overlap Control section above. The value in exclude_policy$M is the penalty value.

library(TestDesign)
config <- createShadowTestConfig(
  interim_theta = list(method = "MLE"),
  final_theta = list(method = "MLE"),
  exclude_policy = list(method = "SOFT", M = 100)
)

Run the Main Simulation

The final step is to run the main simulation using maat().

set.seed(1)
maat_output_CI <- maat(
  examinee_list          = examinee_list,
  assessment_structure   = assessment_structure,
  module_list            = module_list,
  config                 = config,
  cut_scores             = cut_scores,
  overlap_control_policy = "within_test",
  transition_policy      = "CI",
  combine_policy         = "conditional",
  transition_CI_alpha    = 0.05
)

set.seed(1)
maat_output_difficulty <- maat(
  examinee_list          = examinee_list,
  assessment_structure   = assessment_structure,
  module_list            = module_list,
  config                 = config,
  cut_scores             = cut_scores,
  overlap_control_policy = "within_test",
  transition_policy      = "pool_difficulty_percentile",
  combine_policy         = "conditional",
  transition_CI_alpha         = 0.05,
  transition_percentile_lower = 0.05,
  transition_percentile_upper = 0.95
)

Plot the Module Routes

Route Diagram for the CI Transition Policy

plot(maat_output_CI, type = "route")

Route Diagram for the pool_difficulty_percentile Transition Policy

plot(maat_output_difficulty, type = "route")

Scatterplot

Scatterplot for the CI Transition Policy

plot(
  x           = maat_output_CI,
  type        = "correlation",
  theta_range = c(-4, 4),
  main        = c("Fall", "Winter", "Spring"))

Scatterplot for the pool_difficulty_percentile Transition Policy

plot(
  x           = maat_output_difficulty,
  type        = "correlation",
  theta_range = c(-4, 4),
  main        = c("Fall", "Winter", "Spring"))

Audit plot

plot(
  x = maat_output_CI,
  type = "audit",
  examinee_id = 1
)

References




maat documentation built on May 18, 2022, 9:07 a.m.