knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(CPTtools)

DiBello Framework

In educational settings, conditional probability tables (CPTs) are generally monotonic: the more of a skill a student possesses, the more likely the student is to do well on a task that uses that skill. The CPTtools package contains a framework for building conditional probability tables for use in Bayes net models that satisfy this monotonicity constraint. These are generally called DiBello models after a suggestion by Lou DiBello (Almond et al., 2001; Almond et al., 2015).

DiBello's idea was that each observable outcome variable in an educational setting corresponds to a direction in latent space, which he called the effective theta (in item response theory, IRT, $\theta$ is commonly used to represent the ability being measured). Each configuration of the parent variables (in educational settings, often representing the skills the student is thought to possess) is mapped to a point on this effective theta dimension. Then standard models from IRT (e.g., the graded response and generalized partial credit models) can be used to calculate the conditional probabilities that go into the table.

The general procedure has three steps:

  1. Map the states of the parent variables to points on the real number line. In particular, let $\tilde\theta_{km}$ be the number associated with State $m$ of Parent $k$. Let $\tilde\theta_{i'}$ be the vector of such values that corresponds to one row of the table.

  2. Combine the parent effective thetas using a combination function, $Z_{js}(\tilde\theta_{i'})$. This yields an effective theta for each cell of the conditional probability table.

  3. Apply a link function, $g(\cdot)$, to go from the effective thetas to the conditional probabilities.

As R is a functional language, the combination function and link function are passed as arguments to the key functions, calcDPCTable() and calcDPCFrame(). They can be passed either by name (as a character string) or as the function object itself. The CPTtools package supplies the most commonly used combination and link functions, but others are possible.
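
For example, here is a minimal sketch of building a CPT for a binary observable with two parent skills, assuming the calcDPCFrame() arguments skillLevels, obsLevels, lnAlphas, betas, rules, and link; the variable and state names (S1, S2, "Right", "Wrong") and the parameter values are illustrative placeholders.

skill1 <- c("High","Medium","Low")
skill2 <- c("Master","Non-master")
## Compensatory rule and partialCredit link passed by name; the function
## objects themselves could be passed instead.
calcDPCFrame(list(S1=skill1, S2=skill2), c("Right","Wrong"),
             lnAlphas=log(c(S1=1, S2=1)), betas=0,
             rules="Compensatory", link="partialCredit")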

Each step is described in more detail below.

Effective Thetas

In item response theory (IRT), the scale of the latent dimension, $\theta$, is not identified. Commonly, to identify the scale, psychometricians assume that $\theta$ has a unit normal distribution in the target population. Thus a person who is at 0 on the theta scale is at the median of the population, and a person who is at 1 is better than about 5/6 of the people in the target population on the target skill.
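
A quick check of that interpretation using the standard normal cumulative distribution function:

round(pnorm(c(0, 1)), 3)  ## 0.5 at the median; about 0.84 (roughly 5/6) at theta = 1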

Almond et al. (2015) suggest using equally spaced quantiles of the normal distribution for the effective thetas. The function effectiveThetas() does this. It takes a single argument, the number of states, and returns a vector of effective thetas.

round(effectiveThetas(2),3)
round(effectiveThetas(3),3)
round(effectiveThetas(4),3)
round(effectiveThetas(5),3)

The effective theta values are passed to the calcDPCTable() function via the tvals argument. This should be a list of vectors of effective thetas, one vector for each parent variable. The default value simply applies the effectiveThetas() function to the number of states of each parent variable.
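
For example, the default tvals for a three-state and a two-state parent can be reconstructed explicitly as follows (a sketch mirroring the default described above; the names S1 and S2 are illustrative):

skillLevels <- list(S1=c("High","Medium","Low"), S2=c("Master","Non-master"))
## One vector of effective thetas per parent, one value per parent state.
lapply(skillLevels, function(sl) effectiveThetas(length(sl)))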

The function eThetaFrame(), although designed to test and illustrate combination functions, is useful for understanding effective thetas. Here the Compensatory combination function is set up to take the average of the parent values.

skill1 <- c("High","Medium","Low")
skill2 <- c("Master","Non-master")
eThetaFrame(list(S1=skill1,S2=skill2), log(c(S1=1,S2=1)), 0,"Compensatory") 

The last column gives the effective theta for the row. This is the number that corresponds to the ability of a person who has the skills marked in the row to complete the target task. For example, a person who is "High" on Skill 1 and has "Mastered" Skill 2 is 1.16 standard deviations above average in the ability to achieve a good outcome on the specified observable.

Combination Functions

Lou DiBello suggested that different observables could be summarized in different ways using a combination function (also called a structure function or rule), $Z_j(\cdot)$. The functional form can differ from variable to variable, with pedagogical experts free to choose a structure function depending on how they think a student would approach the task.

The originally proposed structure functions were the compensatory, conjunctive, disjunctive and inhibitor rules; each is described below.

The general signature of a structure function is Z(theta, alphas, beta), where theta is a matrix of effective theta values (see above), alphas is a (collection of) slope parameter(s), and beta is a (collection of) difficulty (negative intercept) parameter(s). The output should be a vector of effective theta values corresponding to the rows of theta. See the function eThetaFrame() for examples. The Compensatory() structure function is basically the linear predictor of a generalized linear model, so it is the basis for understanding the other combination functions. Conjunctive() and Disjunctive() are variants of that idea. OffsetConjunctive() and OffsetDisjunctive() are refinements which use a different arrangement of the alphas and betas.

Compensatory

Let $\tilde{\theta_k}$ be the effective theta associated with the $k$th parent variable for a particular individual (or row in the CPT), and let $\alpha_k$ be the discrimination parameter associated with the $k$th parent variable, and let $\beta$ be a difficulty parameter. Then the combined effective theta is given as

$$ \frac{1}{\sqrt{K}} \sum_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$ This is essentially a generalized linear model (in the case of a binary outcome, a logistic regression). The $1/\sqrt{K}$ term is a variance stabilization term: it ensures that the variance of the linear predictor is related to the average of the discriminations instead of growing as the number of parent variables increases.

This is the basic combination function, and probably the easiest to explain, as intuition from regression works well here. A discrimination value of 1 corresponds to average importance; higher values mean that the skill is more important and lower values mean that the skill is less important. For educational models, it is customary to restrict the discriminations to be positive (this identifies the direction of the latent scale), although negative discriminations might make sense if the parent variables represent attitudes or other psychological traits or states. For that reason, log discrimination parameters are often used instead of discriminations. On the log scale, a log discrimination of 0 corresponds to average importance.

Note that the difficulty is the negative of the intercept. Its value is related to the probability that a person who is average on all of the input skills has of answering the question correctly. This is roughly on an inverse normal scale, so a difficulty of 0 corresponds to a 50-50 chance of solving the problem (or obtaining that level of the outcome).

The psychological intuition is that the parent variables represent skills which complement, and can to a certain degree substitute for, each other in solving the problem. For example, consider a physics problem which can be solved either by working through the force vectors and Newton's laws of motion, or by writing down the energy equations and solving them. Students who are comfortable with both techniques have an even better chance of success because they can solve the problem with one technique and use the other to check their work.

The function eThetaFrame() is useful for inspecting and testing combination functions. The example below shows a typical use of the Compensatory combination function. Note that in each case, the combined value is a weighted combination of the two inputs.

skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Compensatory")
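
The combination functions can also be called directly with a matrix of effective theta values, one column per parent and one row per configuration. The following minimal sketch applies the compensatory formula above directly (the matrix construction and parameter values are illustrative):

thetas <- as.matrix(expand.grid(S1=effectiveThetas(3), S2=effectiveThetas(3)))
## Each row equals (1.25*S1 + 0.75*S2)/sqrt(2) - 0.33.
Compensatory(thetas, c(S1=1.25, S2=0.75), 0.33)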

The term difficulty, as used for the negative intercept parameter, has a slightly different meaning from the lay definition of difficulty. In the lay sense, a task is difficult if a typical member of the population ($\theta=0$) has a low probability of success. The difficulty parameter instead determines the ability level (along the effective theta dimension) at which the probability of success is 50/50. Thus, it really determines the demand for the skill (or skill combination). The lay difficulty is determined by a combination of the difficulty and the discrimination.

Conjunctive, Disjunctive

To get the conjunctive and disjunctive models, replace the sum in the equation above with a minimum or maximum. Thus the Conjunctive() function is: $$ \min_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ ,$$ and the Disjunctive() function is: $$ \max_{k=1}^K \alpha_k \tilde{\theta_k} - \beta \ .$$ The variance stabilization term is dropped, as the min and max functions do not increase the variance as the number of parents increases.

The psychological justification is that in the conjunctive model all skills are necessary, so the weakest skill drives the performance. The disjunctive model corresponds to alternative solution paths: if students know which of their skills are strongest, then those skills should dominate the performance.

Again, the function eThetaFrame() is used to illustrate the combination functions. The examples below show typical uses of the conjunctive and disjunctive functions. Note that in each case, the combined value is a weighted minimum or maximum of the two inputs.

skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Conjunctive")
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), c(S1=1.25,S2=.75), 0.33, "Disjunctive")
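
The conjunctive and disjunctive rules can likewise be checked directly against their formulas; a minimal sketch with illustrative values:

thetas <- as.matrix(expand.grid(S1=effectiveThetas(3), S2=effectiveThetas(3)))
## Row-wise min(1.25*S1, 0.75*S2) - 0.33 and max(1.25*S1, 0.75*S2) - 0.33.
Conjunctive(thetas, c(S1=1.25, S2=0.75), 0.33)
Disjunctive(thetas, c(S1=1.25, S2=0.75), 0.33)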

OffsetConjunctive, OffsetDisjunctive

The interpretation of the discrimination parameters in the conjunctive model is not realistic. Consider a mathematical word problem and a model with two skills: mathematical manipulation and mathematical language. Typically, the demands on the two will be different; for example, the demand on mathematical language might be minimal, while the demand on mathematical manipulation might be moderate. Thus, it seems natural to have a separate difficulty parameter for each skill.

The OffsetConjunctive and OffsetDisjunctive models use one difficulty parameter for each parent variable. To reduce the overall number of parameters, only a single common discrimination parameter is used. This parameterization is much more natural because the discrimination parameter is often related to construct-irrelevant sources of variability which affect all skills equally.

The new equations are: $$ \alpha \min_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$ for OffsetConjunctive() and $$ \alpha \max_{k=1}^K (\tilde{\theta_k} - \beta_k) \ ,$$ for OffsetDisjunctive(). Note that the signatures of the OffsetConjunctive() and Conjunctive() functions are the same, but the former expects beta to be a vector and alphas a scalar, while the reverse is true for the latter.

skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25),
            "OffsetConjunctive")
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill,S2=skill), 1.0, c(S1=0.25,S2=-0.25),
            "OffsetDisjunctive")

Inhibitor

The Almond et al. (2001) paper (see also Almond et al., 2015) also included a special asymmetric combination function called the inhibitor. Once again consider a mathematical word problem written in English. Here knowledge of English is an inhibitor skill: a certain minimal amount of English is needed to understand the goals of the question. Once that threshold is met, the other (mathematical) skills determine the probability of success. If the English language comprehension threshold is not met, then the probability of success will be low (guessing).

This can be expressed mathematically as:

$$ \begin{cases} \beta_0 & \mbox{if } \tilde{\theta_1} < \tilde{\theta_1}^* \\ \alpha_2 \tilde{\theta_2} - \beta_2 & \mbox{if } \tilde{\theta_1} \ge \tilde{\theta_1}^* \end{cases}\ .$$

No Inhibitor() function was included in CPTtools because of the difficulty in generalizing this formula. First, the threshold parameter, $\tilde{\theta_1}^*$, does not fit naturally into either alphas or beta, so the signature of the function would not match. Second, the inhibitor model does not generalize when there are more than two parent variables: another combination rule would be needed to collapse the remaining dimensions onto a single dimension.

This is a good place to remark on the extensibility of the combination functions in the Discrete Partial Credit framework. The various functions aligned with the framework (e.g., eThetaFrame(), calcDPCFrame(), and mapDPC()) accept a function (or a character value giving the name of a function) which does the combination. This function should have three formal parameters: theta, alphas, and beta.

Generally, the theta argument is generated internally by the CPTtools functions, while the alphas (or lnAlphas) and beta are passed in by the user. The Peanut package, in particular, allows associating lnAlphas and betas with a node in a graph. The alphas and beta generally have one of two shapes: for Compensatory-shape rules, alphas has one element per parent variable and beta is a scalar; for Offset-shape rules, alphas is a scalar and beta has one element per parent variable.
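
As an illustration of this extensibility, the following sketch defines a hypothetical user-written rule, Inhibitor2(), which follows the Z(theta, alphas, beta) contract by hard-coding the inhibitor threshold and floor value for the two-parent case (Inhibitor2, thresh and beta0 are not part of CPTtools):

## If the first parent is below the hard-coded threshold, return the low
## constant beta0; otherwise use a slope/intercept form in the second parent.
Inhibitor2 <- function(theta, alphas, beta, thresh=-0.5, beta0=-2) {
  theta <- as.matrix(theta)
  ifelse(theta[,1] < thresh, beta0, alphas[2]*theta[,2] - beta)
}
skill <- c("High","Medium","Low")
eThetaFrame(list(S1=skill, S2=skill), log(c(S1=1, S2=1)), 0, Inhibitor2)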

The function isOffsetRule() checks whether a named rule is Offset-shape or Compensatory-shape. There is an internal list of offset rules which can be inspected with getOffsetRule() and manipulated with setOffsetRule().
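
For example:

getOffsetRule()
isOffsetRule("OffsetConjunctive")
isOffsetRule("Compensatory")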

Currently, CPTtools supplies the following rules: Compensatory(), Conjunctive() and Disjunctive() (Compensatory-shape), and OffsetConjunctive() and OffsetDisjunctive() (Offset-shape).

Link Functions

In DiBello's models, the effective theta for an item represents the ability of an examinee to solve the particular problem posed in the task. This value runs from negative to positive infinity, with higher values indicating a more successful outcome. The next step is to map the effective thetas onto probabilities of success. Following the usage of generalized linear models, these mappings are called link functions.

DiBello's original idea was to press models from IRT into service for this step. The first one implemented was Samejima's graded response model. Although this model worked well for observables, it did not work so well for intermediate proficiency variables; this inspired a new normal link function which works more like a regression model. The graded response model also has certain restrictions; in particular, all transitions must have the same discrimination. The partial credit link function was introduced to relax that restriction, and it enables the use of more combination rules, including different combination rules for each transition.

2PL

If the child variable has only two states, then both the graded response and generalized partial credit models collapse into the 2-parameter logistic (2PL) model. This is a common model from item response theory (IRT), which states that the probability that Examinee $i$ gets Item $j$ correct is:

$$ P(X_{ij}|\tilde{\theta_{i}}) = \frac{\exp(D\alpha_j(\tilde{\theta_i}-\beta_j))}{1 + \exp(D\alpha_j(\tilde{\theta_i}-\beta_j))} .$$

The constant $D=1.7$ is chosen so that the logistic function and the normal ogive curve are nearly identical. This allows $\theta_i$ to be interpreted as a standard normal value, with $\theta=0$ as the population median and $\theta=1$ representing an individual one standard deviation above the median. The following example shows the curve.

inv.logit <- function (z) {1/(1+exp(-1.7*z))}
a <- 1 ## Discrimination
b <- 0 ## Difficulty
curve(inv.logit(a*(x-b)),xlim=c(-3,3),ylim=c(0,1),
      main=paste("2 Parameter Logistic: a=",round(a,2),
                 " b=",round(b,2)),
      xlab="Ability (theta)", ylab="Probability of success.")

Note that the difficulty parameter is on the same scale as the ability parameter and represents the ability at which examinees will have a 50-50 chance of success. The discrimination describes how quickly the probability rises with increasing ability, and is often related to how many non-focal knowledge, skills and abilities are required to solve the problem.

Note that the model can be rewritten as $P(X_{ij}|\tilde{\theta_{i}}) = 1/(1+\exp(-D\cdot Z_j(\tilde{\theta_i})))$. Here $Z_j(\cdot)$ is the combination function, which has the difficulty and discrimination parameters built into it. This more cleanly separates the link function from the combination rule.
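
The following sketch shows the separation for a single parent: the combination function produces the effective theta $Z_j$, and the 2PL link maps it to a probability (the alpha and beta values are illustrative):

z <- Compensatory(matrix(effectiveThetas(5), ncol=1), 1, 0)  ## alpha=1, beta=0
round(1/(1 + exp(-1.7*z)), 3)  ## probability of success for each parent state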

Graded Response

The graded response model is a generalization of the 2PL model for ordered categorical data introduced by Samejima (1969). Let the possible values for the observable $X_{ij}$ be $\{0, 1, \ldots, K\}$. Each of the events $X_{ij} \ge k$ is modeled with a logistic curve: $$ \Pr(X_{ij} \ge k \mid \tilde{\theta_{i}}) = 1/(1+\exp(-D\cdot Z_{jk}(\tilde{\theta_{i}}))) ,$$ for $k=1, \ldots, K$. The probability that $X_{ij}=k$ is then found by differencing adjacent curves.
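
The differencing step can be sketched in a few lines of base R for a three-category observable (the transition difficulties are illustrative, and a common discrimination of 1 is assumed so that the cumulative curves do not cross):

theta <- 0.5
b <- c(-0.5, 0.5)                       ## difficulties for X >= 1 and X >= 2
cum <- 1/(1 + exp(-1.7*(theta - b)))    ## Pr(X >= 1), Pr(X >= 2)
probs <- c(1, cum) - c(cum, 0)          ## Pr(X = 0), Pr(X = 1), Pr(X = 2)
round(probs, 3)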

Generalized Partial Credit

Multiple Combination Rules

Normal Offset

CPT Construction Functions

DPC

Earlier Graded Response Functions

Other Models

Peanut Framework

References

Works Cited

Almond, R.G., Mislevy, R.J., Steinberg, L.S., Yan, D. and Williamson, D.M. (2015). Bayesian Networks in Educational Assessment. Springer. Chapter 8.

Almond, R.G., DiBello, L., Jenkins, F., Mislevy, R.J., Senturk, D., Steinberg, L.S. and Yan, D. (2001). Models for Conditional Probability Tables in Educational Assessment. In Jaakkola, T. and Richardson, T. (Eds.), Artificial Intelligence and Statistics 2001, Morgan Kaufmann, 137–143.

Muraki, E. (1992). A Generalized Partial Credit Model: Application of an EM Algorithm. Applied Psychological Measurement, 16 159-176. DOI: 10.1177/014662169201600206

Samejima, F. (1969) Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph No. 17, 34, (No. 4, Part 2).

List of Symbols

List of functions


