```{r setup, include=FALSE}
# mapstyles: Normal: ['First Paragraph']  (from the officedown output-format settings)
knitr::opts_chunk$set(echo = FALSE, warning = FALSE)
requireNamespace("officer")
requireNamespace("rmarkdown")
requireNamespace("officedown")
if (!require(latticeExtra)) {
  install.packages("latticeExtra")
  requireNamespace("latticeExtra")
}
library(magrittr)  # provides the %>% pipe used throughout this report
library(stringr)   # provides str_which(), used when formatting the item parameter table

# Convert a matrix or data frame to a formatted flextable.
fmt_tra_other <- function(ft, reset_width = TRUE){
  n_col <- ncol(ft)
  ft <- as.data.frame(ft) %>% flextable::flextable()
  ft <- flextable::bold(ft, part = "header")
  ft <- flextable::fontsize(ft, size = 10, part = "all")
  ft <- flextable::font(ft, fontname = "Times New Roman")
  ft <- flextable::align_nottext_col(ft, align = "center")
  # Merge header cells, apply the booktabs theme, and center the body.
  ft <- flextable::merge_h(ft, part = "header") %>%
    flextable::merge_v(part = "header") %>%
    flextable::theme_booktabs() %>%
    flextable::bold(part = "header") %>%
    flextable::bold(j = 1, part = "body") %>%
    flextable::align_text_col(align = "center") %>%
    flextable::valign(valign = "center") %>%
    flextable::padding(padding = 2)
  if (reset_width == TRUE) {
    ft <- flextable::autofit(ft) %>%
      flextable::width(j = 1:n_col, width = (16/2.54)/n_col)
  }
  return(ft)
}

# Print centered, bold text (Times New Roman, 12 pt) as an officer paragraph.
bold_print <- function(text){
  text_format <- officer::fp_text(bold = TRUE, font.family = "Times New Roman",
                                  font.size = 12)
  fp <- officer::fp_par(text.align = "center")
  return(officer::fpar(officer::ftext(text, prop = text_format), fp_p = fp))
}
plot_width <- 6.5
```
```{r variables, eval=FALSE}
# Variables available in this document
Response                # Response data
model                   # Selected IRT model
IRT_est_method          # Model parameter estimation method
IRT_person_est_method   # Person parameter estimation method
EFA_method              # EFA estimation method
rotation_method         # EFA factor rotation method
IRT_select_independent  # Selected indicator for independence tests
IRT_itemfit_method      # Selected indicator for item fit tests
IRT_modelfit_relat      # Relative fit indices
IRT_modelfit            # Absolute fit indices
CTT_EFA_eigenvalues     # EFA eigenvalues
IRT_Q3                  # Independence test results
IRT_itemfit             # Item fit results
IRT_itempar             # Item parameters
```
r bold_print("Table of Contents")
r bold_print("List of Figures")
r bold_print("List of Tables")
\newpage
In the field of psychological and educational measurement, two commonly used measurement theories are classical test theory (CTT) and item response theory (IRT). CTT typically relies on the total score as a reflection of the test taker's underlying trait level; it is easy to understand and practical to apply. However, its simplicity also brings inherent limitations: test scores depend heavily on the particular test items, item parameters depend on the particular sample, and item difficulty is not directly comparable with test takers' scores. IRT overcomes some of these limitations at the cost of more extensive mathematical computation, larger samples, and stronger assumptions. It provides estimates of both item parameters and test takers' latent traits, which gives it several specific advantages over CTT.
The 'TestAnaAPP' is an interactive program developed in the R language for test analysis. It addresses the high computational cost and complexity of IRT models and provides practitioners of test quality analysis with a comprehensive set of evaluation indicators for items. Additionally, it offers professional visualization of the results.
'TestAnaAPP' offers a selection of commonly used scoring models, catering to both dichotomous and polytomous data. The dichotomous scoring models consist of the Rasch model, the one-parameter logistic model (1PL), the two-parameter logistic model (2PL), the three-parameter logistic model (3PL), and the four-parameter logistic model (4PL). The polytomous scoring models encompass the graded response model (GRM), the partial credit model (PCM), and the generalized partial credit model (GPCM). These models are widely recognized and utilized in the field.
It is important to note that in IRT, the Rasch model is equivalent to the One-Parameter Logistic (1PL) model. The terms "one-parameter" and "two-parameter" refer to the number of item parameters in the IRT models. The specific models are described below:
Let $X_{ij}$ denote the response of test taker $j$ (where $j = 1,2,3,...,J$) to item $i$ (where $i = 1,2,3,...,I$) in dichotomous scoring data. In this type of data, $X_{ij}$ takes the value 0 (for an incorrect response) or 1 (for a correct response). Moreover, let $\beta_i$ represent the item difficulty parameter, and $\theta_j$ represent the latent trait parameter of the test taker. In IRT analysis, the item difficulty parameters and the test takers' latent trait parameters are placed on the same scale (the latent trait is typically assumed to follow a standard normal distribution), which makes them directly comparable. This comparability makes it possible to judge how difficult an item is for test takers at a specific trait level, based on the relative magnitudes of these two parameters.
The 'TestAnaAPP' program supports the following specific IRT models for dichotomous scoring data:
The Rasch model is employed for items that are scored dichotomously. Its item response function (IRF) was originally defined by Georg Rasch (1960). The formula for the IRF is as follows:
$$P(X_{ij}=1 | \theta_{j}) = \frac{1}{1+e^{(-(\theta_j-\beta_i))}}\tag{1}$$ The Rasch model (1PL) assumes that the probability $P_{ij}$ of test taker $j$ answering item $i$ correctly (represented by $X_{ij}=1$) is determined by the relative difficulty of item $i$ for test taker $j$, denoted as $\theta_j-\beta_i$.
The 2PL model, originally proposed by Birnbaum in 1957, characterizes the psychometric features of items through the parameters of item difficulty ($\beta_i$) and item discrimination ($\alpha_i$). Item discrimination reflects the item's capacity to distinguish between subjects with varying levels of latent traits and is a crucial measure of item quality. The item response function (IRF) of the 2PL model is illustrated below.
$$P(X_{ij}=1 | \theta_{j}) = \frac{1}{1+e^{(-\alpha_i(\theta_j-\beta_i))}}\tag{2}$$
Birnbaum (1968) expanded upon the 2PL model by introducing the item guessing parameter, denoted as $c_i$, and proposing the 3PL model. The item guessing parameter, $c_i$, signifies the likelihood of a subject correctly guessing item $i$. In the 3PL model, the probability $P_{ij}$ that subject $j$ answers item $i$ correctly, defined as $X_{ij}=1$, is influenced by the relative difficulty of item $i$ for subject $j$ ($\theta_j-\beta_i$), the item discrimination level ($\alpha_i$), and the guessing parameter ($c_i$). The item response function (IRF) of the 3PL model is displayed below.
$$P(X_{ij}=1 | \theta_{j}) = c_i+(1-c_i)\frac{1}{1+e^{(-\alpha_i(\theta_j-\beta_i))}}\tag{3}$$
The 3PL model addresses the guessing behavior of subjects with very low ability on items, while the 4PL model expands on this by also accounting for slips made by highly capable subjects. In addition to the components of the 3PL model, the 4PL model introduces an extra item slip parameter, denoted as $1-u_i$, to represent the probability of high-ability subjects making mistakes on item $i$. The formula for the item response function (IRF) of the 4PL model is displayed below.
$$P(X_{ij}=1 | \theta_{j}) = c_i+(u_i-c_i)\frac{1}{1+e^{(-\alpha_i(\theta_j-\beta_i))}}\tag{4}$$
For more details on the 4PL model, refer to Barton and Lord (1981).
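To make the relationship between these four models concrete, the following minimal sketch (illustrative only; the function name `irf_4pl` and the example parameter values are ours, not part of 'TestAnaAPP') implements the 4PL IRF of equation (4). Setting $u_i = 1$ recovers the 3PL of equation (3), additionally setting $c_i = 0$ gives the 2PL of equation (2), and further fixing $\alpha_i = 1$ gives the Rasch/1PL model of equation (1).

```{r, eval=FALSE}
# Generic 4PL item response function; the 1PL-3PL models are special cases.
irf_4pl <- function(theta, beta, alpha = 1, c = 0, u = 1) {
  c + (u - c) / (1 + exp(-alpha * (theta - beta)))
}

# Example: probability of a correct response on a hypothetical item with
# difficulty 0.5 and discrimination 1.2, for theta = -1, 0, and 1.
irf_4pl(theta = c(-1, 0, 1), beta = 0.5, alpha = 1.2)
```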
Let $X_{ikj}$ denote the response of subject $j$ (where $j=1,2,3,...,J$) to item $i$ (where $i=1,2,3,...,I$) in a polytomous data context, where item $i$ has response categories $0,1,2,...,m_i$. Specifically, $X_{ikj}=k$, where $k\in\{0,1,2,...,m_i\}$. Moreover, in the polytomous item response theory (IRT) models, the item threshold (or difficulty) parameters are represented as $\beta_{ih}$, with $h\in\{1,2,...,m_i\}$, while the latent trait parameter for the subject is denoted as $\theta_j$.
The partial credit model (PCM) is a variant of the Rasch model (Masters, 1982) specifically designed to analyze response data with more than two categories. In the PCM, the item parameters consist solely of the threshold (difficulty) parameters $\beta_{ih}$, where $\beta_{ih}$ represents the threshold for category $h$ of item $i$.
The item response function (IRF) for the PCM is represented by the following equation:
$$P(X_{ikj}=k | \theta_{j}) = \frac{e^{\sum_{h=0}^{k}(\theta_j-\beta_{ih})}}{\sum_{c=0}^{m_i}e^{\sum_{h=0}^{c}(\theta_j-\beta_{ih})}}\tag{5}$$
where $P(X_{ikj}=k | \theta_{j})$ represents the probability that subject $j$ responds to item $i$ with category $k$, and the summation term for $h=0$ is defined to be zero.
The generalized partial credit model (GPCM) is a modification of the PCM proposed by Muraki in 1992. The GPCM enables the estimation of item discrimination parameters, denoted as $\alpha_i$. The Item Response Function (IRF) for the GPCM model can be described as:
$$P(X_{ikj}=k | \theta_{j}) = \frac{e^{\sum_{h=0}^{k}[\alpha_i(\theta_j-\beta_{ih})]}}{\sum_{c=0}^{m_i}e^{\sum_{h=0}^{c}[\alpha_i(\theta_j-\beta_{ih})]}}\tag{6}$$
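As a rough illustration of equations (5) and (6), the sketch below (the function name `gpcm_probs`, its arguments, and the example values are hypothetical) computes the category probabilities of the GPCM for a single item and a single $\theta$ value; setting `alpha = 1` reduces it to the PCM.

```{r, eval=FALSE}
# GPCM category probabilities for one item; thresholds = beta_i1, ..., beta_im.
gpcm_probs <- function(theta, thresholds, alpha = 1) {
  # Cumulative sums of alpha * (theta - beta_ih) for k = 0, 1, ..., m
  # (the k = 0 term is 0 by convention).
  numer <- exp(cumsum(c(0, alpha * (theta - thresholds))))
  numer / sum(numer)   # normalize so the category probabilities sum to 1
}

# Example: a four-category item (scores 0-3) with thresholds -1, 0, 1.
gpcm_probs(theta = 0.5, thresholds = c(-1, 0, 1), alpha = 1.3)
```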
Both the PCM and the GPCM directly model the probability of responding in category $k$ of item $i$. In contrast, the GRM models the cumulative probability of responding in category $k$ or higher. The cumulative response probability function of the GRM can be expressed as:
$$P(X_{ij}{\geq} k | \theta_{j}) = \frac{e^{[\alpha_i(\theta_j-\beta_{ik})]}}{1+e^{[\alpha_i(\theta_j-\beta_{ik})]}}\tag{7}$$ where $P(X_{ij}{\geq} k | \theta_{j})$ represents the probability that subject $j$ responds to item $i$ with category $k$ or a higher category. The parameter $\beta_{ik}$ corresponds to the threshold (difficulty) of response category $k$ on item $i$. To calculate the probability of a specific response category, it is necessary to subtract the adjoining cumulative probabilities:
$$P(X_{ij}= k | \theta_{j})=P(X_{ij}{\geq} k | \theta_{j})-P(X_{ij}{\geq} k+1 | \theta_{j})\tag{8}$$ where, by convention, $P(X_{ij}\geq 0 | \theta_j)=1$ and $P(X_{ij}\geq m_i+1 | \theta_j)=0$. It is important to note that the GRM, in contrast to the GPCM and PCM, assumes the ordering $\beta_{ik}\geq\beta_{i(k-1)}$, meaning that the threshold of a higher response category is never lower than that of a lower category.
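The following sketch (again illustrative only, with a hypothetical function name and example values) mirrors equations (7) and (8): the cumulative probabilities are computed first and then differenced to obtain the category probabilities.

```{r, eval=FALSE}
# GRM category probabilities for one item; thresholds must be non-decreasing.
grm_probs <- function(theta, thresholds, alpha = 1) {
  # P(X >= k) for k = 0, ..., m, with P(X >= 0) = 1 and P(X >= m + 1) = 0.
  cum <- c(1, 1 / (1 + exp(-alpha * (theta - thresholds))), 0)
  cum[-length(cum)] - cum[-1]   # adjacent differences give P(X = k)
}

# Example: a four-category item (scores 0-3) with thresholds -1, 0, 1.5.
grm_probs(theta = 0.2, thresholds = c(-1, 0, 1.5), alpha = 1.4)
```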
The results presented in this document were estimated based on the settings you selected in the interactive 'TestAnaAPP' program, along with default options within the program. The following settings were employed:
- Selected IRT model: `r model`
- Model parameter estimation method: `r IRT_est_method`
- Person parameter estimation method: `r IRT_person_est_method`
- EFA estimation method: `r EFA_method`
- EFA factor rotation method: `r rotation_method`
- Indicator selected for the independence test: `r IRT_select_independent`
- Indicator selected for the item fit test: `r IRT_itemfit_method`
The model's fitting results include both relative fit indices and absolute fit indices. Relative fit indices serve as a means to compare the fit of competing models: when two IRT models are fitted to the same dataset, the model with the lower relative fit index values is generally considered to fit the data comparatively better. Absolute fit indices reflect the degree of absolute fit between the model and the data; if they fall within an acceptable range, the model is considered to fit the data well.
The 'TestAnaAPP' program utilizes the 'mirt' package for parameter estimation and calculates relative fit indices, including AIC, SABIC, HQ, BIC, and Log-likelihood. Smaller values of AIC, SABIC, HQ, and BIC indicate a higher degree of model fit. A greater value of Log-likelihood signifies a higher degree of model fit. To employ relative fit indices for IRT model selection, you can opt for various models during parameter estimation in 'TestAnaAPP' and record the corresponding relative fit index magnitudes.
```{r}
fmt_tra_other(ft = IRT_modelfit_relat)
```
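For readers who wish to reproduce such a comparison outside 'TestAnaAPP', the following sketch shows one possible way to do it directly with the 'mirt' package. It assumes the item responses are available as the data frame `Response` (see the variable list at the top of this report); the object names and the choice of the 2PL and 3PL models are ours and purely illustrative.

```{r, eval=FALSE}
library(mirt)
# Fit two competing unidimensional models to the same response data.
fit_2pl <- mirt(Response, model = 1, itemtype = "2PL", verbose = FALSE)
fit_3pl <- mirt(Response, model = 1, itemtype = "3PL", verbose = FALSE)
# Compare AIC, SABIC, HQ, BIC and log-likelihood side by side.
anova(fit_2pl, fit_3pl)
```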
'TestAnaAPP' generates absolute fit indices, including M2, RMSEA, SRMSR, TLI, and CFI. Commonly used reference criteria are:

- SRMSR: a value of SRMSR ≤ 0.05 indicates an acceptable fit between the model and the data.
- TLI and CFI: values greater than 0.90 suggest a satisfactory fit of the model to the data.
```{r}
fmt_tra_other(ft = IRT_modelfit)
```
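As a sketch of how such indices can be obtained with 'mirt' outside the app (assuming the illustrative fitted object `fit_2pl` from the sketch above), the `M2()` function reports the limited-information M2 statistic together with RMSEA, SRMSR, TLI, and CFI.

```{r, eval=FALSE}
# Absolute fit indices for the illustrative model fitted in the earlier sketch.
mirt::M2(fit_2pl)
```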
Common methods for testing unidimensionality include exploratory and confirmatory factor analysis. 'TestAnaAPP' generates exploratory factor analysis (EFA) results for unidimensionality testing. The criterion used to assess conformity to the unidimensionality hypothesis is that the eigenvalue of the first factor should be at least 3-5 times the eigenvalue of the second factor. With `r EFA_method` used for the exploratory factor analysis, the eigenvalue of the first factor is `r round(CTT_EFA_eigenvalues[1,2]/CTT_EFA_eigenvalues[2,2], 2)` times the eigenvalue of the second factor. Therefore, `r ifelse(CTT_EFA_eigenvalues[1,2]/CTT_EFA_eigenvalues[2,2]>=3,"it conforms to the unidimensionality hypothesis.", "it does not conform to the unidimensionality hypothesis and multidimensional IRT should be used for analysis.")`
```{r}
fmt_tra_other(ft = CTT_EFA_eigenvalues)
```
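A rough, stand-alone way to inspect the first-to-second eigenvalue ratio is sketched below; it uses ordinary Pearson correlations of the `Response` data and may therefore differ from the EFA results that 'TestAnaAPP' reports above.

```{r, eval=FALSE}
# Eigenvalues of the inter-item correlation matrix (Pearson; a quick check only).
eig <- eigen(cor(Response, use = "pairwise.complete.obs"))$values
eig[1] / eig[2]   # ratios of roughly 3-5 or more support unidimensionality
```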
'TestAnaAPP' provides the Q3 and LD-X2 statistics for testing the local independence assumption in IRT. Larger absolute values indicate stronger dependence between two items. Because there are no uniform cutoff standards across studies, we do not provide cutoff values for Q3 and LD-X2; readers are encouraged to consult specialized books or papers when interpreting these results.
In this analysis, you selected `r IRT_select_independent` as the statistic for the independence test.
```{r}
Q3_color <- function(x, value = IRTreport_Q3_highlight){
  out <- rep("black", length(x))
  out[abs(x) > value] <- "red"
  out[abs(x) == 1] <- "black"
  out
}
IRT_Q3_print <- cbind("Item" = colnames(IRT_Q3), IRT_Q3) %>% as.data.frame()
fmt_tra_other(IRT_Q3_print, reset_width = FALSE) %>%
  flextable::color(j = 2:ncol(IRT_Q3_print), color = Q3_color)
```
Note: The red color indicates that the absolute value of the Q3 statistic exceeds the threshold value you set on TestAnaAPP.
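For reference, both statistics can also be computed directly with 'mirt', as in the following sketch (assuming the illustrative fitted object `fit_2pl` from the earlier sketch):

```{r, eval=FALSE}
# Local dependence statistics for every item pair of the illustrative model.
residuals(fit_2pl, type = "Q3")   # Q3 statistics
residuals(fit_2pl, type = "LD")   # LD-X2 based statistics
```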
The item fit test is used to assess whether there is a significant disparity between the model-predicted responses and the observed responses. 'TestAnaAPP' provides the $X^2$ and $G^2$ statistics for the item fit test. A $P$ value less than 0.05 denotes a significant disparity between the model-predicted values and the observed data. Furthermore, an RMSEA value greater than 0.1 suggests a substantial discrepancy between the predicted values and the observed data. A comprehensive judgment should combine the testing context with these statistical indicators; note that fitting an alternative model could yield different outcomes.
```{r}
fmt_tra_other(IRT_itemfit)
```
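The sketch below shows how comparable item fit statistics could be requested directly from 'mirt' (again using the illustrative `fit_2pl` object); the output includes the statistics together with their degrees of freedom, RMSEA values, and $P$ values.

```{r, eval=FALSE}
# X2 and G2 item fit statistics for the illustrative model.
mirt::itemfit(fit_2pl, fit_stats = c("X2", "G2"))
```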
Item parameters serve as crucial indicators for assessing item quality. Provided below are the criteria, derived from *The Basics of Item Response Theory Using R*, that can be used to evaluate item discrimination values. Readers can employ these criteria to evaluate the degree of item discrimination in the test under analysis.
`r bold_print("Table: Labels for Item Discrimination Values")``
| Verbal label | Range of values | Typical value |
|:------------:|:---------------:|:-------------:|
| None         | 0               | 0.00          |
| Very low     | 0.01-0.34       | 0.18          |
| Low          | 0.35-0.64       | 0.50          |
| Moderate     | 0.65-1.34       | 1.00          |
| High         | 1.35-1.69       | 1.50          |
| Very high    | >1.70           | 2.00          |
| Perfect      | $+\infty$       | $+\infty$     |
The `r model` model was chosen for this analysis. Parameter estimation was performed using the `r IRT_est_method`. The item parameters are presented below:
```{r}
disc_color <- function(x, value = IRTreport_alpha_highlight){
  out <- rep("black", length(x))
  out[x < value] <- "red"
  out
}
# diff_color <- function(x){
#   out <- rep("white", length(x))
#   out[abs(x) > 4] <- "yellow"
#   out
# }
IRT_itempar_print <- cbind("Item" = IRT_itemfit[, 1], IRT_itempar)
fmt_tra_other(IRT_itempar_print) %>%
  flextable::color(j = colnames(IRT_itempar_print) %>%
                     str_which(pattern = "Discrimination"),
                   color = disc_color)
# %>%
#   flextable::highlight(j = colnames(IRT_itempar) %>%
#                          str_which(pattern = "Difficult"), color = diff_color)
```
Note: The red color indicates that the item discrimination value is below the threshold value you set on TestAnaAPP.
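As an aside, item parameters on the classical IRT metric (discrimination $a$, difficulty $b$, guessing $g$, and upper asymptote $u$) can be extracted from a 'mirt' model as sketched below (using the illustrative `fit_2pl` object from earlier); this is not necessarily the exact call used internally by 'TestAnaAPP'.

```{r, eval=FALSE}
# Item parameters in the traditional IRT parameterization.
coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)$items
```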
IRT assesses item difficulty and participant latent traits on the same scale, enabling comparability. This provides a significant advantage compared to CTT. The Wright map serves as a visual tool for comparing the distribution of item difficulty and latent traits. By comparing the scatter plot distribution of item difficulty with latent trait values, the relative difficulty or ease of the test for the subjects, as well as the difficulty level of the test, can be determined. Additionally, it enables the identification of harder and easier items.
It is important to note that the test can effectively differentiate the subjects' latent trait levels only when it possesses appropriate difficulty. Otherwise, ceiling or floor effects may arise, whereby the measurement tool inaccurately assesses the subjects' true abilities, resulting in all subjects receiving either high or low scores.
```{r}
print(IRT_wright)
```
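A rough, do-it-yourself counterpart to the Wright map is sketched below (illustrative only, using the `fit_2pl` object assumed earlier): the estimated latent trait distribution is drawn as a histogram and the item difficulties are overlaid as points on the same scale.

```{r, eval=FALSE}
theta_hat <- mirt::fscores(fit_2pl)[, 1]                               # person estimates
b_hat <- coef(fit_2pl, IRTpars = TRUE, simplify = TRUE)$items[, "b"]   # item difficulties
hist(theta_hat, breaks = 30, xlab = expression(theta),
     main = "Latent trait distribution vs. item difficulty")
points(b_hat, rep(0, length(b_hat)), pch = 17, col = "red")            # items along the axis
```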
The item characteristic curve (ICC) visualizes the model-predicted probability that a subject obtains a specific score on an item. It shows how this probability varies across levels of the latent trait, and thereby provides insight into how each item behaves for subjects with different trait levels. Detailed guidance on interpreting the ICC can be found in specialized textbooks.
```{r}
print(IRT_ICC)
```
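Comparable trace-line (ICC) plots can also be drawn directly with 'mirt', as sketched below for the illustrative `fit_2pl` object; the item number is arbitrary.

```{r, eval=FALSE}
mirt::itemplot(fit_2pl, item = 1, type = "trace")  # ICC for a single item
plot(fit_2pl, type = "trace")                      # ICCs for all items
```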
Item information quantifies how much each item contributes to the precision with which the latent trait is measured; it is inversely related to measurement error. Higher values indicate a greater contribution of the item to accurate measurement of the subjects' latent trait, and the amount of information is closely related to the item's discrimination and difficulty. The item information curve (IIC) visualizes item information across the latent trait continuum and allows readers to assess each item's contribution to measurement precision at a specific trait level. Readers seeking a more comprehensive understanding of the IIC can consult specialized textbooks.
```{r}
print(IRT_IIC)
```
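For readers working outside the app, item information curves can likewise be produced with 'mirt' (again assuming the illustrative `fit_2pl` object):

```{r, eval=FALSE}
mirt::itemplot(fit_2pl, item = 1, type = "info")  # information curve for one item
plot(fit_2pl, type = "infotrace")                 # information curves for all items
```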
Test information is a quantitative measure of test accuracy in IRT, calculated by summing the information obtained from each item. A higher test information value indicates a more precise measurement of the latent trait for the subjects being assessed. 'TestAnaAPP' offers test information curves (TIC) and measurement error curves as graphical representations to demonstrate the quality of the test. A consistent conversion relationship exists between test information ($I(\theta)$) and measurement error ($SE$), which can be expressed by the following formula:
$$SE = 1/\sqrt{I(\theta)}$$
```{r}
print(IRT_TIC)
```
It is important to note that users can download the item information and test information from the user interface. This allows them to create custom figures using the downloaded files.
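As one way to build such custom figures, the following sketch (assuming the illustrative `fit_2pl` object from the earlier sketches) evaluates the test information over a grid of $\theta$ values and converts it to the standard error using the formula above.

```{r, eval=FALSE}
theta_grid <- matrix(seq(-4, 4, by = 0.1))            # grid of latent trait values
info <- mirt::testinfo(fit_2pl, Theta = theta_grid)   # test information I(theta)
se <- 1 / sqrt(info)                                  # SE = 1 / sqrt(I(theta))
plot(theta_grid, info, type = "l",
     xlab = expression(theta), ylab = "Test information")
plot(theta_grid, se, type = "l",
     xlab = expression(theta), ylab = "Standard error of measurement")
```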
Note: 'TestAnaAPP' is mainly designed to provide convenient item and test analysis for practitioners in psychological and educational measurement. The purpose of this report is to present the analysis results completely and clearly; providing a universal and optimal interpretation of the results is not the primary objective of 'TestAnaAPP'. Therefore, while we hope users benefit from the user-friendly nature of this program, we also encourage a cautious attitude toward the interpretation of results to avoid potential biases. Additionally, if you discover any errors in this program, we would greatly appreciate it if you contact us (jiangyouxiang34@163.com).
\newpage
Barton, M. A., & Lord, F. M. (1981). An upper asymptote for the three-parameter logistic item response model (No. 150–453). Princeton, NJ: Educational Testing Service.
Birnbaum, A. (1957). Efficient design and use of tests of a mental ability for various decision making problems (Series Report No. 58-16). Randolph Air Force Base, TX: USAF School of Aviation Medicine.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick (Eds.), Statistical theories of mental test scores (pp. 397–479). Reading, MA: Addison-Wesley.
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149–174.
Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159–176.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danmarks Paedagogiske Institut. (Republished in 1980 by the University of Chicago Press.)