```r
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE,
                      fig.width = 10, fig.height = 6)
library(magrittr)
library(IRRsim)
library(knitr)
library(ggplot2)
set.seed(2112)
linebreak <- "<br/>"
```
- Percent Rater Agreement (PRA)
- Intraclass Correlation (ICC)
- Cohen's Kappa
- Fleiss' Kappa
ICC variants differ along three dimensions (see the table below):

- Models: one-way random effects, two-way random effects, or two-way mixed effects
- Number of measurements: single measures or the average of $k$ measures
- Consistency or absolute agreement
| IRR Statistic | Description | Formula |
|:------------------|:----------------------|:---------------------------------:|
| Percent Agreement | One-way random effects; absolute agreement | $\frac{number\ of\ observations\ agreed\ upon}{total\ number\ of\ observations}$ |
| ICC(1,1) | One-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_W}{MS_R + (k - 1)MS_W}$ |
| ICC(2,1) | Two-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E)}$ |
| ICC(3,1) | Two-way mixed effects; consistency; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E}$ |
| ICC(1,k) | One-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_W}{MS_R}$ |
| ICC(2,k) | Two-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}}$ |
| ICC(3,k) | Two-way mixed effects; consistency; average measures | $\frac{MS_R - MS_E}{MS_R}$ |
| Cohen's Kappa | Absolute agreement | $\frac{P_o - P_e}{1 - P_e}$ |
Note. $MS_R$ = mean square for rows; $MS_W$ = mean square for residual sources of variance; $MS_E$ = mean square error; $MS_C$ = mean square for columns; $P_o$ = observed agreement rates; $P_e$ = expected agreement rates.
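As a quick numeric check of the formulas above, the sketch below computes percent agreement and Cohen's Kappa by hand for a small hypothetical two-rater example, then estimates the ICC variants with `psych::ICC()`. The ratings matrix is made up for illustration, and the use of the psych package is an assumption (any ICC implementation could be substituted).

```r
library(psych)   # assumed to be installed; provides ICC()

# Hypothetical ratings: 8 scoring events scored by 2 raters on a 3-point scale
ratings <- matrix(c(1, 1,
                    2, 2,
                    3, 2,
                    1, 1,
                    2, 2,
                    3, 3,
                    1, 2,
                    2, 2), ncol = 2, byrow = TRUE)

# Percent agreement: proportion of events where both raters gave the same score
p_o <- mean(ratings[, 1] == ratings[, 2])

# Expected chance agreement: sum over categories of the product of each rater's
# marginal proportions
p_e <- sum(prop.table(table(factor(ratings[, 1], levels = 1:3))) *
           prop.table(table(factor(ratings[, 2], levels = 1:3))))

# Cohen's Kappa using the formula from the table
c(percent_agreement = p_o, kappa = (p_o - p_e) / (1 - p_e))

# The ICC variants (ANOVA-based estimates; avoids the lme4 dependency)
ICC(ratings, lmer = FALSE)
```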
data("IRRguidelines") for(i in IRRguidelines[1:7]) { cat(paste0("", i$reference, ' (<b>', ifelse(any(is.na(i$metrics)), 'Not specified', paste0(i$metrics, collapse=', ')), "</b>)\n")) cat(paste0("<li>< ", i$breaks[2], ' ', names(i$breaks)[1]), '</li>\n') if(length(i$breaks) > 2) { for(j in seq(2, length(i$breaks) - 1)) { cat(paste0("<li>", i$breaks[j], ' - ', i$breaks[j+1], ' ', names(i$breaks)[j]), '</li>\n') } } cat(paste0("<li> >", i$breaks[length(i$breaks)], ' ', names(i$breaks[length(i$breaks)]), '</li>\n')) cat('\n\n') }
```r
guidelines <- data.frame()
for(i in IRRguidelines) {
    guidelines <- rbind(guidelines, data.frame(
        Reference = i$reference,
        Description = names(i$breaks),
        Metric = ifelse(any(is.na(i$metrics)), 'Not specified', i$metrics[1]),
        Value = unname(i$breaks),
        stringsAsFactors = FALSE
    ))
}
guidelines$Reference <- gsub('; ', '\n', guidelines$Reference)
guidelines[guidelines$Description == 'Reasonable for clinical measurement',]$Description <-
    'Reasonable for\nclinical measurement'
ggplot(guidelines, aes(x = Reference, y = Value, label = Description)) +
    geom_point(shape = 17, size = 3) +
    geom_text(vjust = -1.5, size = 3.5) +
    geom_text(data = guidelines[guidelines$Value > 0,], aes(label = Value), hjust = -0.7, size = 3.5) +
    facet_wrap(~ Metric, nrow = 1, scales = "free_x") +
    ylim(c(0, 1)) +
    ylab("IRR Value") +
    # ggtitle('Guidelines for Interpreting IRR') +
    theme(axis.text.x = element_text(size = 7))
```
In education, ICC(1,1) is often the most appropriate intraclass correlation (ICC) metric to use given that each subject is measured by two randomly selected raters (one-way), single measures are used, and absolute agreement is desired (Shrout & Fleiss, 1979).
Guidelines for interpreting ICC are largely found in the medical literature. When discussing ICC(1,1), Koo and Li (2016) state that "practically, this model is rarely used in clinical reliability analysis because majority of the reliability studies typically involve the same set of raters to measure all subjects" (pp. 156-157).
The guidelines for interpreting ICC may not be appropriate for all forms of ICC. As will be shown, the magnitude of the ICC varies greatly for the same percent agreement.
The `IRRsim` package provides functions to simulate various scoring designs.
```r
devtools::install_github('jbryer/IRRsim')
library(IRRsim)
```
Key functions:

- `simulateRatingMatrix` - Simulate a single rating matrix.
- `simulateIRR` - Simulate many scoring matrices with varying percent rater agreements. S3 methods are implemented for objects returned by `simulateIRR`: `summary`, `plot`, and `as.data.frame`.
- `IRRsim_demo` - Run an interactive Shiny application.

The `simulateRatingMatrix` function simulates a single rating matrix.
For each scoring event (i.e. row within a scoring matrix), `k_per_event` of the `k` available raters are randomly selected to provide scores; the scores of the remaining raters are set to `NA` (missing).

```r
set.seed(2112)
test1 <- simulateRatingMatrix(nLevels = 3, k = 6, k_per_event = 6,
                              agree = 0.6, nEvents = 10)
test1
agreement(test1)
```
```r
set.seed(2112)
test2 <- simulateRatingMatrix(nLevels = 3, k = 6, k_per_event = 2,
                              agree = 0.6, nEvents = 10)
test2
agreement(test2)
```
The `simulateIRR` function simulates many scoring matrices with varying percent rater agreements. For each scoring matrix, IRR statistics are calculated. This function returns an object of class `IRRsim`, which has S3 methods defined for `plot`, `summary`, and `as.data.frame`.
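A minimal usage sketch (parameter names are taken from the calls later in this document; the values here are arbitrary):

```r
# Simulate scoring matrices for 6 raters, 2 raters per event, 3 scoring levels
sim <- simulateIRR(nRaters = 6, nRatersPerEvent = 2, nLevels = 3)
class(sim)                # "IRRsim"
head(as.data.frame(sim))  # one row of IRR statistics per simulated scoring matrix
summary(sim, stat = 'ICC1')
plot(sim, stat = 'ICC1')
```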
For the remainder of the document, we wish to estimate ICC for 6, 9, and 12 raters under the conditions of 3, 5, and 9 scoring levels.
```r
tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3)
tests.5levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 5)
tests.9levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 9)
```
Note that if the parameter `parallel = TRUE` is specified, the function will use multiple cores if available.
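For example, the first simulation above could be run across multiple cores with (a sketch; all other parameters are unchanged):

```r
tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2,
                             nLevels = 3, parallel = TRUE)
```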
```r
# ICC(1,1) vs. percent agreement with 3 scoring levels
plot(tests.3levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')
# 5 scoring levels
plot(tests.5levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')
# 9 scoring levels
plot(tests.9levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')
```
```r
tests.3levels.sum <- summary(tests.3levels, stat = 'ICC1', method = 'quadratic')
summary(tests.3levels.sum$model[[1]]) # k = 6 raters
```
`r round(summary(tests.3levels.sum$model[[1]])$r.squared * 100)`% of the variance in ICC(1,1) is accounted for by percent agreement!
```r
# Uniform response distribution
test1.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3,
                             response.probs = c(.33, .33, .33))
# Lightly skewed response distribution
test4.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3,
                             response.probs = c(.275, .275, .45))
# Moderately skewed response distribution
test2.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3,
                             response.probs = c(.2, .2, .6))
# Highly skewed response distribution
test3.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3,
                             response.probs = c(.1, .1, .8))
```
```r
test1.3levels.df <- as.data.frame(test1.3levels)
test2.3levels.df <- as.data.frame(test2.3levels)
test3.3levels.df <- as.data.frame(test3.3levels)
test4.3levels.df <- as.data.frame(test4.3levels)
# Combine the different response distributions
test1.3levels.df$ResponseDist <- 'Uniform'
test2.3levels.df$ResponseDist <- 'Moderately Skewed'
test3.3levels.df$ResponseDist <- 'Highly Skewed'
test4.3levels.df$ResponseDist <- 'Lightly Skewed'
test.3levels.df <- rbind(test1.3levels.df, test2.3levels.df,
                         test3.3levels.df, test4.3levels.df)
```
```r
guide <- data.frame(
    labels = paste0(names(IRRguidelines[['Cicchetti']]$breaks[-1]), ' (',
                    unname(IRRguidelines[['Cicchetti']]$breaks[-1]), ')'),
    y = unname(IRRguidelines[['Cicchetti']]$breaks[-1]),
    stringsAsFactors = FALSE
)
ggplot(test.3levels.df, aes(x = agreement, y = ICC1, color = ResponseDist)) +
    geom_hline(yintercept = c(0.4, 0.6, 0.75), alpha = 0.5) +
    geom_text(data = guide, color = 'black', hjust = 0, vjust = -0.5,
              aes(label = labels, y = y, x = 0)) +
    geom_smooth(method = 'loess', se = FALSE, aes(linetype = factor(k))) +
    scale_linetype('n Raters') +
    # scale_color_brewer(palette = 'Set3') +
    xlab('Percent Agreement') +
    ggtitle('ICC(1,1) vs Percent Agreement', subtitle = "Cicchetti's Guidelines Provided")
```
# data("IRRsimData", package = "IRRsim") if(file.exists('../data-releases/IRRsimData.rda')) { load('../data-releases/IRRsimData.rda') } else { tmpfile <- tempfile() download.file('https://github.com/jbryer/IRRsim/releases/download/IRRsimData/IRRsimData.rda', tmpfile) load(tmpfile) } levels(IRRsimData$nLevels) <- c('Two', 'Three', 'Four', 'Five') IRRsimData$nLevels <- as.character(IRRsimData$nLevels) levels(IRRsimData$k) <- c('Two', 'Four', 'Eight', 'Sixteen') IRRsimData$k <- as.character(IRRsimData$k)
Simulated `r prettyNum(nrow(IRRsimData), big.mark = ',')` scoring matrices.
For each scoring design (i.e. $k$, $k_m$, & number of scoring levels), the following model was estimated:
$$ICC = \beta_2 PRA^2 + \beta_1 PRA + \beta_0$$
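This model corresponds to an `lm()` call of the form below, fit separately for each design in the chunk further down; the single combined fit here is only a sketch using the 3-level simulation from above.

```r
# Quadratic regression of ICC(1,1) on percent agreement
tests.3levels.df <- as.data.frame(tests.3levels)
fit <- lm(ICC1 ~ agreement + I(agreement^2), data = tests.3levels.df)
summary(fit)$r.squared
```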
For ICC(1,1), ICC(2,1), and ICC(3,1), $R^2 \ge 0.80$.
For ICC(1,k), ICC(2,k), and ICC(3,k), the relationship is weaker as $k_m$ approaches $k$.
However, when $k_m < \frac{k}{2}$, $R^2 \ge 0.80$.
```r
rsquared.df <- data.frame()
for(n in unique(IRRsimData$nLevels)) {
    tmp.n <- IRRsimData[IRRsimData$nLevels == n,]
    for(k in unique(tmp.n$k)) {
        tmp.k <- tmp.n[tmp.n$k == k,]
        for(k_per_event in unique(tmp.k$k_per_event)) {
            tmp <- tmp.k[tmp.k$k_per_event == k_per_event,]
            icc1.out <- lm(ICC1 ~ agreement + I(agreement^2), data = tmp)
            icc2.out <- lm(ICC2 ~ agreement + I(agreement^2), data = tmp)
            icc3.out <- lm(ICC3 ~ agreement + I(agreement^2), data = tmp)
            icc1k.out <- lm(ICC1k ~ agreement + I(agreement^2), data = tmp)
            icc2k.out <- lm(ICC2k ~ agreement + I(agreement^2), data = tmp)
            icc3k.out <- lm(ICC3k ~ agreement + I(agreement^2), data = tmp)
            rsquared.df <- rbind(rsquared.df, data.frame(
                nLevels = n,
                k = k,
                k_per_event = k_per_event,
                ICC1 = summary(icc1.out)$r.squared,
                ICC2 = summary(icc2.out)$r.squared,
                ICC3 = summary(icc3.out)$r.squared,
                ICC1k = summary(icc1k.out)$r.squared,
                ICC2k = summary(icc2k.out)$r.squared,
                ICC3k = summary(icc3k.out)$r.squared
            ))
        }
    }
}
r.squared.summary <- data.frame(
    Mean = apply(rsquared.df[,-c(1:3)], 2, mean),
    Median = apply(rsquared.df[,-c(1:3)], 2, median),
    Minimum = apply(rsquared.df[,-c(1:3)], 2, min),
    Maximum = apply(rsquared.df[,-c(1:3)], 2, max),
    stringsAsFactors = FALSE
)
knitr::kable(r.squared.summary, digits = 2)
```
An interactive Shiny web application was developed to facilitate the simulation of scoring matrices for any particular scoring design.
```r
IRRsim_demo()
```
ICC and percent agreement are highly correlated, with PRA accounting for at least 80% of the variance in ICC, and more than 90% when $k_m = 2$.
Skewness in the response distribution does not appear to impact ICC. This may not be desirable: as one score increases in likelihood (say, response C of four possible scores), the raters' agreement by chance should also increase. ICC, which is supposed to account for agreement-by-chance, does not do so in this situation.
When $k_m = 2$ (and more generally when $k_m < \frac{k}{2}$), there is a substantial penalty in ICC as $k$ increases.
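This penalty can be seen directly in the 3-level simulations from earlier (a sketch, assuming `test.3levels.df` from the response-distribution chunk is still in memory): compare mean ICC(1,1) across values of $k$ within a narrow band of percent agreement.

```r
# Mean ICC(1,1) by number of raters for roughly comparable percent agreement
band <- subset(test.3levels.df, agreement >= 0.55 & agreement <= 0.65)
aggregate(ICC1 ~ k, data = band, FUN = mean)
```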
The published guidelines for interpreting ICC may not be appropriate for one-way designs (i.e. ICC(1,1) and ICC(1,k)). For many common scoring designs in education (e.g. $k_m = 2$, $k > 20$), achieving "excellent" (Cicchetti, 2001) inter-rater reliability is not possible.
The Shiny application is a tool to help researchers interpret their ICC results in relation to the percent agreement achieved.
Assistant Professor and Associate Director
City University of New York
jason@bryer.org
DAACS was developed under grant #P116F150077 from the U.S. Department of Education. However, the contents do not necessarily represent the policy of the U.S. Department of Education, and you should not assume endorsement by the Federal Government.