knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE, 
                      fig.width = 10,
                      fig.height = 6)

library(magrittr)
library(IRRsim)
library(knitr)
library(ggplot2)

set.seed(2112)

linebreak <- "<br/>"

Motivation

Guiding Research Questions

  1. What is the relationship between intraclass correlation (ICC) and percent rater agreement (PRA)?
  2. Are the published guidelines for interpreting ICC appropriate for all rating designs?

Measuring Inter-rater Reliability {.smaller}

Percent Rater Agreement (PRA)

Intraclass Correlation (ICC)

Cohen's Kappa

Fleiss' Kappa

Components of Measuring IRR {.smaller}

Models: one-way random effects, two-way random effects, or two-way mixed effects.

Number of measurements: single measures or the average of multiple measures.

Consistency or absolute agreement: whether systematic differences between raters count as disagreement (absolute agreement) or are ignored (consistency).

{.smaller}

| IRR Statistic | Description | Formula |
|:------------------|:----------------------|:---------------------------------:|
| Percent Agreement | Absolute agreement | $\frac{number\ of\ observations\ agreed\ upon}{total\ number\ of\ observations}$ |
| ICC(1,1) | One-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_W}{MS_R + (k - 1)MS_W}$ |
| ICC(2,1) | Two-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E)}$ |
| ICC(3,1) | Two-way mixed effects; consistency; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E}$ |
| ICC(1,k) | One-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_W}{MS_R}$ |
| ICC(2,k) | Two-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}}$ |
| ICC(3,k) | Two-way mixed effects; consistency; average measures | $\frac{MS_R - MS_E}{MS_R}$ |
| Cohen's Kappa | Absolute agreement | $\frac{P_o - P_e}{1 - P_e}$ |

Note. $MS_R$ = mean square for rows; $MS_W$ = mean square for residual sources of variance; $MS_E$ = mean square error; $MS_C$ = mean square for columns; $P_o$ = observed agreement rates; $P_e$ = expected agreement rates.
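For reference, the six ICC forms in this table correspond to the six estimates reported by the psych package's ICC() function. A minimal illustration (the small rating matrix below is hypothetical):

library(psych)
# Four scoring events (rows) scored by three raters (columns) on a 3-point scale.
ratings <- matrix(c(1, 2, 2,
                    2, 2, 3,
                    3, 3, 3,
                    1, 1, 2), ncol = 3, byrow = TRUE)
ICC(ratings) # reports ICC1, ICC2, ICC3 and their average-measure counterparts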

Guidelines for Interpreting IRR {.columns-2 }

data("IRRguidelines")
for(i in IRRguidelines[1:7]) {
    cat(paste0("", i$reference, ' (<b>', ifelse(any(is.na(i$metrics)), 'Not specified', paste0(i$metrics, collapse=', ')), "</b>)\n"))
    cat(paste0("<li>&lt; ", i$breaks[2], ' ', names(i$breaks)[1]), '</li>\n')
    if(length(i$breaks) > 2) {
        for(j in seq(2, length(i$breaks) - 1)) {
            cat(paste0("<li>", i$breaks[j], ' - ', i$breaks[j+1], ' ', names(i$breaks)[j]), '</li>\n')
        }
    }
    cat(paste0("<li> &gt;", i$breaks[length(i$breaks)], ' ', names(i$breaks[length(i$breaks)]), '</li>\n'))
    cat('\n\n')
}

Guidelines for Interpreting IRR

# Reshape the published guidelines into a long data frame for plotting.
guidelines <- data.frame()
for(i in IRRguidelines) {
    guidelines <- rbind(guidelines, data.frame(
        Reference = i$reference,
        Description = names(i$breaks),
        Metric = ifelse(any(is.na(i$metrics)), 'Not specified', i$metrics[1]),
        Value = unname(i$breaks),
        stringsAsFactors = FALSE
    ))
}
guidelines$Reference <- gsub('; ', '\n', guidelines$Reference)
guidelines[guidelines$Description == 'Reasonable for clinical measurement',]$Description <- 'Reasonable for\nclinical measurement'
ggplot(guidelines, aes(x = Reference, y = Value, label = Description)) + 
    geom_point(shape = 17, size = 3) + 
    geom_text(vjust = -1.5, size = 3.5) +
    geom_text(data=guidelines[guidelines$Value > 0,],
              aes(label = Value), hjust = -0.7, size = 3.5) +
    facet_wrap(~ Metric, nrow = 1, scales = "free_x") +
    ylim(c(0, 1)) +
    ylab("IRR Value") +
    # ggtitle('Guidelines for Interpreting IRR') +
    theme(axis.text.x = element_text(size = 7))

Guidelines for Education

In education, ICC(1,1) is often the most appropriate intraclass correlation (ICC) metric to use, given that each subject is measured by two randomly selected raters (one-way), single measures are used, and absolute agreement is desired (Shrout & Fleiss, 1979).

Guidelines for interpreting ICC are largely found in the medical literature. Discussing ICC(1,1), Koo and Li (2016) state that "practically, this model is rarely used in clinical reliability analysis because majority of the reliability studies typically involve the same set of raters to measure all subjects" (pp. 156-157).

The guidelines for interpreting ICC may not be appropriate for all forms of ICC. As will be shown, the magnitude of the ICC varies greatly for the same percent agreement.

The IRRsim R Package {.smaller}

The IRRsim package provides functions to simulate various scoring designs.

devtools::install_github('jbryer/IRRsim')
library(IRRsim)

Key functions:

  • simulateRatingMatrix: simulate a single rating matrix.
  • simulateIRR: simulate many rating matrices across a range of percent agreements.
  • agreement: calculate the percent rater agreement of a rating matrix.
  • IRRsim_demo: launch the interactive Shiny application.

simulateRatingMatrix {.smaller}

Simulate a single rating matrix.

For each scoring event (i.e. row within a scoring matrix):

  1. One of k raters is randomly selected.
  2. A score, y, is randomly selected from the response distribution (a uniform distribution by default).
  3. A random number, x, between 0 and 1 is generated.
    • If x is less than the specified desired percent agreement, the remaining values in the row are set to y.
    • Otherwise, scores for the remaining raters are randomly selected from the response distribution.
  4. Repeat steps 1 through 3 for the remaining scoring events.
  5. If $k_m < k$, then $k - k_m$ scores per row are set to NA (missing).
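A minimal sketch of this procedure, independent of the package's actual implementation (the helper name simulate_matrix_sketch is hypothetical; simulateRatingMatrix() is the real function):

simulate_matrix_sketch <- function(nLevels, k, k_per_event, agree, nEvents,
                                   response.probs = rep(1 / nLevels, nLevels)) {
    m <- matrix(NA_integer_, nrow = nEvents, ncol = k)
    for(event in seq_len(nEvents)) {
        rater <- sample(k, 1)                           # step 1: pick one rater
        y <- sample(nLevels, 1, prob = response.probs)  # step 2: draw a score
        m[event, rater] <- y
        if(runif(1) < agree) {                          # step 3: x < agree, so agree...
            m[event, -rater] <- y
        } else {                                        # ...otherwise score independently
            m[event, -rater] <- sample(nLevels, k - 1, replace = TRUE,
                                       prob = response.probs)
        }
        if(k_per_event < k) {                           # step 5: k - k_m scores missing
            m[event, sample(k, k - k_per_event)] <- NA
        }
    }
    m
}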

Simulating Scoring Matrices { .columns-2 .smaller }

set.seed(2112)
test1 <- simulateRatingMatrix(
    nLevels = 3, k = 6, k_per_event = 6,
    agree = 0.6, nEvents = 10)
test1
agreement(test1)
set.seed(2112)
test2 <- simulateRatingMatrix(
    nLevels = 3, k = 6, k_per_event = 2,
    agree = 0.6, nEvents = 10)
test2
agreement(test2)

simulateIRR

Simulates many scoring matrices with varying percent rater agreement. For each scoring matrix, IRR statistics are calculated. This function returns an object of class IRRsim, which has S3 methods defined for plot, summary, and as.data.frame.

For the remainder of the document, we estimate ICC for 6, 9, and 12 raters under 3, 5, and 9 scoring levels.

tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3)
tests.5levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 5)
tests.9levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 9)

Note that if the parameter parallel = TRUE is specified, the function will use multiple cores if available.
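For example, the first simulation above could be rerun using multiple cores:

tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2,
                             nLevels = 3, parallel = TRUE)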

{.centered}

plot(tests.3levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.centered}

plot(tests.5levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.centered}

plot(tests.9levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.smaller}

tests.3levels.sum <- summary(tests.3levels, stat = 'ICC1', method = 'quadratic')
summary(tests.3levels.sum$model[[1]]) # k = 6 raters

`r round(summary(tests.3levels.sum$model[[1]])$r.squared * 100)`% of the variance in ICC(1,1) is accounted for by percent agreement!
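The fitted model can also be used to estimate ICC(1,1) at a given percent agreement (assuming, as in the models later in this document, that the predictor is named agreement):

# Predicted ICC(1,1) for k = 6 raters at 70% agreement.
predict(tests.3levels.sum$model[[1]],
        newdata = data.frame(agreement = 0.7))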

Simulating Different Response Distributions {.smaller}

# Uniform response distribution
test1.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.33, .33, .33))
# Lightly skewed response distribution
test4.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.275, .275, .45))
# Moderately skewed response distribution
test2.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.2, .2, .6))
# Highly skewed response distribution
test3.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.1, .1, .8))
test1.3levels.df <- as.data.frame(test1.3levels)
test2.3levels.df <- as.data.frame(test2.3levels)
test3.3levels.df <- as.data.frame(test3.3levels)
test4.3levels.df <- as.data.frame(test4.3levels)
# Combine the different response distributions
test1.3levels.df$ResponseDist <- 'Uniform'
test2.3levels.df$ResponseDist <- 'Moderately Skewed'
test3.3levels.df$ResponseDist <- 'Highly Skewed'
test4.3levels.df$ResponseDist <- 'Lightly Skewed'
test.3levels.df <- rbind(test1.3levels.df, test2.3levels.df, test3.3levels.df, test4.3levels.df)
[Plots of the simulation results for each response distribution: Uniform Distribution, Lightly Skewed, Moderately Skewed, and Highly Skewed]

{.centered}

guide <- data.frame(
    labels = paste0(names(IRRguidelines[['Cicchetti']]$breaks[-1]), ' (',
                    unname(IRRguidelines[['Cicchetti']]$breaks[-1]), ')'),
    y = unname(IRRguidelines[['Cicchetti']]$breaks[-1]),
    stringsAsFactors = FALSE
)
ggplot(test.3levels.df, aes(x = agreement, y = ICC1, color = ResponseDist)) +
    # Horizontal reference lines at Cicchetti's thresholds (0.4, 0.6, 0.75).
    geom_hline(yintercept = guide$y, alpha = 0.5) +
    geom_text(data = guide, color = 'black', hjust = 0, vjust = -0.5,
              aes(label = labels, y = y, x = 0)) +
    geom_smooth(method = 'loess', se = FALSE, aes(linetype = factor(k))) +
    scale_linetype('n Raters') +
    # scale_color_brewer(palette = 'Set3') +
    xlab('Percent Agreement') +
    ggtitle('ICC(1,1) vs Percent Agreement',
            subtitle = "Cicchetti's Guidelines Provided")

Modeling ICC from Percent Agreement {.columns-2 .smaller}

# data("IRRsimData", package = "IRRsim")
if(file.exists('../data-releases/IRRsimData.rda')) {
    load('../data-releases/IRRsimData.rda')
} else {
    tmpfile <- tempfile()
    download.file('https://github.com/jbryer/IRRsim/releases/download/IRRsimData/IRRsimData.rda', tmpfile)
    load(tmpfile)
}
levels(IRRsimData$nLevels) <- c('Two', 'Three', 'Four', 'Five')
IRRsimData$nLevels <- as.character(IRRsimData$nLevels)
levels(IRRsimData$k) <- c('Two', 'Four', 'Eight', 'Sixteen')
IRRsimData$k <- as.character(IRRsimData$k)

Simulated `r prettyNum(nrow(IRRsimData), big.mark = ',')` scoring matrices (each 100 x $k$) with $k$ between 2 and 16 raters; between 2 and 5 scoring levels; and four response distributions (i.e. uniform, lightly skewed, moderately skewed, and highly skewed).

For each scoring design (i.e. $k$, $k_m$, and number of scoring levels), the following model was estimated:

$$ICC = \beta_2 \cdot PRA^2 + \beta_1 \cdot PRA + \beta_0$$

For ICC(1,1), ICC(2,1), and ICC(3,1), $R^2 \ge 0.80$

For ICC(1,k), ICC(2,k), and ICC(3,k), the relationship is weaker as $k_m$ approaches $k$.

However, when $k_m < \frac{k}{2}$, $R^2 \ge 0.80$.

# For each scoring design, regress each ICC statistic on percent agreement
# (quadratic model) and record the R-squared.
rsquared.df <- data.frame()
for(n in unique(IRRsimData$nLevels)) {
    tmp.n <- IRRsimData[IRRsimData$nLevels == n,]
    for(k in unique(tmp.n$k)) {
        tmp.k <- tmp.n[tmp.n$k == k,]
        for(k_per_event in unique(tmp.k$k_per_event)) {
            tmp <- tmp.k[tmp.k$k_per_event == k_per_event,]
            icc1.out <- lm(ICC1 ~ agreement + I(agreement^2), data = tmp)
            icc2.out <- lm(ICC2 ~ agreement + I(agreement^2), data = tmp)
            icc3.out <- lm(ICC3 ~ agreement + I(agreement^2), data = tmp)
            icc1k.out <- lm(ICC1k ~ agreement + I(agreement^2), data = tmp)
            icc2k.out <- lm(ICC2k ~ agreement + I(agreement^2), data = tmp)
            icc3k.out <- lm(ICC3k ~ agreement + I(agreement^2), data = tmp)
            rsquared.df <- rbind(rsquared.df, data.frame(
                nLevels = n,
                k = k,
                k_per_event = k_per_event,
                ICC1 = summary(icc1.out)$r.squared,
                ICC2 = summary(icc2.out)$r.squared,
                ICC3 = summary(icc3.out)$r.squared,
                ICC1k = summary(icc1k.out)$r.squared,
                ICC2k = summary(icc2k.out)$r.squared,
                ICC3k = summary(icc3k.out)$r.squared
            ))
        }
    }
}
# Summarize the R-squared values across all scoring designs.
r.squared.summary <- data.frame(
    `Mean` = apply(rsquared.df[,-c(1:3)], 2, mean),
    `Median` = apply(rsquared.df[,-c(1:3)], 2, median),
    `Minimum` = apply(rsquared.df[,-c(1:3)], 2, min),
    `Maximum` = apply(rsquared.df[,-c(1:3)], 2, max),
    stringsAsFactors = FALSE
)
knitr::kable(r.squared.summary, digits = 2)

Shiny Application {.smaller}

An interactive Shiny web application was developed to facilitate the simulation of scoring matrices for any particular scoring design.

IRRsim_demo()
![](IRRsimShinyApp.png)

Discussion { .smaller }

  • ICC and percent agreement are highly correlated, with PRA accounting for at least 80% of the variance in ICC, and more than 90% when $k_m = 2$.

  • Skewness in the response distribution does not appear to impact ICC. This may not be desirable: as one score becomes more likely (say, response C of four possible scores), the raters' agreement by chance should also increase. ICC, which is supposed to account for chance agreement, does not do so in this situation (see the sketch after this list).

  • When $k_m = 2$ (and more generally when $k_m < \frac{k}{2}$), there is a substantial penalty in ICC as $k$ increases.

  • The published guidelines for interpreting ICC may not be appropriate for one-way designs (i.e. ICC(1,1) and ICC(1,k)). For many common scoring designs in education (e.g. $k_m = 2$, $k > 20$), achieving "excellent" (Cicchetti, 2001) inter-rater reliability is not possible.

  • The Shiny application is a tool to help researchers interpret their ICC results in relation to the percent agreement achieved.
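To illustrate the chance-agreement point above: for two independent raters, the probability of agreeing by chance is the sum of the squared response probabilities, which increases as the distribution skews (the probabilities below match those used in the simulations):

# Chance agreement for two independent raters scoring 3 levels.
sum(rep(1/3, 3)^2)    # uniform: 0.33
sum(c(.2, .2, .6)^2)  # moderately skewed: 0.44
sum(c(.1, .1, .8)^2)  # highly skewed: 0.66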

Recommendations

Thank You! {.smaller}


Jason Bryer, Ph.D.

Assistant Professor and Associate Director
City University of New York
jason@bryer.org

Website: irrsim.bryer.org

Source Code: github.com/jbryer/IRRsim





DAACS was developed under grant #P116F150077 from the U.S. Department of Education. However, the contents do not necessarily represent the policy of the U.S. Department of Education, and you should not assume endorsement by the Federal Government.


