knitr::opts_chunk$set(echo = TRUE, 
                      warning = FALSE, 
                      message = FALSE, 
                      fig.width = 10,
                      fig.height = 6)

library(magrittr)
library(IRRsim)
library(knitr)
library(ggplot2)

set.seed(2112)

linebreak <- "<br/>"

Motivation

Guiding Research Questions

  1. What is the relationship between intraclass correlation (ICC) and percent rater agreement (PRA)?
  2. Are the published guidelines for interpreting ICC appropriate for all rating designs?

Measuring Inter-rater Reliability {.smaller}

Percent Rater Agreement (PRA)

Intraclass Correlation (ICC)

Cohen's Kappa

Fleiss' Kappa

Components of Measuring IRR {.smaller}

Models: one-way random effects, two-way random effects, or two-way mixed effects.

Number of measurements: single measures or the average of multiple measures.

Consistency or absolute agreement: whether systematic differences between raters count as disagreement (absolute agreement) or are ignored (consistency).

{.smaller}

| IRR Statistic | Description | Formula |
|:------------------|:----------------------|:---------------------------------:|
| Percent Agreement | Absolute agreement | $\frac{number\ of\ observations\ agreed\ upon}{total\ number\ of\ observations}$ |
| ICC(1,1) | One-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_W}{MS_R + (k - 1)MS_W}$ |
| ICC(2,1) | Two-way random effects; absolute agreement; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E + \frac{k}{n}(MS_C - MS_E)}$ |
| ICC(3,1) | Two-way mixed effects; consistency; single measures | $\frac{MS_R - MS_E}{MS_R + (k - 1)MS_E}$ |
| ICC(1,k) | One-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_W}{MS_R}$ |
| ICC(2,k) | Two-way random effects; absolute agreement; average measures | $\frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}}$ |
| ICC(3,k) | Two-way mixed effects; consistency; average measures | $\frac{MS_R - MS_E}{MS_R}$ |
| Cohen's Kappa | Absolute agreement | $\frac{P_o - P_e}{1 - P_e}$ |

Note. $MS_R$ = mean square for rows; $MS_W$ = mean square for residual sources of variance; $MS_E$ = mean square error; $MS_C$ = mean square for columns; $P_o$ = observed agreement rates; $P_e$ = expected agreement rates.
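For reference, the six ICC forms in this table correspond to the six estimates reported by the psych package's ICC() function. A minimal illustration (the small rating matrix below is hypothetical):

library(psych)
# Four scoring events (rows) scored by three raters (columns) on a 3-point scale.
ratings <- matrix(c(1, 2, 2,
                    2, 2, 3,
                    3, 3, 3,
                    1, 1, 2), ncol = 3, byrow = TRUE)
ICC(ratings) # reports ICC1, ICC2, ICC3 and their average-measure counterparts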

Guidelines for Interpreting IRR {.columns-2 }

data("IRRguidelines")
for(i in IRRguidelines[1:7]) {
    cat(paste0("", i$reference, ' (<b>', ifelse(any(is.na(i$metrics)), 'Not specified', paste0(i$metrics, collapse=', ')), "</b>)\n"))
    cat(paste0("<li>&lt; ", i$breaks[2], ' ', names(i$breaks)[1]), '</li>\n')
    if(length(i$breaks) > 2) {
        for(j in seq(2, length(i$breaks) - 1)) {
            cat(paste0("<li>", i$breaks[j], ' - ', i$breaks[j+1], ' ', names(i$breaks)[j]), '</li>\n')
        }
    }
    cat(paste0("<li> &gt;", i$breaks[length(i$breaks)], ' ', names(i$breaks[length(i$breaks)]), '</li>\n'))
    cat('\n\n')
}

Guidelines for Interpreting IRR

# Reshape the published guidelines into a long data frame for plotting.
guidelines <- data.frame()
for(i in IRRguidelines) {
    guidelines <- rbind(guidelines, data.frame(
        Reference = i$reference,
        Description = names(i$breaks),
        Metric = ifelse(any(is.na(i$metrics)), 'Not specified', i$metrics[1]),
        Value = unname(i$breaks),
        stringsAsFactors = FALSE
    ))
}
guidelines$Reference <- gsub('; ', '\n', guidelines$Reference)
guidelines[guidelines$Description == 'Reasonable for clinical measurement',]$Description <- 'Reasonable for\nclinical measurement'
ggplot(guidelines, aes(x = Reference, y = Value, label = Description)) + 
    geom_point(shape = 17, size = 3) + 
    geom_text(vjust = -1.5, size = 3.5) +
    geom_text(data=guidelines[guidelines$Value > 0,],
              aes(label = Value), hjust = -0.7, size = 3.5) +
    facet_wrap(~ Metric, nrow = 1, scales = "free_x") +
    ylim(c(0, 1)) +
    ylab("IRR Value") +
    # ggtitle('Guidelines for Interpreting IRR') +
    theme(axis.text.x = element_text(size = 7))

Guidelines for Education

In education, ICC(1,1) is often the most appropriate intraclass correlation (ICC) metric to use, given that each subject is measured by two randomly selected raters (one-way), single measures are used, and absolute agreement is desired (Shrout & Fleiss, 1979).

Guidelines for interpreting ICC are largely found in the medical literature. Discussing ICC(1,1), Koo and Li (2016) state that "practically, this model is rarely used in clinical reliability analysis because majority of the reliability studies typically involve the same set of raters to measure all subjects" (pp. 156-157).

The guidelines for interpreting ICC may not be appropriate for all forms of ICC. As will be shown, the magnitude of the ICC varies greatly for the same percent agreement.

The IRRsim R Package {.smaller}

The IRRsim package provides functions to simulate various scoring designs.

devtools::install_github('jbryer/IRRsim')
library(IRRsim)

Key functions:

  • simulateRatingMatrix: simulate a single rating matrix.
  • simulateIRR: simulate many rating matrices across a range of percent agreements.
  • agreement: calculate the percent rater agreement of a rating matrix.
  • IRRsim_demo: launch the interactive Shiny application.

simulateRatingMatrix {.smaller}

Simulate a single rating matrix.

For each scoring event (i.e. row within a scoring matrix):

  1. One of k raters is randomly selected.
  2. A score, y, is randomly selected from the response distribution (a uniform distribution by default).
  3. A random number, x, between 0 and 1 is generated.
    • If x is less than the specified desired percent agreement, the remaining values in the row are set to y.
    • Otherwise, scores for the remaining raters are randomly selected from the response distribution.
  4. Repeat steps 1 through 3 for the remaining scoring events.
  5. If $k_m < k$, then $k - k_m$ scores per row are set to NA (missing).
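A minimal sketch of this procedure, independent of the package's actual implementation (the helper name simulate_matrix_sketch is hypothetical; simulateRatingMatrix() is the real function):

simulate_matrix_sketch <- function(nLevels, k, k_per_event, agree, nEvents,
                                   response.probs = rep(1 / nLevels, nLevels)) {
    m <- matrix(NA_integer_, nrow = nEvents, ncol = k)
    for(event in seq_len(nEvents)) {
        rater <- sample(k, 1)                           # step 1: pick one rater
        y <- sample(nLevels, 1, prob = response.probs)  # step 2: draw a score
        m[event, rater] <- y
        if(runif(1) < agree) {                          # step 3: x < agree, so agree...
            m[event, -rater] <- y
        } else {                                        # ...otherwise score independently
            m[event, -rater] <- sample(nLevels, k - 1, replace = TRUE,
                                       prob = response.probs)
        }
        if(k_per_event < k) {                           # step 5: k - k_m scores missing
            m[event, sample(k, k - k_per_event)] <- NA
        }
    }
    m
}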

Simulating Scoring Matrices { .columns-2 .smaller }

set.seed(2112)
test1 <- simulateRatingMatrix(
    nLevels = 3, k = 6, k_per_event = 6,
    agree = 0.6, nEvents = 10)
test1
agreement(test1)
set.seed(2112)
test2 <- simulateRatingMatrix(
    nLevels = 3, k = 6, k_per_event = 2,
    agree = 0.6, nEvents = 10)
test2
agreement(test2)

simulateIRR

Simulates many scoring matrices with varying percent rater agreement. For each scoring matrix, IRR statistics are calculated. This function returns an object of class IRRsim, which has S3 methods defined for plot, summary, and as.data.frame.

For the remainder of the document, we estimate ICC for 6, 9, and 12 raters under 3, 5, and 9 scoring levels.

tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3)
tests.5levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 5)
tests.9levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 9)

Note that if the parameter parallel = TRUE is specified, the function will use multiple cores if available.
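For example, the first simulation above could be rerun using multiple cores:

tests.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2,
                             nLevels = 3, parallel = TRUE)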

{.centered}

plot(tests.3levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.centered}

plot(tests.5levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.centered}

plot(tests.9levels, stat = 'ICC1') + ylim(c(-0.2, 1.0)) + scale_color_brewer(type = 'qual')

{.smaller}

tests.3levels.sum <- summary(tests.3levels, stat = 'ICC1', method = 'quadratic')
summary(tests.3levels.sum$model[[1]]) # k = 6 raters

`r round(summary(tests.3levels.sum$model[[1]])$r.squared * 100)`% of the variance in ICC(1,1) is accounted for by percent agreement!
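The fitted model can also be used to estimate ICC(1,1) at a given percent agreement (assuming, as in the models later in this document, that the predictor is named agreement):

# Predicted ICC(1,1) for k = 6 raters at 70% agreement.
predict(tests.3levels.sum$model[[1]],
        newdata = data.frame(agreement = 0.7))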

Simulating Different Response Distributions {.smaller}

# Uniform response distribution
test1.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.33, .33, .33))
# Lightly skewed response distribution
test4.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.275, .275, .45))
# Moderately skewed response distribution
test2.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.2, .2, .6))
# Highly skewed response distribution
test3.3levels <- simulateIRR(nRaters = c(6, 9, 12), nRatersPerEvent = 2, nLevels = 3, 
                             response.probs = c(.1, .1, .8))
test1.3levels.df <- as.data.frame(test1.3levels)
test2.3levels.df <- as.data.frame(test2.3levels)
test3.3levels.df <- as.data.frame(test3.3levels)
test4.3levels.df <- as.data.frame(test4.3levels)
# Combine the different response distributions
test1.3levels.df$ResponseDist <- 'Uniform'
test2.3levels.df$ResponseDist <- 'Moderately Skewed'
test3.3levels.df$ResponseDist <- 'Highly Skewed'
test4.3levels.df$ResponseDist <- 'Lightly Skewed'
test.3levels.df <- rbind(test1.3levels.df, test2.3levels.df, test3.3levels.df, test4.3levels.df)
[Plots of the simulation results for each response distribution: Uniform Distribution, Lightly Skewed, Moderately Skewed, and Highly Skewed]

{.centered}

guide <- data.frame(
    labels = paste0(names(IRRguidelines[['Cicchetti']]$breaks[-1]), ' (',
                    unname(IRRguidelines[['Cicchetti']]$breaks[-1]), ')'),
    y = unname(IRRguidelines[['Cicchetti']]$breaks[-1]),
    stringsAsFactors = FALSE
)
ggplot(test.3levels.df, aes(x = agreement, y = ICC1, color = ResponseDist)) +
    # Horizontal reference lines at Cicchetti's thresholds (0.4, 0.6, 0.75).
    geom_hline(yintercept = guide$y, alpha = 0.5) +
    geom_text(data = guide, color = 'black', hjust = 0, vjust = -0.5,
              aes(label = labels, y = y, x = 0)) +
    geom_smooth(method = 'loess', se = FALSE, aes(linetype = factor(k))) +
    scale_linetype('n Raters') +
    # scale_color_brewer(palette = 'Set3') +
    xlab('Percent Agreement') +
    ggtitle('ICC(1,1) vs Percent Agreement',
            subtitle = "Cicchetti's Guidelines Provided")

Modeling ICC from Percent Agreement {.columns-2 .smaller}

# data("IRRsimData", package = "IRRsim")
if(file.exists('../data-releases/IRRsimData.rda')) {
    load('../data-releases/IRRsimData.rda')
} else {
    tmpfile <- tempfile()
    download.file('https://github.com/jbryer/IRRsim/releases/download/IRRsimData/IRRsimData.rda', tmpfile)
    load(tmpfile)
}
levels(IRRsimData$nLevels) <- c('Two', 'Three', 'Four', 'Five')
IRRsimData$nLevels <- as.character(IRRsimData$nLevels)
levels(IRRsimData$k) <- c('Two', 'Four', 'Eight', 'Sixteen')
IRRsimData$k <- as.character(IRRsimData$k)

Simulated `r prettyNum(nrow(IRRsimData), big.mark = ',')` scoring matrices (each 100 x $k$) with $k$ between 2 and 16 raters; between 2 and 5 scoring levels; and four response distributions (i.e. uniform, lightly skewed, moderately skewed, and highly skewed).

For each scoring design (i.e. $k$, $k_m$, and number of scoring levels), the following model was estimated:

$$ICC = \beta_2 \cdot PRA^2 + \beta_1 \cdot PRA + \beta_0$$

For ICC(1,1), ICC(2,1), and ICC(3,1), $R^2 \ge 0.80$

For ICC(1,k), ICC(2,k), and ICC(3,k), the relationship is weaker as $k_m$ approaches $k$.

However, when $k_m < \frac{k}{2}$, $R^2 \ge 0.80$.

# For each scoring design, regress each ICC statistic on percent agreement
# (quadratic model) and record the R-squared.
rsquared.df <- data.frame()
for(n in unique(IRRsimData$nLevels)) {
    tmp.n <- IRRsimData[IRRsimData$nLevels == n,]
    for(k in unique(tmp.n$k)) {
        tmp.k <- tmp.n[tmp.n$k == k,]
        for(k_per_event in unique(tmp.k$k_per_event)) {
            tmp <- tmp.k[tmp.k$k_per_event == k_per_event,]
            icc1.out <- lm(ICC1 ~ agreement + I(agreement^2), data = tmp)
            icc2.out <- lm(ICC2 ~ agreement + I(agreement^2), data = tmp)
            icc3.out <- lm(ICC3 ~ agreement + I(agreement^2), data = tmp)
            icc1k.out <- lm(ICC1k ~ agreement + I(agreement^2), data = tmp)
            icc2k.out <- lm(ICC2k ~ agreement + I(agreement^2), data = tmp)
            icc3k.out <- lm(ICC3k ~ agreement + I(agreement^2), data = tmp)
            rsquared.df <- rbind(rsquared.df, data.frame(
                nLevels = n,
                k = k,
                k_per_event = k_per_event,
                ICC1 = summary(icc1.out)$r.squared,
                ICC2 = summary(icc2.out)$r.squared,
                ICC3 = summary(icc3.out)$r.squared,
                ICC1k = summary(icc1k.out)$r.squared,
                ICC2k = summary(icc2k.out)$r.squared,
                ICC3k = summary(icc3k.out)$r.squared
            ))
        }
    }
}
# Summarize the R-squared values across all scoring designs.
r.squared.summary <- data.frame(
    `Mean` = apply(rsquared.df[,-c(1:3)], 2, mean),
    `Median` = apply(rsquared.df[,-c(1:3)], 2, median),
    `Minimum` = apply(rsquared.df[,-c(1:3)], 2, min),
    `Maximum` = apply(rsquared.df[,-c(1:3)], 2, max),
    stringsAsFactors = FALSE
)
knitr::kable(r.squared.summary, digits = 2)

Shiny Application {.smaller}

An interactive Shiny web application was developed to facilitate the simulation of scoring matrices for any particular scoring design.

IRRsim_demo()
![](IRRsimShinyApp.png)

Discussion { .smaller }

  • ICC and percent agreement are highly correlated, with PRA accounting for at least 80% of the variance in ICC, and more than 90% when $k_m = 2$.

  • Skewness in the response distribution does not appear to impact ICC. This may not be desirable: as one score becomes more likely (say, response C of four possible scores), the raters' agreement by chance should also increase. ICC, which is supposed to account for chance agreement, does not do so in this situation (see the sketch after this list).

  • When $k_m = 2$ (and more generally when $k_m < \frac{k}{2}$), there is a substantial penalty in ICC as $k$ increases.

  • The published guidelines for interpreting ICC may not be appropriate for one-way designs (i.e. ICC(1,1) and ICC(1,k)). For many common scoring designs in education (e.g. $k_m = 2$, $k > 20$), achieving "excellent" (Cicchetti, 2001) inter-rater reliability is not possible.

  • The Shiny application is a tool to help researchers interpret their ICC results in relation to the percent agreement achieved.
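To illustrate the chance-agreement point above: for two independent raters, the probability of agreeing by chance is the sum of the squared response probabilities, which increases as the distribution skews (the probabilities below match those used in the simulations):

# Chance agreement for two independent raters scoring 3 levels.
sum(rep(1/3, 3)^2)    # uniform: 0.33
sum(c(.2, .2, .6)^2)  # moderately skewed: 0.44
sum(c(.1, .1, .8)^2)  # highly skewed: 0.66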

Recommendations

Thank You! {.smaller}


Jason Bryer, Ph.D.

Assistant Professor and Associate Director
City University of New York
jason@bryer.org

Website: irrsim.bryer.org

Source Code: github.com/jbryer/IRRsim





DAACS was developed under grant #P116F150077 from the U.S. Department of Education. However, the contents do not necessarily represent the policy of the U.S. Department of Education, and you should not assume endorsement by the Federal Government.


