In iriseekhout/Agree: Agreement and reliability between multiple raters

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = ">"
)
library(dplyr)
library(tidyr)

This document describes the computational background and the use of the icc() function from the Agree package. We developed the icc() functions for this package in connection with a simulation study about sample size requirements for studies on reliability and measurement error @mokkink1 and a methodological paper about how to design and conduct a study on reliability and measurement error @mokkink2.

library(Agree)

Data example

The intraclass correlation coefficient (ICC) is usually obtained for continuous ratings. As an example we can use data from a study by @dikmans2017. These data are based on photographs of breasts of 50 women after breast reconstruction. The photographs are independently scored by five surgeons, the patients, and three mammography nurses. They each rated the quality of the reconstruction on a 5-point ordinal scale with the verbal anchors ‘very dissatisfied’ on the left end and ‘very satisfied’ on the right end. They specifically rated the volume, shape, symmetry, scars and nipple. For the icc examples we can use the sum scores for volume, shape, symmetry, scars and nipple as an overall rating from each rater.

breast_scores <- 
Agree::breast %>%
  dplyr::select(Patient_score, PCH1_score, PCH2_score, PCH3_score, PCH4_score, 
                PCH5_score, Mam1_score, Mam2_score, Mam3_score)

head(breast_scores)

The example data contain missing values. The icc() function can deal with these, because a mixed model is used to estimate the variances from which the ICC is computed.

For a mixed model, the data need to be restructured to a long format. We can use the pivot_longer() function from the tidyr package to do that:

breast_long <- breast_scores %>%
  mutate(id = seq_len(nrow(breast_scores))) %>% # add id column
  pivot_longer(cols = -id, names_to = "rater", values_to = "score")

breast_long

ICC model

The variances that are used to compute the ICC are obtained from a linear mixed model. This model is estimated with the lmer() function from the lme4 package. The model is defined as $Y_{ijr} = \beta_0 + b_{0j} + b_{0r} + \epsilon_{ijr}$, where $b_{0j}$ is the random intercept at the subject level, $b_{0r}$ is the random intercept at the rater/observer level, and $\epsilon_{ijr}$ is the residual error. The R code for the model in lme4 is: lmer(score ~ (1|id) + (1|observer), data, REML = TRUE)

The same model is used to estimate the variance components for each of the three types of ICC: ICC oneway, ICC agreement and ICC consistency. Each ICC is used in a different study design context and has different assumptions. The following variance components can be obtained directly from the model:

$\sigma^2_j$: subject variance
$\sigma^2_r$: rater variance
$\sigma^2_\epsilon$: residual variance
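
To make the link between the model and these three variance components concrete, the sketch below simulates complete ratings from the model with known variances and recovers the components from two-way ANOVA mean squares. This method-of-moments estimator is a stand-in chosen to keep the example in base R. It is not what the package does (the package fits the model with lmer() and REML), and unlike the mixed model it requires complete data. All object names and the chosen variances are illustrative assumptions.

```r
# Simulate n subjects rated by k raters from
# Y = beta0 + b_subject + b_rater + error (all values below are made up).
set.seed(1)
n <- 200; k <- 6
var_subject <- 4; var_rater <- 1; var_resid <- 2

subject_eff <- rnorm(n, sd = sqrt(var_subject))
rater_eff   <- rnorm(k, sd = sqrt(var_rater))
y <- 10 + outer(subject_eff, rater_eff, "+") +
  matrix(rnorm(n * k, sd = sqrt(var_resid)), nrow = n, ncol = k)

# Two-way ANOVA mean squares for a complete n x k matrix
grand <- mean(y)
bms <- k * sum((rowMeans(y) - grand)^2) / (n - 1)   # between subjects
jms <- n * sum((colMeans(y) - grand)^2) / (k - 1)   # between raters
resid <- y - outer(rowMeans(y), colMeans(y), "+") + grand
ems <- sum(resid^2) / ((n - 1) * (k - 1))           # residual

# Method-of-moments estimates of the three variance components
est <- c(subject  = (bms - ems) / k,
         rater    = (jms - ems) / n,
         residual = ems)
round(est, 2)
```

With only six simulated raters the rater variance is estimated far less precisely than the other two components, which is worth keeping in mind when interpreting rater variances from small rater samples.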

# Estimate model for example data
icc_model2(breast_long)

# Extract variance components from the icc model
varcomp <- as.data.frame(lme4::VarCorr(icc_model2(breast_long)))
varcomp[,c(1,4)]

Types of ICC

There are three types of ICC incorporated in the icc() function: the ICC oneway, the ICC agreement and the ICC consistency.

ICC oneway

The ICC type oneway is the variance between the subjects ($\sigma^2_j$) divided by the sum of the subject variance ($\sigma^2_j$) and the residual variance ($\sigma^2_{\epsilon}$). The $ICC_{oneway}$ is computed as follows:

$ICC_{oneway} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}$

The ICC oneway assumes that each subject is rated by a different set of raters, who are randomly selected from a larger population of judges (@shrout1979). The icc_oneway() function uses the icc_model() function to compute the variances. This is an lmer model with a random intercept for the subjects as well as for the raters. The rater variance is not used separately for the ICC oneway. Instead, it is subtracted from the sum of the subject variance over the $k$ raters, which is then averaged, to give the subject variance for the ICC oneway: $\sigma^2_{j,oneway} = \frac{k\sigma^2_j - \sigma^2_r}{k}$. The error variance is computed as the sum of the residual variance and the rater variance from the icc_model: $\sigma^2_{\epsilon,oneway} = \sigma^2_r + \sigma^2_\epsilon$. Accordingly, the rater variance is part of the error variance. These two components take the place of $\sigma^2_j$ and $\sigma^2_\epsilon$ in the formula for $ICC_{oneway}$.

The standard error of measurement ($sem$) is the square root of this error variance (i.e. $sem = \sqrt{\sigma^2_{\epsilon,oneway}}$). The confidence intervals are computed with the exact F method: $F = \frac{k \sigma^2_{j,oneway} + \sigma^2_{\epsilon,oneway}}{\sigma^2_{\epsilon,oneway}}$, with $df1 = n - 1$ and $df2 = n (k - 1)$ (@shrout1979).

# ICC oneway 
icc(breast_long, format = "long", method = "oneway")


# ICC oneway with variance components
icc(breast_long, format = "long", method = "oneway", var = TRUE)
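
The one-way computation described above can be traced by hand. The following is a minimal base R sketch under assumed inputs: the three variance components are made-up numbers, not estimates from the breast data, and the object names are hypothetical. It follows the description in which the rater variance is subtracted from $k$ times the subject variance before averaging, and it uses the exact F interval of Shrout and Fleiss (1979); the internals of icc_oneway() may differ in detail.

```r
# Hypothetical variance components from the two-way model, and design sizes
var_subject <- 4; var_rater <- 1; var_resid <- 2
n <- 50; k <- 9; alpha <- 0.05

# One-way components: the rater variance moves into the error term
var_subject_ow <- (k * var_subject - var_rater) / k
var_error_ow   <- var_rater + var_resid

icc_ow <- var_subject_ow / (var_subject_ow + var_error_ow)
sem_ow <- sqrt(var_error_ow)

# Exact F interval with df1 = n - 1 and df2 = n * (k - 1)
f_obs <- (k * var_subject_ow + var_error_ow) / var_error_ow
f_low <- f_obs / qf(1 - alpha / 2, n - 1, n * (k - 1))
f_upp <- f_obs * qf(1 - alpha / 2, n * (k - 1), n - 1)
ci <- c(lower = (f_low - 1) / (f_low + k - 1),
        upper = (f_upp - 1) / (f_upp + k - 1))

c(icc = icc_ow, sem = sem_ow, ci)
```

Note that the point estimate can be written as $(F - 1) / (F + k - 1)$, which is why the interval is obtained by transforming the bounds on $F$ in the same way.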

ICC agreement

The ICC type agreement is the variance between the subjects ($\sigma^2_j$) divided by the sum of the subject variance ($\sigma^2_j$), the rater variance ($\sigma^2_r$) and the residual variance ($\sigma^2_\epsilon$). The $ICC_{agreement}$ is computed as follows:

$ICC_{agreement} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_r + \sigma^2_{\epsilon}}$

The ICC for agreement generalizes to other raters within a population (@shrout1979). All subjects are rated by the same set of raters, and the rater variance is taken into account in the calculation of the ICC. The variance components are computed with the icc_model() function. This is an lmer model with a random intercept for the subjects and for the raters. The $sem$ is the square root of the sum of the rater variance and the error variance (i.e. $sem = \sqrt{\sigma^2_r + \sigma^2_\epsilon}$). The confidence intervals are approximated to account for the three independent variance components, as defined by @satter1946 & @shrout1979.

# ICC agreement 
icc(breast_long, format = "long", method = "agreement")


# ICC agreement with variance components
icc(breast_long, format = "long", method = "agreement", var = TRUE)
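
To show how the rater variance enters the ICC for agreement and its sem, here is a small base R sketch. The function name icc_agreement_sketch() and all input values are hypothetical and only illustrate the formulas given above; the approximate confidence interval is not reproduced here.

```r
# Point estimate and sem for the ICC agreement from variance components.
# Vectorised over var_rater so several scenarios can be compared at once.
icc_agreement_sketch <- function(var_subject, var_rater, var_resid) {
  data.frame(
    var_rater = var_rater,
    icc = var_subject / (var_subject + var_rater + var_resid),
    sem = sqrt(var_rater + var_resid)
  )
}

# Made-up components: only the rater variance changes between rows
res <- icc_agreement_sketch(var_subject = 4,
                            var_rater = c(0, 0.5, 1, 2),
                            var_resid = 2)
res
```

With the subject and residual variances held fixed, a larger rater variance lowers the ICC and raises the sem.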

ICC consistency

The ICC type consistency is the variance between the subjects ($\sigma^2_j$) divided by the sum of the subject variance ($\sigma^2_j$) and the residual variance ($\sigma^2_\epsilon$). The rater variance is separated from the subject variance and the error variance, but it is not used to calculate the ICC. The rater effect can therefore also be considered a fixed effect. The $ICC_{consistency}$ is computed as follows:

$ICC_{consistency} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}$

The ICC for consistency generalizes only to the set of raters in the data (@shrout1979). The icc_model() function is used to compute the variances. This is an lmer model with a random intercept for the subjects as well as for the raters. The $sem$ is the square root of the error variance, ignoring the variance between raters (i.e. $sem = \sqrt{\sigma^2_\epsilon}$). The confidence intervals are computed with the exact F method: $F = \frac{k \sigma^2_j + \sigma^2_\epsilon}{\sigma^2_\epsilon}$, with $df1 = n - 1$ and $df2 = (n - 1) (k - 1)$ (@shrout1979).

# ICC consistency 
icc(breast_long, format = "long", method = "consistency")


# ICC consistency with variance components
icc(breast_long, format = "long", method = "consistency", var = TRUE)
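
The consistency computation can be sketched in base R in the same spirit. The helper exact_f_ci() and the numbers below are hypothetical and serve only to illustrate the exact F interval with $df2 = (n - 1)(k - 1)$; they are not part of the Agree package.

```r
# Exact F interval for an ICC of the form (F - 1) / (F + k - 1)
exact_f_ci <- function(f_obs, df1, df2, k, alpha = 0.05) {
  f_low <- f_obs / qf(1 - alpha / 2, df1, df2)
  f_upp <- f_obs * qf(1 - alpha / 2, df2, df1)
  c(lower = (f_low - 1) / (f_low + k - 1),
    upper = (f_upp - 1) / (f_upp + k - 1))
}

# Made-up components; the rater variance is not needed for consistency
var_subject <- 4; var_resid <- 2
n <- 50; k <- 9

icc_cons <- var_subject / (var_subject + var_resid)
sem_cons <- sqrt(var_resid)
f_obs    <- (k * var_subject + var_resid) / var_resid
ci_cons  <- exact_f_ci(f_obs, df1 = n - 1, df2 = (n - 1) * (k - 1), k = k)

c(icc = icc_cons, sem = sem_cons, ci_cons)
```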

Comparing ICC methods

The differences in computation between the ICC methods can quickly be seen in the variance components. We can obtain the variances by using var = TRUE in the icc() function. In the output, varr shows the variance between the raters. Only the ICC agreement estimates this separately.

# ICC for all methods
icc(breast_long, format = "long", var = TRUE)

When we estimate the ICC for the surgeons only, we can see that the variance at the rater level is decreased. This effect is directly shown in the ICC.

In the icc() function we can also use the data in wide format and use the cols option to define the rater columns that we want to use.

# ICC for all methods, surgeons only
icc(breast_scores, format = "wide", 
    cols = c("PCH1_score", "PCH2_score", "PCH3_score", "PCH4_score", 
             "PCH5_score"), var = TRUE)

When we estimate the ICC for the mammography nurses only, we see that the variance at the rater level is increased. This effect is directly shown in the ICC.

# ICC for all methods, mammography nurses only
icc(breast_scores, format = "wide", cols = c("Mam1_score", "Mam2_score", "Mam3_score"), var = TRUE)
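
As a worked illustration of why a change in rater variance shows up in the ICC, take made-up variance components (not the estimates from the breast data): $\sigma^2_j = 4$ and $\sigma^2_\epsilon = 2$. Lowering the rater variance from $1$ to $0.25$ raises the ICC for agreement, while the ICC for consistency does not depend on the rater variance at all:

```latex
\begin{aligned}
ICC_{agreement}\,(\sigma^2_r = 1) &= \frac{4}{4 + 1 + 2} \approx 0.57 \\
ICC_{agreement}\,(\sigma^2_r = 0.25) &= \frac{4}{4 + 0.25 + 2} = 0.64 \\
ICC_{consistency} &= \frac{4}{4 + 2} \approx 0.67
\end{aligned}
```

In real data, restricting the analysis to a subset of raters also changes the estimated subject and residual variances, so the comparison is rarely this clean.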

Overview technical terms

|Term |Description|
|-----|------------------------------------|
|$\beta_0$|Fixed intercept|
|$b_{0j}$|Random intercept at subject level|
|$b_{0r}$|Random intercept at rater level|
|$\epsilon_{ijr}$|Residual error|
|$\sigma^2_j$|Variance between subjects|
|$\sigma^2_{j,oneway}$|Variance between subjects without considering rater variance (ICC oneway)|
|$\sigma^2_r$|Variance between raters|
|$\sigma^2_\epsilon$|Residual error variance|
|$\sigma^2_{\epsilon,oneway}$|Residual error variance without considering rater variance separately (ICC oneway)|
|$k$|Number of raters/observers|
|$n$|Number of subjects|