r "Column {data-width=600}"
r "Measurement {.no-title}"
The ICC is the intraclass correlation coefficient. It provides a way to measure the correlation of data within measurement units (classes). For interrater reliability checks, it estimates how similar raters' scores are within each measurement unit. See the interpretation section below for a longer description of the ICC.
The first decision is whether any raters evaluate multiple participants:

- no (one-way model)
- yes (two-way model)

In the one-way model, the data are repeated or grouped in just one way (participants who have multiple ratings). @hallgren2012 elaborates:
In the two-way case, where some raters evaluated multiple participants, there is another decision to make: whether the estimated reliability should generalize to new raters (two-way random) or whether the raters should be treated as fixed (two-way mixed). This app does not support the latter option, but it is supported by the `psych::ICC()` function.
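For reference, here is a minimal sketch of that function in use. The tiny `ratings` data frame is a hypothetical placeholder with one row per participant and one column per rater; `psych::ICC()` reports all six Shrout and Fleiss ICC variants at once, including both the random-raters and fixed-raters versions.

```r
# Minimal sketch, assuming the psych package is installed.
# Hypothetical data: 5 participants each scored by the same 2 raters.
ratings <- data.frame(
  rater_1 = c(4, 3, 5, 2, 4),
  rater_2 = c(4, 4, 5, 2, 3)
)

# The printed table includes "random raters" rows (two-way random, the model
# this app uses) and "fixed raters" rows (two-way mixed, which it does not).
psych::ICC(ratings)
```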
In general, if a new person (a new student, clinician, etc.) could be trained to perform the rating task, the two-way random model is a reasonable default. See how Koo and Li [-@koo2016] emphasize generalizability for two-way random models:
The second decision is whether the estimated reliability describes a single rating or the average of multiple ratings:

- single rating
- average rating

Yoder and Symons [-@2010observational] spell out the difference:
Note that changing from single-rating to average-rating reliability means that we cannot talk about single scores being reliable; we only know about the reliability of the averages. Shrout and Fleiss [-@icc1979], in their landmark survey of ICC types, illustrate this point:
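One way to see what is gained by averaging: under the usual ANOVA assumptions, the ICC for the average of $k$ ratings follows from the single-rating ICC through the Spearman-Brown formula,

$$
\text{ICC}_{\text{average}} =
\frac{k \cdot \text{ICC}_{\text{single}}}{1 + (k - 1)\,\text{ICC}_{\text{single}}},
$$

so an average of several ratings is always at least as reliable as a single rating. That boost only applies to scores that are actually averages of $k$ ratings, which is why the choice has to match how the scores will be used.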
The third decision is whether ratings need to match in absolute terms or only in how they rank the participants:

- absolute agreement
- consistency

If you chose no (one-way model) above, only absolute agreement is available.

Suppose you have two teachers, and one of them has a reputation for being a hard grader. They both grade the same five students. The hard grader gives scores of 60%, 68%, 78%, 80%, 85%, and the other teacher gives scores of 80%, 88%, 98%, 100%, 100%. They give the students the same rankings, but they differ in their average score. Because the teachers each rate more than one student, this is a two-way model, and because in most contexts each student receives only one grade on an assignment, we want to know single-score reliability. The consistency-based score is ICC(C,1) = .97, so the teachers are almost perfectly reliable at ranking the students. The absolute-agreement-based score is ICC(A,1) = .32, so it is very difficult to compare ratings across the two teachers. To interpret a score, you would need to know which teacher gave it, or you would want the teachers to grade on a curve (i.e., renormalize the scores to remove each teacher's average).
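As a hedged check on those numbers, here is how they could be reproduced with `psych::ICC()`; in that function's output, the row labeled ICC3 corresponds to the consistency-based ICC(C,1) and ICC2 to the absolute-agreement ICC(A,1).

```r
# Minimal sketch, assuming the psych package is installed.
# The five students' grades from the example, one column per teacher.
grades <- data.frame(
  hard_grader  = c(60, 68, 78, 80, 85),
  other_grader = c(80, 88, 98, 100, 100)
)

# Expect ICC3 (single rating, consistency) to be about .97 and
# ICC2 (single rating, absolute agreement) to be about .32.
psych::ICC(grades)
```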
In the one-way model, where every rating is done by a unique rater, there is no way to assess whether one rater is a consistently harder scorer than another. The only differences available are the absolute differences between ratings. Therefore, only absolute agreement is available for one-way models.
Intelligibility ratings. We have naive listeners transcribe recordings of children. Each child is transcribed by two unique listeners; each listener hears only one child. We combine these transcriptions into a single score. This situation requires a one-way, agreement-based, average-rating ICC.
Language sample coding. We have two students in the lab transcribe interactions between a parent and a child. We want to know whether the word counts or utterance counts are similar between transcribers. As a reliability check, both students transcribe the same subset of the data, but the eventual analysis on the larger dataset will use just one transcription per child. This situation requires a two-way, agreement-based, single-rating ICC.
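To make those two recipes concrete, here is a hedged sketch using the `irr` package (not part of this app), whose `icc()` function takes the three decisions as arguments. Both data sets below are small hypothetical placeholders with one row per child.

```r
# Minimal sketch, assuming the irr package is installed.

# Intelligibility example: listeners are unique to each child, so the columns
# are just the first and second transcription score; one-way model, average
# rating. (Only absolute agreement exists for the one-way model.)
intelligibility <- data.frame(
  transcription_1 = c(52, 61, 70, 48, 66),
  transcription_2 = c(55, 59, 74, 50, 63)
)
irr::icc(intelligibility, model = "oneway", unit = "average")

# Language sample coding example: the same two transcribers code every child
# in the reliability subset (two-way), and the eventual analysis uses one
# transcription per child, so we want single-rating, absolute agreement.
word_counts <- data.frame(
  transcriber_1 = c(210, 180, 340, 265, 150),
  transcriber_2 = c(205, 192, 330, 270, 160)
)
irr::icc(word_counts, model = "twoway", type = "agreement", unit = "single")
```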
Generally speaking, for an interrater reliability "check" situation, where multiple raters score a subset of the overall data but most of the data was scored by just one rater, use the single rating option.
The ICC is the intraclass correlation coefficient. It provides a way to measure the correlation of data within measurement units (classes). For example, suppose we give the same assessment to the same 10 children on three occasions. In general, we would want the scores to be correlated within each child, so that a child attains a similar score on each occasion. The ICC estimates this correlation.
Now, let's change the example from children tested three times to children who visit a research lab once but have their data scored by three different raters (or judges or coders). Then the ICC would measure how similar the scores are within children. If scores are very similar within children, then the differences between judges are small and the judges have high agreement with each other. This is how ICC works as a measure of interrater reliability.
The ICC shows up frequently in the literature on multilevel or repeated measurement data. Think of children nested in different classrooms or experimental trials nested in a participant. I mention this context because that's the frame of reference for the texts I quote from.
Snijders and Bosker [-@1999multilevel], thinking about individuals nested in groups, provide two interpretations:
In general, the second interpretation is the more common one, and most definitions of ICC talk about the variation between groups versus the variation within groups. So, when Snijders and Bosker say "fraction of total variability", they have a fraction like the following in mind:
r "$$"
\
\frac{\text{between-group variation}}{\text{total variation}} =
\frac{\text{between-group variation}}{\text{between-group variation} +
\text{within-group variation}}
r "$$"
\
In the context of interrater reliability, the groups are the participants who are being rated. Between-group variation is how the participants differ from each other, and within-group variation is how the ratings differ within the participants (i.e., between raters).
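As a rough sketch of that fraction in code, one could fit a random-intercept model and take the ratio of the estimated variance components. The long-format data frame below is a hypothetical placeholder with one row per rating, and this simple ratio ignores the between-rater variation discussed next.

```r
# Minimal sketch, assuming the lme4 package is installed.
# Hypothetical long-format data: each participant is rated twice.
ratings_long <- data.frame(
  participant = rep(paste0("p", 1:5), each = 2),
  score       = c(4, 4, 3, 4, 5, 5, 2, 2, 4, 3)
)

# Random intercept for participant: participants are the "groups".
fit <- lme4::lmer(score ~ 1 + (1 | participant), data = ratings_long)

# Between-participant and within-participant (residual) variance components.
vc      <- as.data.frame(lme4::VarCorr(fit))
between <- vc$vcov[vc$grp == "participant"]
within  <- vc$vcov[vc$grp == "Residual"]

# ICC: the fraction of total variation that is between participants.
between / (between + within)
```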
Technically, the actual fractions to compute ICC scores are more involved than the one above, as they account for between-rater variation. Still, Yoder and Symons [-@2010observational] support the ICC-as-a-proportion interpretation for interrater reliability:
Kreft and De Leeuw [-@kreft1998introducing], thinking about individuals nested in groups, do a good job explaining what it means for between-group variation to be low compared to within-group variation; that is, what a low ICC means:
If your reliability check is showing low ICC scores, the differences between the judges' ratings are so large that they might as well be comparing ratings of different participants.
r "Column {data-width=300}"
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.