```{r}
library(tidyverse)
library(flexdashboard)
library(shiny)
library(irr)
library(printy)
library(iccbot)

# Default dataset from Shrout and Fleiss.
d <- example_shrout_fleiss()

getData <- function() {
  if (is.null(input$file1)) {
    d
  } else {
    read.csv(input$file1$datapath)
  }
}

runICC <- function() {
  req(input$use_twoway_model)
  req(input$single_or_average)
  req(input$agreement_or_consistency)

  model <- ifelse(
    input$use_twoway_model == "yes (two-way model)",
    "twoway",
    "oneway"
  )

  unit <- ifelse(
    input$single_or_average == "single rating",
    "single",
    "average"
  )

  type <- ifelse(
    input$agreement_or_consistency == "absolute agreement",
    "agreement",
    "consistency"
  )

  # Fall back to the lme4 engine when any ratings are missing.
  missing_data <- anyNA(getData())
  engine <- ifelse(missing_data, "lme4", "irr")

  add_formatted_results_to_icc(
    run_icc(
      getData(),
      model = model,
      unit = unit,
      type = type,
      engine = engine
    )
  )
}

lme4_span <- HTML(
  "<span class=\"citation\">Bates, Mächler, Bolker, & Walker, 2015</span>"
)
irr_span <- HTML(
  "<span class=\"citation\">Gamer, Lemon, Fellows, & Singh, 2019</span>"
)

i <- reactive(list(
  subjects = runICC()[["subjects"]],
  n_trials = runICC()[["raters"]],
  unit_p = runICC()[["unit_p"]],
  unit_p2 = runICC()[["unit_p2"]],
  type_p = runICC()[["type_p"]],
  model_p = runICC()[["model_p"]],
  icc.name = runICC()[["icc.name"]],
  value = runICC()[["value"]],
  lbound_p = runICC()[["lbound_p"]],
  ubound_p = runICC()[["ubound_p"]],
  raters_p = runICC()[["raters_p"]],
  rater_participant_counts_p = runICC()[["rater_participant_counts_p"]],
  engine = runICC()[["engine"]],
  n_ratings = runICC()[["n_ratings"]],
  n_ratings_missing = runICC()[["n_ratings_missing"]],
  min_ratings_per_participant = runICC()[["min_ratings_per_participant"]],
  max_ratings_per_participant = runICC()[["max_ratings_per_participant"]],
  min_participants_per_rater = runICC()[["min_participants_per_rater"]],
  max_participants_per_rater = runICC()[["max_participants_per_rater"]],
  citation = if (runICC()[["engine"]] == "lme4") {
    glue::glue('the R packages lme4 (vers. {packageVersion("lme4")}; {lme4_span}) and irr (vers. {packageVersion("irr")}; {irr_span})')
  } else {
    glue::glue('the irr R package (vers. {packageVersion("irr")}; {irr_span})')
  }
))
```
Upload a CSV file of scores with one column per rater and no other columns.
```{r}
fileInput(
  "file1",
  "Choose CSV File",
  accept = c(
    "text/csv",
    "text/comma-separated-values,text/plain",
    ".csv"
  )
)

selectInput(
  "use_twoway_model",
  label = "Do raters evaluate more than one participant?",
  choices = c("yes (two-way model)", "no (one-way model)"),
  multiple = FALSE
)

selectInput(
  "single_or_average",
  label = "Do you want the reliability for a single rating or the reliability for the average rating?",
  choices = c("single rating", "average rating"),
  multiple = FALSE
)

renderUI({
  req(input$use_twoway_model)

  # Consistency is only defined for two-way models.
  choices <- if (input$use_twoway_model == "no (one-way model)") {
    c("absolute agreement")
  } else {
    c("absolute agreement", "consistency")
  }

  label <- if (input$use_twoway_model == "no (one-way model)") {
    HTML("<s>Do you want to estimate agreement of raters or consistency of raters?</s> Consistency is not available for one-way models.")
  } else {
    "Do you want to estimate agreement of raters or consistency of raters?"
  }

  selectInput(
    "agreement_or_consistency",
    label = label,
    choices = choices,
    multiple = FALSE
  )
})
```
Developed by TJ Mahr for The WISC Lab.
We calculated the interrater reliability of [instrument name] with the
intraclass correlation coefficient (ICC) estimated using
`r renderUI(HTML(i()[["citation"]]))`.

`r renderUI(HTML(i()[["rater_participant_counts_p"]]))`

We used `r renderText(i()[["unit_p"]])`, `r renderText(i()[["type_p"]])`,
`r renderText(i()[["model_p"]])` random effects model, and we found
[interpret the correlation] agreement among `r renderText(i()[["unit_p2"]])`,
`r renderText(i()[["icc.name"]])` = `r renderText(i()[["value"]])`,
95% CI = [`r renderText(i()[["lbound_p"]])`, `r renderText(i()[["ubound_p"]])`].
```{r}
renderUI(
  if (i()[["engine"]] == "lme4") {
    HTML(
      '<div id="section-refs">
        <div id="section-ref-lme4">
          <p>Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015).
          Fitting linear mixed-effects models using lme4.
          <em>Journal of Statistical Software</em>, <em>67</em>(1), 1–48.
          <a href="https://doi.org/10.18637/jss.v067.i01">https://doi.org/10.18637/jss.v067.i01</a></p>
        </div>
        <div id="section-ref-R-irr">
          <p>Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2019).
          <em>irr: Various coefficients of interrater reliability and agreement</em>.
          Retrieved from
          <a href="https://CRAN.R-project.org/package=irr">https://CRAN.R-project.org/package=irr</a></p>
        </div>
      </div>'
    )
  } else {
    HTML(
      '<div id="section-refs">
        <div id="section-ref-R-irr">
          <p>Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2019).
          <em>irr: Various coefficients of interrater reliability and agreement</em>.
          Retrieved from
          <a href="https://CRAN.R-project.org/package=irr">https://CRAN.R-project.org/package=irr</a></p>
        </div>
      </div>'
    )
  }
)
```
```{r}
renderPrint(runICC())

renderPrint(cat(
  "ICCBot details", "\n",
  "iccbot version: ", format(utils::packageVersion("iccbot")), "\n",
  "Variance computed by package: ", i()[["engine"]], "\n",
  "N ratings: ", i()[["n_ratings"]], "\n",
  "N missing ratings: ", i()[["n_ratings_missing"]],
  sep = ""
))
```
```{r}
renderPrint(head(getData()))
```
Column {data-width=600}
-----------------------------------------------------------------------

### Measurement {.no-title}
The ICC is the intraclass correlation coefficient. It provides a way to measure the correlation of data within measurement units (classes). For interrater reliability checks, it estimates how similar raters’ scores are within each measurement unit. See the interpretation section below for a longer description of the ICC.
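Outside of this app, the same calculation can be run directly with `irr::icc()`; a minimal sketch using the Shrout and Fleiss (1979) example data that the app loads by default (6 participants, 4 judges):

```r
library(irr)

# Shrout & Fleiss (1979) example: rows are participants, columns are judges.
ratings <- data.frame(
  judge1 = c(9, 6, 8, 7, 10, 6),
  judge2 = c(2, 1, 4, 1, 5, 2),
  judge3 = c(5, 3, 6, 2, 6, 4),
  judge4 = c(8, 2, 8, 6, 9, 7)
)

# Two-way model, absolute agreement, single rating: ICC(A,1).
result <- icc(ratings, model = "twoway", type = "agreement", unit = "single")
round(result$value, 2)
# ≈ 0.29 for these data
```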
### no (one-way model)

### yes (two-way model)

In the one-way model, the data are repeated or grouped in just one way (participants who have multiple ratings). Hallgren (2012) elaborates:
In the two-way case, where some raters evaluated multiple participants,
there is another decision one can make: whether the estimated
reliability should generalize to new raters (two-way random) or raters
should be treated as fixed (two-way mixed). This app does not support
the latter option, but it is supported by the `psych::ICC()` function.
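As a hedged sketch of that alternative route, `psych::ICC()` reports both estimates side by side on the same data. The ratings below are the Shrout and Fleiss (1979) example that this app loads by default (6 participants rated by 4 judges); in psych's labels, ICC2 is the two-way random estimate (generalizes to new raters) and ICC3 is the two-way mixed estimate (raters treated as fixed).

```r
library(psych)

# Shrout & Fleiss (1979) example: rows are participants, columns are judges.
ratings <- data.frame(
  judge1 = c(9, 6, 8, 7, 10, 6),
  judge2 = c(2, 1, 4, 1, 5, 2),
  judge3 = c(5, 3, 6, 2, 6, 4),
  judge4 = c(8, 2, 8, 6, 9, 7)
)

res <- psych::ICC(ratings, lmer = FALSE)

# ICC2: single rating, two-way random; ICC3: single rating, two-way mixed.
res$results[res$results$type %in% c("ICC2", "ICC3"), c("type", "ICC")]
# ICC2 ≈ .29, ICC3 ≈ .71 for these data
```

Treating raters as fixed removes the rater variance from the denominator, which is why ICC3 is larger than ICC2 here.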
In general, if a new person—a new student, clinician, etc.—could be trained to perform the rating task, the two-way random is a reasonable default. See how Koo and Li (2016) emphasize generalizability for two-way random models:
### single rating

### average rating

Yoder and Symons (2010) spell out the difference:
Note that changing from single-rating to average-rating reliability means that we cannot talk about single scores being reliable—we only know about the reliability of the averages. Shrout and Fleiss (1979), in their landmark survey of ICC types, illustrate this point:
### absolute agreement

### consistency

With "no (one-way model)", only "absolute agreement" is available.

Suppose you have two teachers, and one of them has a reputation for being a hard grader. They both grade the same five students. The hard grader gives scores of 60%, 68%, 78%, 80%, 85%, and the other teacher gives scores of 80%, 88%, 98%, 100%, 100%. They give the students the same rankings, but they differ in their average score. Because the teachers each rate more than one student, this is a two-way model, and because in most contexts each student receives only one grade on an assignment, we want to know single-score reliability. The consistency-based score is ICC(C,1) = .97, so the teachers are almost perfectly reliable at ranking the students. The absolute-agreement-based score is ICC(A,1) = .32, so it is very difficult to compare ratings between the teachers. To interpret a score, you would need to know who the teacher was, or you would want the teachers to grade on a curve (i.e., renormalize the scores to remove each teacher's average).
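The teacher example can be reproduced with `irr::icc()`; a sketch using the grades from the paragraph above (the column names are made up for illustration):

```r
library(irr)

# The two teachers' grades for the same five students.
grades <- data.frame(
  hard_grader = c(60, 68, 78, 80, 85),
  easy_grader = c(80, 88, 98, 100, 100)
)

# Consistency: do the teachers rank the students the same way?
icc_c <- icc(grades, model = "twoway", type = "consistency", unit = "single")

# Absolute agreement: are the actual scores interchangeable?
icc_a <- icc(grades, model = "twoway", type = "agreement", unit = "single")

round(c(consistency = icc_c$value, agreement = icc_a$value), 2)
# consistency ≈ .97, agreement ≈ .32
```

The near-constant 20-point gap between the teachers barely hurts consistency but dominates the absolute-agreement denominator.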
In the one-way model, where every rating is done by a unique rater, there is no way to assess whether one rater is consistently a harder scorer than another. Ratings can only differ from each other in absolute terms. Therefore, only absolute agreement is available for one-way models.
Intelligibility ratings. We have naive listeners transcribe recordings
of children. Each child is transcribed by two unique listeners; each
listener only ever hears one child. We combine these transcriptions into
a single score. This situation requires a
one-way, agreement-based, average rating
ICC.
Language sample coding. We have two students in the lab transcribe
interactions between a parent and child. We want to know whether the
word counts or utterance counts are similar between transcribers. As a
reliability check, both students transcribe the same subset of data, but
the eventual analysis on the larger data will use just one transcription
per child. This situation requires a
two-way, agreement-based, single rating
ICC.
Generally speaking, for an interrater reliability “check”
situation—where multiple raters score a subset of the overall data but
most of the data was scored by just one rater—use single rating
.
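The two scenarios above map onto `irr::icc()` arguments as follows; a sketch with made-up scores, using the one-column-per-rater layout described earlier:

```r
library(irr)

# Hypothetical scores: one row per child, one column per rater slot.
scores <- data.frame(
  rater1 = c(71, 80, 62, 90, 75),
  rater2 = c(73, 78, 66, 88, 74)
)

# Intelligibility ratings: unique listeners per child, averaged scores.
icc_oneway_avg <- icc(scores, model = "oneway", type = "agreement", unit = "average")

# Language sample coding: same transcribers, one transcription used later.
icc_twoway_single <- icc(scores, model = "twoway", type = "agreement", unit = "single")

c(icc_oneway_avg$value, icc_twoway_single$value)
```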
The ICC is the intraclass correlation coefficient. It provides a way to measure the correlation of data within measurement units (classes). For example, suppose we give the same assessment to the same 10 children on three occasions. In general, we would want the scores to be correlated within each child, so that a child attains a similar score on each occasion. The ICC estimates this correlation.
Now, let’s change the example from children tested three times to children who visit a research lab once but have their data scored by three different raters (or judges or coders). Then the ICC would measure how similar the scores are within children. If scores are very similar within children, then the differences between judges are small and the judges have high agreement with each other. This is how ICC works as a measure of interrater reliability.
The ICC shows up frequently in the literature on multilevel or repeated measurement data. Think of children nested in different classrooms or experimental trials nested in a participant. I mention this context because that’s the frame of reference for the texts I quote from.
Snijders and Bosker (1999), thinking about individuals nested in groups, provide two interpretations:
In general, the second interpretation is the more common one, and most definitions of ICC talk about the variation between groups versus the variation within groups. So, when Snijders and Bosker say “fraction of total variability,” they have a fraction like the following in mind:
$$
\frac{\text{between-group variation}}{\text{total variation}} =
\frac{\text{between-group variation}}{\text{between-group variation}
+ \text{within-group variation}}
$$
In the context of interrater reliability, the groups are the participants who are being rated. Between-group variation is how the participants differ from each other, and within-group variation is how the ratings differ within the participants (i.e., between raters).
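With ratings in long format, that fraction can be computed directly from the variance components of an intercept-only mixed model fit with lme4. A sketch with hypothetical data (the scores and names here are made up for illustration):

```r
library(lme4)

# Hypothetical long-format ratings: two ratings per participant.
d <- data.frame(
  participant = factor(rep(1:5, each = 2)),
  score = c(70, 72, 80, 79, 60, 65, 90, 88, 75, 74)
)

# Random intercept per participant separates the two variance components.
m <- lmer(score ~ 1 + (1 | participant), data = d)
vc <- as.data.frame(VarCorr(m))

between <- vc$vcov[vc$grp == "participant"]
within  <- vc$vcov[vc$grp == "Residual"]

# Fraction of total variation that is between participants.
icc_hat <- between / (between + within)
icc_hat
```

Here the ratings within each participant are close together while the participants differ a lot, so the fraction comes out near 1.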
Technically, the actual fractions to compute ICC scores are more involved than the one above, as they account for between-rater variation. Still, Yoder and Symons (2010) support the ICC-as-a-proportion interpretation for interrater reliability:
Kreft and De Leeuw (1998), thinking about individuals nested in groups, do a good job explaining what it means for between-group variation to be low compared to within-group variation; that is, what a low ICC means:
If your reliability check is showing low ICC scores, the differences between the judges' ratings are so large that they might as well be comparing their ratings for different participants.
Column {data-width=300}
-----------------------------------------------------------------------