In jmgirard/agreement: What the Package Does (One Line, Title Case)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

#install.packages("devtools")
#devtools::install_github("jmgirard/agreement")
library(agreement)
set.seed(2021)

Simulate some data of the proposed type (i.e., four raters assigning 100 objects to 4 categories ${\text{NA}, 0, 1, 2}$). We will simulate it in wide format first where each object is a row and each rater is a column.

categories <- c("NA", "0", "1", "2")
wide_data <- 
  data.frame(
    r1 = sample(categories, size = 100, replace = TRUE),
    r2 = sample(categories, size = 100, replace = TRUE),
    r3 = sample(categories, size = 100, replace = TRUE),
    r4 = sample(categories, size = 100, replace = TRUE)
  )
head(wide_data)

We can now reshape the data to long format in order to give the {agreement} package what it expects.

long_data <- agreement::wide_to_long(wide_data)
long_data

We can now create our custom weight matrix with identity weights for the NA category and linear weights for the 0, 1, and 2 categories.

w <- matrix(
  data = c(1.0, 0.0, 0.0, 0.0,
           0.0, 1.0, 0.5, 0.0,
           0.0, 0.5, 1.0, 0.5,
           0.0, 0.0, 0.5, 1.0),
  nrow = 4,
  ncol = 4
)
dimnames(w) <- list(categories, categories)
w

Now we can calculate all chance-adjusted indexes of categorical agreement using these custom weights. For now, we can turn off bootstrapping by setting bootstrap = 0.

agreement::cat_adjusted(
  long_data,
  categories = categories,
  weighting = "custom",
  custom_weights = w,
  bootstrap = 0
)

Here, we see that observed agreement was around 0.37 and the amount of expected chance agreement was around 0.37-0.38. Given that observed agreement was lower than expected chance agreement, chance-adjusted agreement was very low for all metrics (e.g., around -0.01). This makes sense that agreement was poor because the data was randomly generated above.

Compare this with using identity weights (treating the four categories as nominal):

agreement::calc_weights("identity", categories)

agreement::cat_adjusted(
  long_data,
  categories = categories,
  weighting = "identity",
  bootstrap = 0
)

The observed and expected agreement are both lower because no partial agreement is being awarded for getting close on the numerical categories. However, their proportions are similar so the adjusted indexes are similar.