Suppose N, M ∈ ℕ with N ≫ M. Since we have N documents and only M reviewers, it is not always possible for every reviewer to rate each document. This potentially introduces unfairness into the competition between documents: some reviewers are stricter than others, so some documents may be rated only by very strict reviewers while others get only graceful ones.
That's a good and critical question, and potentially debatable. We decided to make the following assumption:
For a fixed document D, the average of its rating points converges to an "ideal balanced value" as the number of reviewers goes to infinity. If a single reviewer has rated the document below this ideal limit, she is said to be strict; if she has rated it above this limit, she is considered graceful.
We use basic probability theory to determine the expected value of an arbitrary rating and of a rating by a specific reviewer, and we compute the ratio of these values for each reviewer. If the reviewer is rather strict, this factor is above 1.0; if she is perfectly on par with the "global" distribution, the factor is 1.0; and if she is rather graceful, it is below 1.0. We then multiply the points of each document by these factors to balance out the strictness.
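The factor-based correction described above can be sketched in a few lines of base R. The matrix below is made up for illustration, and the package's actual implementation may differ in its details:

```r
# Toy rating matrix: rows are documents, columns are reviewers,
# NA means the reviewer did not rate that document.
ratings <- matrix(
  c(10, 10, NA,  3,  8,
    NA, 10,  3,  7,  7,
    10, 10,  3, NA,  9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Arthur", "Aisha", "Anna"),
                  c("Roy", "Rose", "Robin", "Ricarda", "Ryan"))
)

global_mean    <- mean(ratings, na.rm = TRUE)      # E[arbitrary rating]
reviewer_means <- colMeans(ratings, na.rm = TRUE)  # E[rating | reviewer]
factors        <- global_mean / reviewer_means     # > 1 for strict reviewers

# Scale each rating by its reviewer's factor, then average per document.
corrected <- sweep(ratings, 2, factors, `*`)
rowMeans(corrected, na.rm = TRUE)
```

Note that a strict reviewer's low ratings are inflated (factor above 1), while a graceful reviewer's high ratings are deflated, which is exactly the balancing effect intended.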
The answer: I started dabbling with the problem before I knew about the Standard Score. As I encountered the world of statistics, I realized that the standard score would be the better approach, since it takes the variance into account. Nevertheless, I kept this as a nice little experiment.
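For comparison, a standard-score correction as alluded to above would standardize each reviewer's column instead of scaling by a ratio of means. A minimal sketch on illustrative data (this is not part of the package):

```r
# Toy rating matrix without missing values, for simplicity.
ratings <- matrix(
  c(10, 10, 3, 8,
     9, 10, 4, 7,
    10,  9, 3, 9),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("Roy", "Rose", "Robin", "Ryan"))
)

# Per column: (rating - reviewer mean) / reviewer standard deviation.
z <- scale(ratings)

# Variance-aware score per document: average of standardized ratings.
rowMeans(z)
```

Unlike the ratio-of-means factor, the z-score also normalizes each reviewer's spread, so a reviewer who uses the whole scale and one who clusters all ratings around a single value become comparable.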
The repo itself is shaped as an R package. See, e.g., the R script getting_started.R for an introductory tour.
devtools::install_github("neumanrq/fairreviewers")
library(fairreviewers)
library(here)
# The rating dataset:
#
# Each column contains the ratings
# of a reviewer.
#
# Each row describes the ratings that
# a single applicant got.
data.env <- randomDataset(20, 5)
head(data.env$ratings)
# Output looks like this:
#
# Roy Rose Robin Ricarda Ryan
# Arthur 10 10 NA 3 8
# Aisha NA 10 3 7 7
# Anna 10 10 3 NA 9
# Aisha 10 9 4 2 8
# Ashley 9 9 4 4 NA
# Ally NA 10 6 2 10
# …
# (Note that duplicated names can occur!)
# Let's start the analysis!
review <- fairreviewers::init(data.env$ratings)
strictness <- review$strictness # contains strictness factors for each reviewer
print(strictness)
# Output looks like this
#
# Arithmetic Strictness Expected Strictness
# Roy 0.7541836 0.7540541
# Rose 0.7697874 0.7696552
# Robin 1.7717328 1.7714286
# Ricarda 1.5945595 1.5942857
# Ryan 0.8455997 0.8454545
#
# Interpretation: Robin and Ricarda are quite strict
# reviewers, whereas Roy is a quite graceful one
result <- review$result
print(result)
# Roy Rose Robin Ricarda Ryan Mean Rating after AS correction Rating after ES correction
# Arthur 10 10 NA 3 8 7.75 6.70 6.70
# Aisha NA 10 3 7 7 6.75 7.52 7.52
# Anna 10 10 3 NA 9 8.00 7.04 7.04
# Aisha 10 9 4 2 8 6.60 6.30 6.30
# Ashley 9 9 4 4 NA 6.50 6.80 6.79
# Ally NA 10 6 2 10 7.00 7.49 7.49
# Ali 9 9 2 4 6 6.00 5.74 5.74
# Ali 9 10 1 6 8 6.80 6.52 6.52
# Albert 8 NA 3 5 4 5.00 5.68 5.67
# Albert 10 5 4 NA 9 7.00 6.52 6.52
# Ally 9 6 6 1 NA 5.50 5.91 5.91
# Ali 10 9 NA 5 7 7.75 7.09 7.09
# Arthur 6 NA 5 6 8 6.25 7.43 7.43
# Amanda 10 9 6 3 7 7.00 7.16 7.16
# Anna 8 7 2 NA NA 5.67 4.99 4.99
# Aisha NA 9 5 5 8 6.75 7.63 7.63
# Ally 7 NA 4 3 8 5.50 5.98 5.98
# Aljona 8 7 4 5 7 6.20 6.48 6.48
# Amanda 8 8 1 4 8 5.80 5.42 5.42
# Amanda 7 8 NA 5 10 7.50 6.97 6.97
To run the test suite, open and run the script dev.R. It uses testthat from the tidyverse.
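As an illustration of what such a check might look like with testthat (this is a hypothetical test, not taken from dev.R):

```r
library(testthat)

# If every reviewer's mean equals the global mean, no correction
# should be applied, i.e. every strictness factor is exactly 1.
test_that("uniform ratings yield strictness factors of 1", {
  uniform <- matrix(5, nrow = 4, ncol = 3)
  factors <- mean(uniform) / colMeans(uniform)
  expect_true(all(factors == 1))
})
```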
The relationship between document author and reviewer may play an important role that is not reflected in our model, since the data motivating this package does not reveal anything about it.
Another interesting question is the notion of a reviewer's "competence" with regard to the document content: is the reviewer suitable for rating the document? That is another point we had to abstract away from: we assume that all reviewers have the competence and willingness to rate all documents fairly (by their own measure).
There are also some technical points one could question: is multiplying by the factors a good style of correction? Though it looked good in some datasets I've worked with, it may behave strangely in other cases, so my general advice would be: use with care and reflect on what this library is showing you :)
Our method only makes use of expected values and does not take the very important variance/standard deviation into account, so it is missing this dimension of information.
We are always open to feedback and improvements, so let us know if you want to share some thoughts!