
Fair Reviewers

A simple attempt at formalizing strictness in rating processes

The Problem

Suppose N, M ∈ ℕ with N >> M. Given N documents and M reviewers, it is not always feasible for every reviewer to rate every document. This can introduce unfairness into the competition between documents, because some reviewers are stricter than others: some documents may be rated only by very strict reviewers, while others get only graceful ones.

Ok, can you specify what you mean by strictness?

That's a good and critical question, and the answer is potentially debatable. We decided to make the following assumption:

For a fixed document D, the average rating converges to the "ideal balanced value" as the number of reviewers goes to infinity. If a single reviewer rates the document below this ideal limit, she is said to be strict. If she rates it above this limit, she is considered graceful.

Hm. How do you quantify this?

We use very basic probability theory to determine the expected value of an arbitrary rating and the expected value of a rating by a specific reviewer. For each reviewer we compute the ratio of these two values. If the reviewer is rather strict, this factor is above 1.0; if she is perfectly on par with the "global" distribution, the factor is exactly 1.0; and if she is rather graceful, it is below 1.0. We then multiply each document's points by the factors of its reviewers to balance out the strictness.
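
To make this concrete, here is a minimal sketch of the idea in plain R. It is not the package's actual implementation; strictness_factor(), balanced_scores() and the matrix layout (one column per reviewer, NA for missing ratings) are assumptions for illustration only.

# Minimal sketch, not the fairreviewers implementation.
# Assumes `ratings` is a numeric matrix: one column per reviewer,
# one row per document, NA where a reviewer did not rate a document.
strictness_factor <- function(ratings) {
  m <- as.matrix(ratings)
  global_mean   <- mean(m, na.rm = TRUE)       # expected value of an arbitrary rating
  reviewer_mean <- colMeans(m, na.rm = TRUE)   # expected value per reviewer
  global_mean / reviewer_mean                  # > 1: strict, < 1: graceful
}

# Correction idea: scale each reviewer's column by her factor,
# then average the corrected ratings per document.
balanced_scores <- function(ratings) {
  m <- as.matrix(ratings)
  corrected <- sweep(m, 2, strictness_factor(m), `*`)
  rowMeans(corrected, na.rm = TRUE)
}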

Ok, that's simple. But why don't you just use the Standard Score?

The honest answer: I started dabbling with the problem before I knew about the Standard Score. As I got deeper into statistics, I realized that the standard score is probably the better approach, since it takes the variance into account. Nevertheless, I kept this as a nice little experiment.
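
For comparison only (this is not part of the package), a standard-score variant could look roughly like this: each reviewer's column is centered and scaled, so every rating is expressed in standard deviations relative to that reviewer's own mean.

# Hypothetical z-score variant (not implemented in this package), assuming a
# numeric `ratings` matrix with one column per reviewer and NA for missing values.
z <- scale(as.matrix(ratings))   # center and scale each reviewer's column
rowMeans(z, na.rm = TRUE)        # per-document average standard score across reviewers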

Can I see it in action?

The repo itself is also shaped as an R package. See e.g. the R script getting_started.R for an introductory tour.

devtools::install_github("neumanrq/fairreviewers")
library(fairreviewers)
library(here)

# The rating dataset:
#
# Each column contains the ratings
# of a reviewer.
#
# Each row describes the ratings that
# a single applicant got.
data.env <- randomDataset(20, 5)

head(data.env$ratings)
# Output looks like this:
#
#       Roy Rose Robin Ricarda Ryan
# Arthur  10   10    NA       3    8
# Aisha   NA   10     3       7    7
# Anna    10   10     3      NA    9
# Aisha   10    9     4       2    8
# Ashley   9    9     4       4   NA
# Ally    NA   10     6       2   10
# …
# (Note that duplicated names can occur!)

# Let's start the analysis!
review <- fairreviewers::init(data.env$ratings)
strictness <- review$strictness # contains strictness factors for each reviewer

print(strictness)
# Output looks like this
#
#         Arithmetic Strictness Expected Strictness
# Roy     0.7541836             0.7540541
# Rose    0.7697874             0.7696552
# Robin   1.7717328             1.7714286
# Ricarda 1.5945595             1.5942857
# Ryan    0.8455997             0.8454545
#
# Interpretation: Robin and Ricarda are quite strict
#                 reviewers, whereas Roy is a rather graceful one

result <- review$result
print(result)
#        Roy Rose Robin Ricarda Ryan Mean Rating after AS correction Rating after ES correction
# Arthur  10   10    NA       3    8 7.75                       6.70                       6.70
# Aisha   NA   10     3       7    7 6.75                       7.52                       7.52
# Anna    10   10     3      NA    9 8.00                       7.04                       7.04
# Aisha   10    9     4       2    8 6.60                       6.30                       6.30
# Ashley   9    9     4       4   NA 6.50                       6.80                       6.79
# Ally    NA   10     6       2   10 7.00                       7.49                       7.49
# Ali      9    9     2       4    6 6.00                       5.74                       5.74
# Ali      9   10     1       6    8 6.80                       6.52                       6.52
# Albert   8   NA     3       5    4 5.00                       5.68                       5.67
# Albert  10    5     4      NA    9 7.00                       6.52                       6.52
# Ally     9    6     6       1   NA 5.50                       5.91                       5.91
# Ali     10    9    NA       5    7 7.75                       7.09                       7.09
# Arthur   6   NA     5       6    8 6.25                       7.43                       7.43
# Amanda  10    9     6       3    7 7.00                       7.16                       7.16
# Anna     8    7     2      NA   NA 5.67                       4.99                       4.99
# Aisha   NA    9     5       5    8 6.75                       7.63                       7.63
# Ally     7   NA     4       3    8 5.50                       5.98                       5.98
# Aljona   8    7     4       5    7 6.20                       6.48                       6.48
# Amanda   8    8     1       4    8 5.80                       5.42                       5.42
# Amanda   7    8    NA       5   10 7.50                       6.97                       6.97
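
If you want to rank applicants by their corrected score, something along these lines should work. Note that the column name below is taken verbatim from the print output above and may need adjusting depending on the package version.

# Sort applicants by the AS-corrected rating, best first.
ranked <- result[order(result[["Rating after AS correction"]], decreasing = TRUE), ]
head(ranked)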

Development

To run the test suite, open and run the script dev.R. It uses the testthat package.
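
As a rough illustration (dev.R itself may look different), a testthat check for the ratio-based factors could reuse the strictness_factor() sketch from above:

library(testthat)

test_that("a consistently low-rating reviewer gets a factor above 1", {
  ratings <- cbind(strict = c(2, 3, 2), graceful = c(9, 10, 9))
  factors <- strictness_factor(ratings)   # hypothetical helper, see the sketch above
  expect_gt(factors[["strict"]], 1)
  expect_lt(factors[["graceful"]], 1)
})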

Criticism & limitations

The relationship between document author and reviewer may play an important role which is not reflected in our model, since the data motivating this package does not reveal anything about it.

Another interesting question is the notion of "competence" of the reviewer regarding the document content: Is the reviewer suitable for rating the document at all? That's another point we had to abstract away from: we assume that all reviewers have the competence and the willingness to rate all documents fairly (by their own measure).

There are also some technical points one could question: Is this style of correction, multiplying by the factors, a good approach? Although it looked good in some datasets I have worked with, it may behave strangely in other cases, so my general advice would be: use it with care and reflect on what this library is showing you :)

Our method only makes use of expected values and does not take the very important variance/standard deviation into account, so it is missing that dimension of information.

We are always open to feedback and improvements, so let us know if you want to share some thoughts!


