fcKappa (R Documentation)
The functions take a “confusion matrix”, a square matrix where the rows
and columns represent classifications by two different raters, and compute
measures of rater agreement. Cohen's kappa (fcKappa) corrects for chance
agreement from random labeling of ratings. Goodman and Kruskal's lambda
(gkLambda) corrects for the agreement obtained by labeling every subject
with the modal category.
fcKappa(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))
gkLambda(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))
accuracy(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))
tab: A square matrix whose rows and columns represent rating categories
from two raters or classifiers and whose cells represent observed (or
expected) counts. If one classifier is regarded as “truth”, it should be
represented by the columns.

weights: A character scalar which should be one of “None”, “Linear”, or
“Quadratic”, selecting one of the standard weighting schemes (see details).

W: A square matrix of the same size as tab giving the weights (see
details). If missing, it is constructed according to the weights argument.
Let's say the goal is to classify a number of subjects into K categories,
and that two raters, Rater 1 and Rater 2, do the classification. (These
could be human raters or machine classification algorithms. In particular,
a Bayes net modal prediction is a classifier.) Let p_{ij} be the
probability that Rater 1 places a subject into Category i and Rater 2
places the same subject into Category j. The K \times K matrix, tab, is
the confusion matrix.
Note that tab could be a matrix of probabilities or a matrix of counts,
which can easily be turned into a matrix of probabilities by dividing by
the total. In the case of a Bayes net, expected counts could be used
instead. For example, if Rater 1 was a Bayes net whose predicted
probabilities for the three categories were (.5, .3, .2), and Rater 2 was
the true category, which for this subject was 1, then that subject would
contribute .5 to p_{1,1}, .3 to p_{2,1}, and .2 to p_{3,1}.
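As an illustration, the sketch below (not part of the package API; the
category names and the prediction are hypothetical, echoing the example
at the end of this page) accumulates one such subject's expected counts
into a confusion matrix:

## Rater 1: Bayes net prediction for one subject; Rater 2: true category
pred <- c(Advanced = 0.5, Intermediate = 0.3, Novice = 0.2)
truth <- "Advanced"
tab <- matrix(0, 3, 3,
              dimnames = list(estimated = names(pred), actual = names(pred)))
tab[, truth] <- tab[, truth] + pred   # adds .5, .3, .2 down the "truth" column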
In either case, \sum p_{kk} is a measure of agreement between the two
raters. If scaled as probabilities, the highest possible agreement is +1
and the lowest 0. This is the accuracy function.
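A minimal sketch of this raw-agreement calculation, assuming tab is any
square confusion matrix of counts or probabilities (such as the one built
above):

p <- tab / sum(tab)   # rescale to probabilities
sum(diag(p))          # raw agreement (accuracy), between 0 and 1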
However, raw agreement has a problem as a measure of the quality of a
rating: it depends on the distribution of the categories in the population
of interest. In particular, if a majority of the subjects are of one
category, then it is very easy to match just by labeling everything as the
most frequent category.
The best-known correction is the Fleiss-Cohen kappa (fcKappa). This
adjusts the agreement rate for the probability that the raters will agree
by chance. Let p_{i+} be the row sums and p_{+j} be the column sums. The
probability of a chance agreement is then \sum p_{k+}p_{+k}, so the
adjusted agreement is:

\kappa = \frac{\sum p_{kk} - \sum p_{k+}p_{+k}}{1 - \sum p_{k+}p_{+k}} .
So kappa answers the question of how much better the raters do than chance
agreement.
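The following sketch computes the unweighted version of this formula
directly, assuming tab is a confusion matrix as above; fcKappa itself also
handles the weighted cases:

p <- tab / sum(tab)
chance <- sum(rowSums(p) * colSums(p))   # \sum p_{k+} p_{+k}
(sum(diag(p)) - chance) / (1 - chance)   # unweighted kappa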
Goodman and Kruskal (1954) offered another way of normalizing. In this
case, let Rater 1 be the true category and Rater 2 be the estimated
category. Now look at a classifier which always classifies subjects in
Category k; that classifier will be right with probability p_{k+}. The
best such classifier achieves \max p_{k+}. So the adjusted agreement
becomes:

\lambda = \frac{\sum p_{kk} - \max p_{k+}}{1 - \max p_{k+}} .
Goodman and Kruskal's lambda (gkLambda) is appropriate when there is a
different treatment associated with each category. In this case, lambda
describes how much better one could do than treating every subject as if
they were in the modal category.
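A corresponding sketch of the unweighted lambda, computed directly from
the formula above (again assuming tab is a confusion matrix; with the read
matrix from the examples below, this and the kappa sketch reproduce, to
within rounding, the unweighted values checked by the stopifnot() calls):

p <- tab / sum(tab)
modal <- max(rowSums(p))                 # \max p_{k+}
(sum(diag(p)) - modal) / (1 - modal)     # unweighted lambda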
Weights are used if the misclassification costs are not equal in all
cases. If the misclassification cost is c_{ij}, then the weight is defined
as w_{ij} = 1 - c_{ij}/\max c_{ij}. Weighted agreement is defined as
\sum \sum w_{ij}p_{ij}.
If the categories are ordered, there are three fairly standard weighting
schemes (especially for kappa); a construction sketch follows the list.

None: w_{ij} = 1 if i=j, 0 otherwise. (Identity matrix.)

Linear: w_{ij} = 1 - |i-j|/(K-1). Penalty increases with the number of
categories of difference.

Quadratic: w_{ij} = 1 - (i-j)^2/(K-1)^2. Penalty increases with the square
of the number of categories of difference.
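For K ordered categories these weight matrices can be built as follows
(assuming tab as above); the weighted agreement from the previous
paragraph is then the sum of the elementwise product with the probability
matrix:

K <- nrow(tab)
d <- abs(outer(1:K, 1:K, "-"))     # |i - j| for every cell
W.none <- diag(K)
W.linear <- 1 - d / (K - 1)
W.quadratic <- 1 - d^2 / (K - 1)^2
p <- tab / sum(tab)
sum(W.quadratic * p)               # weighted agreement, quadratic weights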
Indeed, quadratic weighted kappa is something of a standard in comparing two human raters or a single machine classification algorithm to a human rater.
The argument weights
can be used to select one of these three
weighting schemes. Alternatively, the weight matrix W
can be
specified directly.
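As a sketch of supplying W directly, a hypothetical asymmetric cost matrix
can be converted to weights with the formula above and passed by name as
in the usage line; the cost values here are purely illustrative:

costs <- matrix(c(0, 1, 4,
                  1, 0, 1,
                  2, 1, 0), 3, 3, byrow = TRUE)  # hypothetical costs c_{ij}
W <- 1 - costs / max(costs)                      # w_{ij} = 1 - c_{ij}/max(c_{ij})
fcKappa(tab, W = W)    # tab: any 3 x 3 confusion matrix, e.g. read below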
A real number between -1 and 1, with higher numbers indicating more
agreement.
Russell Almond
Almond, R.G., Mislevy, R.J., Steinberg, L.S., Yan, D. and Williamson, D. M. (2015). Bayesian Networks in Educational Assessment. Springer. Chapter 7.
Fleiss, J. L., Levin, B. and Paik, M. C. (2003). Statistical Methods for Rates and Proportions. Wiley. Chapter 18.
Goodman, Leo A., Kruskal, William H. (1954). Measures of Association for Cross Classifications. Journal of the American Statistical Association. 49 (268), 732–764.
table
## Example from Almond et al. (2015).
read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3,
dimnames=list(estimated=c("Advanced","Intermediate","Novice"),
actual=c("Advanced","Intermediate","Novice")))
stopifnot (abs(fcKappa(read)-.8088) <.001)
stopifnot (abs(gkLambda(read)-.762475) <.001)
fcKappa(read,"Linear")
fcKappa(read,"Quadratic")
gkLambda(read,"Linear")
gkLambda(read,"Quadratic")