fcKappa: Functions for measuring rater agreement.


Functions for measuring rater agreement.

Description

The functions take a “confusion matrix”, a square matrix where the rows and columns represent classifications by two different raters, and compute measures of rater agreement. Cohen's kappa (fcKappa) is corrected for random labeling of ratings. Goodman and Kruskal's lambda (gkLambda) is corrected for labeling every subject at the modal category.

Usage

fcKappa(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))
gkLambda(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))
accuracy(tab, weights = c("None", "Linear", "Quadratic"), W=diag(nrow(tab)))

Arguments

tab

A square matrix whose rows and columns represent rating categories from two raters or classifiers and whose cells contain observed (or expected) counts. If one classifier is regarded as the “truth”, it should be represented by the columns.

weights

A character scalar which should be one of “None”, “Linear”, or “Quadratic”, giving the weighting scheme to be used if W is not supplied directly (see Details).

W

A square matrix of the same size as tab giving the weights (see Details). If W is missing and weights is supplied, one of the standard weight matrices is used.

Details

Let's say the goal is to classify a number of subjects into K categories, and that two raters, Rater 1 and Rater 2, do the classification. (These could be human raters or machine classification algorithms; in particular, a Bayes net modal prediction is a classifier.) Let p_{ij} be the probability that Rater 1 places a subject into Category i and Rater 2 places the same subject into Category j. The K\times K matrix, tab, is the confusion matrix.

Note that tab could be a matrix of probabilities or a matrix of counts, which can easily be turned into a matrix of probabilities by dividing by the total. In the case of a Bayes net, expected counts could be used instead. For example, if Rater 1 was a Bayes net whose predicted probabilities for the three categories were (.5,.3,.2), and Rater 2 was the true category, which for this subject was 1, then that subject would contribute .5 to p_{1,1}, .3 to p_{2,1}, and .2 to p_{3,1}.
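For instance, the accumulation of expected counts might be sketched as follows; the preds and truth objects here are hypothetical stand-ins for a matrix of predicted category probabilities (one row per subject) and a vector of true labels.

set.seed(101)
cats <- c("Advanced","Intermediate","Novice")
K <- length(cats)
preds <- matrix(runif(5*K), 5, K)
preds <- preds/rowSums(preds)     # each row is a predicted probability vector
truth <- sample(cats, 5, replace=TRUE)
tab <- matrix(0, K, K, dimnames=list(estimated=cats, actual=cats))
for (s in seq_along(truth)) {
  ## each subject adds its predicted probabilities to the column
  ## corresponding to its true category
  tab[,truth[s]] <- tab[,truth[s]] + preds[s,]
}
tab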

In both cases, \sum p_{kk} is a measure of agreement between the two raters. If scaled as probabilities, the highest possible agreement is +1 and the lowest is 0. This is what the accuracy function computes.
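As a quick check on the arithmetic, the raw agreement for the reading example used in the Examples section below can be computed by hand; accuracy(read) should give the same value.

read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3)
p <- read/sum(read)   # already scaled as probabilities, but counts work the same way
sum(diag(p))          # sum of p_{kk}, approximately 0.881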

However, raw agreement has a problem as a measure of the quality of a rating: it depends on the distribution of the categories in the population of interest. In particular, if a majority of the subjects are of one category, then it is very easy to match just by labeling everything as the most frequent category.

The best-known correction is the Fleiss-Cohen kappa (fcKappa). This adjusts the agreement rate for the probability that the raters will agree by chance. Let p_{i+} be the row sums and p_{+j} be the column sums. The probability of a chance agreement is then \sum p_{k+}p_{+k}. So the adjusted agreement is:

\kappa = \frac{\sum p_{kk} - \sum p_{k+}p_{+k}}{1 - \sum p_{k+}p_{+k}} .

So kappa answers the question of how much better the raters do than chance agreement.
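The same quantity can be computed by hand from the reading example below; fcKappa(read) with the default (unweighted) setting should agree.

read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3)
p <- read/sum(read)
agree <- sum(diag(p))                   # sum of p_{kk}
chance <- sum(rowSums(p)*colSums(p))    # sum of p_{k+} p_{+k}
(agree - chance)/(1 - chance)           # approximately 0.809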

Goodman and Kruskal (1954) offered another way of normalizing. In this case, let Rater 1 be the true category and Rater 2 be the estimated category. Now consider a classifier which always classifies subjects into Category k; that classifier will be right with probability p_{k+}. The best such classifier will be right with probability \max p_{k+}. So the adjusted agreement becomes:

\lambda = \frac{\sum p_{kk} - \max p_{k+}}{1 - \max p_{k+}} .

Goodman and Kruskal's lambda (gkLambda) is appropriate when there is a different treatment associated with each category. In this case, lambda describes how much better one could do than treating every subject as if they were in the modal category.
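Again the value can be checked by hand against gkLambda(read) for the reading example below.

read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3)
p <- read/sum(read)
agree <- sum(diag(p))        # sum of p_{kk}
best <- max(rowSums(p))      # best always-guess-one-category classifier
(agree - best)/(1 - best)    # approximately 0.762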

Weights are used if the misclassification costs are not equal in all cases. If the misclassification cost is c_{ij}, then the weight is defined as w_{ij} = 1 - c_{ij}/\max c_{ij}. Weighted agreement is defined as \sum\sum w_{ij}p_{ij}.
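A small sketch of the weighted agreement, using a hypothetical cost matrix for three ordered categories (which happens to reproduce the quadratic weights described below):

cost <- matrix(c(0,1,4, 1,0,1, 4,1,0), 3, 3, byrow=TRUE)
W <- 1 - cost/max(cost)      # w_{ij} = 1 - c_{ij}/max c_{ij}
read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3)
p <- read/sum(read)
sum(W*p)                     # weighted agreement, sum of w_{ij} p_{ij}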

If the categories are ordered, there are three fairly standard weighting schemes (especially for kappa).

None

w_{ij} = 1 if i=j, 0 otherwise. (Diagonal matrix.)

Linear

w_{ij} = 1 - |i-j|/(K-1). Penalty increases with number of categories of difference.

Quadratic

w_{ij} = 1 - (i-j)^2/(K-1)^2. Penalty increases with square of number of categories of difference.

Indeed, quadratic weighted kappa is something of a standard in comparing two human raters or a single machine classification algorithm to a human rater.

The argument weights can be used to select one of these three weighting schemes. Alternatively, the weight matrix W can be specified directly.
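For example, the standard weight matrices for K ordered categories can be built directly and supplied as W; the result should match the corresponding weights setting.

K <- 3
d <- abs(outer(seq_len(K), seq_len(K), "-"))   # |i - j| for each cell
W.linear <- 1 - d/(K-1)
W.quadratic <- 1 - d^2/(K-1)^2
## e.g., fcKappa(read, W=W.quadratic) should match fcKappa(read, "Quadratic")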

Value

A real number between -1 and 1, with higher numbers indicating more agreement.

Author(s)

Russell Almond

References

Almond, R.G., Mislevy, R.J., Steinberg, L.S., Yan, D. and Williamson, D.M. (2015). Bayesian Networks in Educational Assessment. Springer. Chapter 7.

Fleiss, J. L., Levin, B. and Paik, M. C. (2003). Statistical Methods for Rates and Proportions. Wiley. Chapter 18.

Goodman, Leo A., Kruskal, William H. (1954). Measures of Association for Cross Classifications. Journal of the American Statistical Association. 49 (268), 732–764.

See Also

table

Examples


## Example from Almond et al. (2015).
read <- matrix(c(0.207,0.029,0,0.04,0.445,0.025,0,0.025,0.229),3,3,
        dimnames=list(estimated=c("Advanced","Intermediate","Novice"),
                      actual=c("Advanced","Intermediate","Novice")))

stopifnot (abs(fcKappa(read)-.8088) <.001)
stopifnot (abs(gkLambda(read)-.762475) <.001)

fcKappa(read,"Linear")
fcKappa(read,"Quadratic")
gkLambda(read,"Linear")
gkLambda(read,"Quadratic")


