error_group: error_group()
In OTrecod: Data Fusion using Optimal Transportation Theory

Description

This function studies the association between two categorical distributions with different numbers of modalities.

Usage

error_group(REF, Z, ord = TRUE)

Arguments

`REF`	a factor with a reference number of levels.
`Z`	a factor with a number of levels greater than the number of levels of the reference.
`ord`	a boolean. If TRUE, only neighboring levels of Z will be grouped and tested together.

Details

Assuming that Y and Z are categorical variables summarizing the same information, and that one of the two encodings is unknown to the user (for example, because it results from the predictions of a given model or algorithm), the function error_group searches for groupings of the modalities of Y that best approximate the distribution of Z. Note that, in this section, Y denotes the factor with the larger number of levels (passed as the argument Z) and Z denotes the reference (passed as the argument REF).

Assume that Y and Z have n_Y and n_Z modalities respectively, with n_Y > n_Z. In a first step, the function error_group combines modalities of Y to build all possible variables Y' such that n_{Y'} = n_Z. In a second step, the association between Z and each generated variable Y' is measured by studying the ratio of concordant pairs in the confusion matrix, and also by using standard criteria: Cramér's V (1), Cohen's kappa coefficient (2) and Spearman's rank correlation coefficient.

According to the type of Y, different combinations of modalities are tested:

If Y and Z are ordinal (ord = TRUE), only consecutive modalities of Y will be grouped to build the variables Y'.
If Y and Z are nominal (ord = FALSE), all combinations of modalities of Y (consecutive or not) will be grouped to build the variables Y'.
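
The ordinal case above can be illustrated with a short base R sketch. It lists every way of merging k ordered levels into m groups of consecutive levels by choosing m - 1 cut points among the k - 1 gaps between levels. The helper consecutive_groupings is hypothetical and is not part of OTrecod; the enumeration actually used inside error_group may differ.

```r
# Hypothetical helper (not part of OTrecod): enumerate every way to merge
# k ordered levels into m groups made of consecutive levels only.
consecutive_groupings <- function(k, m) {
  # A grouping is defined by m - 1 cut points chosen among the k - 1 gaps
  cuts <- combn(k - 1, m - 1, simplify = FALSE)
  lapply(cuts, function(cp) {
    # Level i receives the label 1 + (number of cut points located before i)
    findInterval(seq_len(k), cp + 1) + 1L
  })
}

# Merging 4 ordered levels into 2 groups gives choose(3, 1) = 3 candidates
groupings <- consecutive_groupings(4, 2)
length(groupings)
groupings[[2]]   # new group label of each of the 4 original levels
```

Each element of the returned list gives, for every original level, the label of the group it is merged into.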

All the associations tested are listed in output as a data.frame object. The function error_group is directly integrated in the function verif_OT to evaluate the proximity of two multinomial distributions, when one of them is estimated from the predictions of an OT algorithm.

Example: Assume that Y = (1,1,2,2,3,3,4,4) and Z = (1,1,1,1,2,2,2,2), so that n_Y = 4, n_Z = 2, and the correlation coefficient cor(Y,Z) is 0.89. Are there groupings of the modalities of Y that bring Y closer to Z? Starting from Y, the function error_group answers this question by successively constructing the variables Y_1 = (1,1,1,1,2,2,2,2), Y_2 = (1,1,2,2,1,1,2,2) and Y_3 = (1,1,2,2,2,2,1,1), and by testing cor(Z,Y_1) = 1, cor(Z,Y_2) = 0 and cor(Z,Y_3) = 0. Here, the tests show that the difference in encoding between Y and Z amounts to a simple grouping of modalities.
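
The correlations quoted in this example can be checked by hand in base R. The sketch below does not call error_group; the helper regroup is a hypothetical name introduced only for illustration, and it recodes Y by mapping each of its four levels to a new group label.

```r
Y <- c(1, 1, 2, 2, 3, 3, 4, 4)
Z <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Hypothetical helper: map[i] is the new label given to level i of y
regroup <- function(y, map) map[y]

Y1 <- regroup(Y, c(1, 1, 2, 2))  # levels {1,2} against {3,4}
Y2 <- regroup(Y, c(1, 2, 1, 2))  # levels {1,3} against {2,4}
Y3 <- regroup(Y, c(1, 2, 2, 1))  # levels {1,4} against {2,3}

# Correlation before grouping, then after each candidate grouping
cor_before <- cor(Y, Z)
cor_after  <- sapply(list(Y1 = Y1, Y2 = Y2, Y3 = Y3), function(y) cor(Z, y))

round(cor_before, 2)
cor_after
```

Only Y1 reproduces Z exactly, which is why its correlation with Z equals 1 while the two other groupings are uncorrelated with Z.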

Value

A data.frame with five columns:

`combi`	the first column enumerates all possible groups of modalities of Y to obtain the same number of levels as the reference.
`error_rate`	the second column gives the corresponding error rate from the confusion matrix (ratio of non-diagonal elements).
`Kappa`	this column indicates the result of the Cohen's kappa coefficient related to each combination of Y
`Vcramer`	this column indicates the result of the Cramer's V criterion related to each combination of Y
`RankCor`	this column indicates the result of the Spearman's coefficient of correlation related to each combination of Y
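
To make the meaning of these columns concrete, the following base R sketch computes the four criteria for a single pair of factors with the same number of levels, using their textbook definitions. It is an illustration under stated assumptions, not the package's code: the toy vectors ref and pred are invented, Cramér's V is derived from the uncorrected Pearson chi-squared statistic, and error_group may rely on different conventions or helper packages.

```r
# Toy data (invented for illustration): a reference and one regrouped variable
ref  <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
pred <- factor(c(1, 1, 1, 2, 2, 2, 2, 1))

tab <- table(ref, pred)   # confusion matrix
n   <- sum(tab)

# Error rate: share of observations lying off the diagonal
error_rate <- 1 - sum(diag(tab)) / n

# Cohen's kappa: observed agreement corrected for chance agreement
p_obs <- sum(diag(tab)) / n
p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2
cohen_kappa <- (p_obs - p_exp) / (1 - p_exp)

# Cramer's V from the Pearson chi-squared statistic (no continuity correction);
# the small-sample warning of chisq.test is silenced for this toy table
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
v_cramer <- sqrt(as.numeric(chi2) / (n * (min(dim(tab)) - 1)))

# Spearman's rank correlation on the integer codes of the levels
rank_cor <- cor(as.integer(ref), as.integer(pred), method = "spearman")

c(error_rate = error_rate, Kappa = cohen_kappa,
  Vcramer = v_cramer, RankCor = rank_cor)
```

A good grouping combines a low error rate with values of the three other criteria close to 1.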

Author(s)

Gregory Guernec

otrecod.pkg@gmail.com

References

(1) Cramér, Harald. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.
(2) McHugh, Mary L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica. 22 (3): 276–282.

Examples


# Basic examples:
sample1 <- as.factor(sample(1:3, 50, replace = TRUE))
length(sample1)
sample2 <- as.factor(sample(1:2, 50, replace = TRUE))
length(sample2)
sample3 <- as.factor(sample(c("A", "B", "C", "D"), 50, replace = TRUE))
length(sample3)
sample4 <- as.factor(sample(c("A", "B", "C", "D", "E"), 50, replace = TRUE))
length(sample4)

# Grouping only consecutive levels of sample4 (5 levels) to match sample1 (3 levels):
error_group(sample1, sample4)
# Grouping all possible levels of sample1, consecutive or not, to match sample2:
error_group(sample2, sample1, ord = FALSE)



### Using a sample of the tab_test object (3 complete covariates)
### Y1 and Y2 are the same variable encoded in 2 different forms in DB 1 and 2:
### (4 levels for Y1 and 3 levels for Y2)

data(tab_test)
# Example with n1 = n2 = 70 and only X1 and X2 as covariates
tab_test2 <- tab_test[c(1:70, 5001:5070), 1:5]

### An example of JOINT model (Manhattan distance)
# Suppose we want to impute the missing parts of Y1 in DB2 only ...
try1J <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "M", which.DB = "B"
)

# Error rates between Y2 and the predictions of Y1 in DB 2,
# obtained by grouping the levels of Y1:
error_group(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)
table(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)

OTrecod documentation built on Oct. 5, 2022, 5:06 p.m.