An application of the OTrecod package


Package installation

If the OTrecod package is not installed in your current R version, you can install it with the standard instruction:

install.packages("OTrecod")

Each time an R session is opened, the OTrecod package must be loaded with:

library(OTrecod)

Moreover, the development version of OTrecod can currently be installed from GitHub with:

# Install development version from GitHub
devtools::install_github("otrecoding/OTrecod")

 

I. The context

This vignette illustrates how to use the tools contained in the **OTrecod** package to solve a variable recoding problem frequently encountered in the context of data fusion. For more details about the theory of the algorithms used in the functions **OT\_outcome** and **OT\_joint** of the package, users can consult (1), (2) and the documentation linked to each function.

For this example, we have transformed an available database called *samp.A* from the **StatMatch** package (see (3) and the help of *samp.A* for more details). The *samp.A* database provides a limited number of variables observed at the person level, among those usually collected in the *European Union Statistics on Income and Living Conditions Survey* (the EU–SILC survey). In this database, the variable *c.neti* is a factor that corresponds to the person's net income in thousands of euros, categorized into 7 ordered classes. The raw distribution of *c.neti* is shown in the following results:

library(StatMatch)
data(samp.A)
dim(samp.A)
head(samp.A)
table(samp.A$c.neti)   # Repartition of c.neti in the sample

To construct a standard recoding problem in data fusion, the *samp.A* database has been transformed as follows:

- From the *c.neti* variable, we built a variable called *c.neti.bis* that corresponds to a second encoding of the person's net income, now coded in 4 ordered levels (1: grouping the first two levels of *c.neti*, 2: grouping the two subsequent levels, 3: grouping levels 5 and 6, 4: the last level of *c.neti*).
- To improve the running time of the example, we only kept the first 350 subjects (rows) of *samp.A*.
- This new sample has been randomly divided into two distinct subsamples called *data1* (200 rows: subjects 1 to 200) and *data2* (150 rows: subjects 201 to 350).
- The *c.neti.bis* variable has been excluded from *data1* and, respectively, *c.neti* has been excluded from *data2*.
- Among the other variables, a random subset has been assigned to each database.

The R code related to these transformations was:

c.neti            = as.numeric(samp.A$c.neti)
samp.A$c.neti.bis = as.factor(ifelse(c.neti %in% c(1,2),1, 
                              ifelse(c.neti %in% c(3,4),2, 
                              ifelse(c.neti %in% c(5,6),3,4)))) 
data1 = samp.A[1:200,c(2:3,5,7:9,12:13)]
colnames(data1)[4] = "age" 
head(data1)
data2 = samp.A[201:350,c(3,5:6,8:11,13:14)]
head(data2)

In conclusion, after transformation, we had:

- two separate databases, *data1* and *data2*, with no common subject;
- *c.neti* and *c.neti.bis*: two variables that summarize the same information (the person's net income), encoded in two different scales in *data1* and *data2*;
- a subset of shared variables between the two databases (variables with the same labels in the two databases);
- subsets of variables specific to each database.

table(data1$c.neti)        # 7 levels in data1

table(data2$c.neti.bis)    # 4 levels in data2

colnames(data1)

colnames(data2)

intersect(colnames(data1), colnames(data2))   # the subset of a priori shared variables

OBJECTIVE

Assuming that the encoding of c.neti is unknown for the subjects of data2 and that the encoding of c.neti.bis is unknown for the subjects of data1, the functions of the OTrecod package solve this recoding problem by predicting the missing scale of the person's net income in one or both databases. This solution allows the user to fuse the two databases and finally work with a bigger, unique, synthetic dataset.

For the rest of the study, *c.neti* and *c.neti.bis* are called the target variables of *data1* and *data2* respectively. We deliberately limit this example to the prediction of the variable *c.neti.bis* in *data1*, but note that the proposed approach would be the same for the prediction of *c.neti* in *data2*.

 

II. Harmonization of the data sources

Knowing the objective of the study, we first prepare the two databases for data fusion. The two functions dedicated to this data fusion in the **OTrecod** package expect a database with a specific structure as argument. The **merge\_dbs** function assists users in this task by:

- overlaying the two databases,
- detecting the falsely shared variables,
- imputing incomplete information of shared variables if desired.

For this, we fill in the arguments of the **merge\_dbs** function by declaring, as follows, the column indexes of all the ordinal variables in *data1* and *data2* (including the indexes of the target variables). Here there is no need to handle missing information because all the shared variables are complete, but note that a specific argument (*impute*) exists in the function to take them into account when necessary.

db_test  = merge_dbs(data1, data2, NAME_Y = "c.neti", NAME_Z = "c.neti.bis",
                     ordinal_DB1 = c(2,3,4,7), ordinal_DB2 = c(1:2,6,9))

summary(db_test)

db_test$REMAINING_VAR

db_test$REMOVE1

db_test$REMOVE2

db_test$ID1_drop; db_test$ID2_drop

db_test$DB_READY[c(1:5,201:205),]   # The first 5 subjects of each database

In output:

- The *REMAINING_VAR* object gives the identity of the shared variables kept for the merging of the two databases.
- The *REMOVE1* object gives the identity of potential shared variables dropped because of discrepancies of types: this is the case for the variable *age*, stored as a factor in *data1* and as numeric in *data2*. Sometimes it is possible to reconcile the two encodings a posteriori, sometimes not. Here it would be possible to categorize the variable *age* in *data2* in the same way as *age* in *data1* (see the sketch below), but we decide for the example to discard it for the rest of the study.
- The *REMOVE2* object is null, which means that no factor has been dropped because of discrepancies of levels.
- The *ID1_drop* and *ID2_drop* objects are null, which means that the target variables *c.neti* and *c.neti.bis* have no missing values in their respective databases.
- Finally, the *DB_READY* object provides users with a synthetic database that overlays *data1* and *data2* (in this order), where the *DB* variable is the database identifier. The variable *Y* now corresponds to the target variable *c.neti*, with its specific encoding in *data1* and missing values in *data2*. In the same way, the variable *Z* corresponds to *c.neti.bis*. This database has the structure expected as argument by the **OT\_outcome** and **OT\_joint** functions.
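For illustration only, here is a minimal sketch of how the two encodings of *age* could have been reconciled before calling **merge\_dbs**. The class boundaries used below (16, 34, 44, 54, 64, 104) are an assumption to be checked against *levels(data1$age)*, not values taken from the package:

# Hypothetical sketch (not run): recoding the numeric age of data2 into the
# assumed age classes of data1 before the call to merge_dbs
age_breaks = c(16, 34, 44, 54, 64, 104)    # assumed class boundaries
age_cat    = cut(data2$age, breaks = age_breaks,
                 include.lowest = TRUE, ordered_result = TRUE)
table(age_cat)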

 

III. Selection of the matching variables

Among the set of shared variables kept in the *DB_READY* database output by **merge\_dbs**, it is important to discard those that are never good predictors of the person's net income, whatever the considered database. The subset of remaining variables will form the matching variables. This selection of matching variables can be done using the **select\_pred** function of the package.

This function proposes two levels of study to conclude:

- When the sample of shared variables is small, standard correlation studies are often enough to select the best set of predictors and get rid of potential multicollinearities between the candidates.
- When the conclusion of the first part is not so clear, this study can be completed by a random forest (RF) selection process (see the *RF* argument). This part is particularly convenient when the sample of covariates is large. The function proposes two random forest procedures: the standard RF, which estimates the permutation importance criterion of each shared variable, and the conditional importance measures described in (4).

At this step, the choice of the recoding algorithm (see (2)) to use a posteriori must be anticipated in the handling of the potential numeric shared variables. In this example, the *ww* variable is the only one concerned: while the **OT\_outcome** function runs whatever the type of the variables, the current version of **OT\_joint** does not allow numeric variables. For this vignette we will test the two algorithms, so we decide to transform *ww* into a categorical variable, using the quartiles of its distribution as thresholds of the four classes. This transformation can be handled by a standard R function (like *cut()*), or directly in the functions using the *convert\_num* and *convert\_class* arguments dedicated to this task, as sketched below.
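For information, here is a minimal sketch of the quartile-based conversion of *ww* using *cut()*; in the calls that follow, the same conversion is instead delegated to the *convert\_num* and *convert\_class* arguments:

# Quartile-based conversion of ww into a 4-level factor (shown for information)
ww_breaks = quantile(db_test$DB_READY$ww, probs = seq(0, 1, 0.25), na.rm = TRUE)
ww_cat    = cut(db_test$DB_READY$ww, breaks = ww_breaks, include.lowest = TRUE)
table(ww_cat)   # 4 classes of roughly equal size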

The selection of the best predictors of the person's net income must be done in the two databases separately, but the **select\_pred** function allows overlaid databases as arguments, using the following code:

# for data1
test_DB1 = select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Y",
                       quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
                       convert_num = 8, convert_class = 4,
                       thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
                       RF = TRUE, RF_SEED = 3017)

# for data2
test_DB2 = select_pred(db_test$DB_READY,Y = "Y", Z = "Z", ID = 1, OUT = "Z",
                       quanti = 8, nominal = c(1,5:6,7), ordinal = c(2:4),
                       convert_num = 8, convert_class = 4,
                       thresh_cat = 0.30, thresh_num = 0.70, thresh_Y = 0.10,
                       RF = TRUE, RF_SEED = 3017)

As input:

- The *OUT* argument specifies the target variable to predict.
- The *thresh\_cat* argument corresponds to the threshold of the Cramer's V coefficient (for the categorical variables).
- The *thresh\_num* argument corresponds to the threshold of the Spearman correlation coefficient (for the continuous and ordinal variables). Below these values, the collinearity between two shared variables is considered acceptable; below these values, a shared variable is also not considered a good predictor of the target variable.
- The *thresh\_Y* argument is reserved to the random forest procedures. It corresponds to an acceptability threshold related to the importance measure criteria: variables with a cumulative percentage of importance measures less than this threshold will be dropped from the final list of RF predictors.

Notice that it is important to keep the same arguments in the two selections (*test\_DB1* and *test\_DB2*) for an optimal comparability. Here the standard RF process is used, and the *ww* variable is converted to a categorical type before the selection using *convert\_num*.

Let us see the results for *data1*:

summary(test_DB1)

test_DB1$vcrm_OUTC_cat

test_DB1$collinear_PB

# Results from RF
test_DB1$drop_var

test_DB1$RF_PRED

In the set of shared variables, all variables are now categorical (factors or ordered factors). According to the *vcrm\_OUTC\_cat* output object, the best predictors of *c.neti* are: *sex*, *marital*, *urb*, *hsize5*, and *ww*, in that order. Nevertheless, a risk of collinearity is detected between *marital* and *hsize5*. The RF process finally suggests dropping *marital* and keeping only *sex*, *hsize5* and possibly *urb* as matching variables.

Let us see the results for *data2*:

summary(test_DB2)

test_DB2$vcrm_OUTC_cat

test_DB2$collinear_PB

# Results from RF
test_DB2$drop_var

test_DB2$RF_PRED

According to the *vcrm\_OUTC\_cat* output object, the best predictors of *c.neti.bis* are: *sex*, *hsize5*, *marital*, *urb*, and *ww*, in that order. A risk of collinearity is also detected between *marital* and *hsize5*. The RF process here suggests dropping *marital* and keeping only *sex* and *hsize5* as matching variables.

**In conclusion:**

- The variables *sex* and *hsize5* are the best predictors of the target variables in the two databases.
- The variable *marital* is dropped in both databases because of risks of multicollinearity with *hsize5*.
- The variable *ww* is never a good predictor of the target information.
- The variable *urb* is an acceptable predictor in the first database only.

Two groups of matching variables can therefore be considered: (*sex*, *hsize5*), or (*sex*, *hsize5*, *urb*) by also keeping the acceptable predictor of the first database.

**We finally keep the last group and drop the other one for the rest of the example.**

match_var = db_test$DB_READY[,-c(5,8)]
match_var[c(1:5,201:205),]

This overlaid database is now ready for recoding.

 

IV. Predicting the missing scales in the databases

The **OTrecod** package currently proposes two algorithms using optimal transportation theory (see (1) and (2) for details) to solve the recoding problem previously introduced. Each algorithm is implemented in a dedicated function, called **OT\_outcome** and **OT\_joint** respectively. These two functions can predict the missing values of *c.neti* in *data2*, the missing values of *c.neti.bis* in *data1*, or both, using the same argument called *which.DB*.

As with the **select\_pred** function, it is possible to directly transform the continuous matching variables if necessary, using the *convert\_num* and *convert\_class* arguments.

Let us see the R approach for the prediction of *c.neti.bis* in *data1*.

 

A) Transporting target variables to predict the missing scales

The algorithm of the **OT\_outcome** function solves an optimization problem to transfer the distributions of the target variables (or outcomes) from one database to the other. Using this result, the initial version of the algorithm (see (1)) transfers the distribution of *c.neti.bis* in *data2* to *data1* (and inversely for *c.neti* if necessary). In a distinct step, the algorithm then executes a nearest neighbor procedure to assign the individual predictions of *c.neti.bis* in *data1*. This version of the algorithm is available by setting the *indiv.method* argument of the **OT\_outcome** function to *sequential*.

The algorithm assumes the strong hypothesis that the variable *c.neti.bis* follows the same distribution in *data1* and *data2* (and similarly for the variable *c.neti*). If, in this example, this hypothesis is obviously verified by construction, there are also several contexts where it no longer holds.
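Because this example is simulated, the hypothesis can be checked empirically from the source database *samp.A* (a check impossible in a real data fusion context, where *c.neti* is unobserved in *data2*):

# Empirical comparison of the two distributions of c.neti (possible here
# only because both subsamples come from the known database samp.A)
round(prop.table(table(samp.A$c.neti[1:200])), 2)     # distribution in data1
round(prop.table(table(samp.A$c.neti[201:350])), 2)   # distribution in data2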

Enrichments have thus been proposed (see (2)) to relax this hypothesis via the *maxrelax* argument of **OT\_outcome**, and to directly provide the individual predictions without using the nearest neighbor procedure. This algorithm is available by setting the *indiv.method* argument of the **OT\_outcome** function to *optimal*. For our example, the corresponding R code for the prediction of *c.neti.bis* in *data1* is:

# sequential algorithm
mod1_seq = OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
                      maxrelax = 0 , indiv.method = "sequential", which.DB = "A")
summary(mod1_seq)

# optimal algorithm with no relaxation term
mod2_opt = OT_outcome(match_var, nominal = c(1,5:6), ordinal = 2:4, dist.choice = "E",
                      maxrelax = 0, indiv.method = "optimal", which.DB = "A")
head(mod2_opt$profile)

dim(mod2_opt$profile)

mod2_opt$gamma_A

head(mod2_opt$DATA1_OT)

As input:

- The *dist.choice* argument sets the distance function used in the computations ("E" corresponds to the Euclidean distance).
- The *maxrelax* argument fixes the value of the relaxation term (0 when no relaxation is applied).
- The *which.DB* argument restricts the individual predictions to one database ("A" corresponds to the first one, *data1*).

In output, these two models provide lists with the same structure:

- The *profile* object stores the profiles of covariates encountered in the databases.
- The *gamma_A* object stores the estimated joint distribution of *c.neti* and *c.neti.bis* used to predict *c.neti.bis* in *data1*.
- The *DATA1_OT* object corresponds to *data1* enriched with the *OTpred* variable, which stores the individual predictions of *c.neti.bis*.
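The distribution of these predictions can be quickly inspected with a minimal check relying on the *OTpred* column described above:

table(mod2_opt$DATA1_OT$OTpred)   # predicted c.neti.bis in data1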

 

B) Transporting target and shared variables to predict the missing scales

The objective of the algorithm implemented in the **OT\_joint** function is the same as that of the **OT\_outcome** function but, unlike **OT\_outcome**, which transfers only the distributions of the target variables to solve the optimization problem, the **OT\_joint** function transports the joint distribution of outcomes and covariates. This approach makes it possible to no longer assume the strong distributional hypothesis mentioned previously.

Nevertheless, as with the **OT\_outcome** function, enrichments have been proposed to relax the distributional constraints (see the *maxrelax* argument and (2)) and, optionally, to add a regularization term to the algorithm to smooth the variations of the target variables with respect to the matching variables (see the *lambda.reg* argument).

To guarantee the convergence of the algorithm:

- It is suggested to work with a limited number of matching variables, and so to select them rigorously using **select\_pred** or another selection process.
- Choosing a small value for the *prox.X* argument improves the convergence and the running time of the function.

For our example, the corresponding R code for the prediction of *c.neti.bis* in *data1* is:

# Algorithm with no enrichment
mod3_joint = OT_joint(match_var, nominal = c(1,5), ordinal = c(2:4,6), dist.choice = "E", 
                      prox.X = 0.10, which.DB = "A")
summary(mod3_joint)

For a better understanding, the input arguments and output objects of the **OT\_joint** function have been designed to be similar to those of the **OT\_outcome** function. As previously, the *OTpred* variable of the *DATA1\_OT* object stores the individual predictions of *c.neti.bis* in *data1*.
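These predictions can again be inspected directly, following the same *OTpred* convention as for **OT\_outcome**:

head(mod3_joint$DATA1_OT)            # data1 enriched with the OTpred column
table(mod3_joint$DATA1_OT$OTpred)    # predicted c.neti.bis in data1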

 

V. Validation of the individual predictions

The **verif\_OT** function gives access to different tools for assessing the reliability of the individual predictions proposed by the **OT\_outcome** and **OT\_joint** functions.

# Validation of the mod1_seq model
verif_out1 = verif_OT(mod1_seq, stab.prob = TRUE, min.neigb = 3)
verif_out1$conf.mat

verif_out1$res.prox

verif_out1$res.stab


# Validation of the mod2_opt model
verif_out2 = verif_OT(mod2_opt, stab.prob = TRUE, min.neigb = 3)
verif_out2$conf.mat
rate_good_pred = (37+40+31+45+18+13+9)/200   # cells of conf.mat consistent with the true grouping
rate_good_pred

verif_out2$res.prox

verif_out2$res.stab


# Validation of the mod3_joint model
verif_jt   = verif_OT(mod3_joint, stab.prob = TRUE, min.neigb = 3)
verif_jt$conf.mat

verif_jt$res.prox

verif_jt$res.stab

As input:

- When the *stab.prob* argument is set to *TRUE*, the function provides information about the average stability of the individual predictions.
- The *min.neigb* argument specifies that individuals whose probabilities of assignment have been estimated using fewer than 3 neighbors are dropped from this average (see the documentation of the function for more details).

As output, we have selected the most interesting objects to compare the three tested models:

- The *conf.mat* object provides the confusion matrix between *c.neti* and the individual predictions of *c.neti.bis* in *data1*. Here the two target variables *c.neti* and *c.neti.bis* are supposed to summarize the same information, so if the two scales are ordered, a clear structure in the confusion matrix reflects a good consistency of the predictions. In our example, by construction, we also know the real encoding of *c.neti.bis*: we can thus easily verify that the rate of good predictions equals 0.965 for the *mod2\_opt* model, 0.93 for the *mod1\_seq* model and 0.635 for the *mod3\_joint* model.
- Studying the proximity between the distributions of *c.neti* and *c.neti.bis* in *data1* via the Cramer's V criterion (see the *res.prox* object) confirms that the *mod2\_opt* model is the best adapted for our example.
- Finally, studying the stability of the predictions (see the *res.stab* object) shows good results whatever the considered model.
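Since this example is simulated, the rate of good predictions can also be recomputed directly against the true encoding built in section I. A minimal sketch, assuming the rows of *DATA1_OT* keep the order of *data1*:

# Direct check of the rate of good predictions (possible here only because
# the true c.neti.bis of data1 is known from the construction of the example)
true_Z1 = samp.A$c.neti.bis[1:200]
mean(as.character(mod2_opt$DATA1_OT$OTpred) == as.character(true_Z1))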

CONCLUSION

The first two models are more suitable here, where the strong distributional hypothesis previously introduced holds by construction. Nevertheless, the *mod3\_joint* model could be significantly improved by adding appropriate relaxation and regularization terms: now, R users, test it for yourself! A possible starting point is sketched below.
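For instance, an enriched variant of *mod3\_joint* could combine the two arguments introduced in section IV.B; the values 0.4 and 0.1 below are arbitrary assumptions to tune, not recommendations:

# Hypothetical enriched OT_joint model with relaxation and regularization
mod4_joint = OT_joint(match_var, nominal = c(1,5), ordinal = c(2:4,6),
                      dist.choice = "E", prox.X = 0.10,
                      maxrelax = 0.4, lambda.reg = 0.1, which.DB = "A")
summary(mod4_joint)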

References

(1) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679.

(2) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association.

(3) D'Orazio M (2015). Integration and imputation of survey data in R: the StatMatch package. Romanian Statistical Review, vol. 63(2), pages 57-68.

(4) Strobl C, Boulesteix A-L, Kneib T, Augustin T, Zeileis A (2008). Conditional Variable Importance for Random Forests. BMC Bioinformatics, 9, 307. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307

 


