View source: R/indiv_grp_closest.R
indiv_grp_closest | R Documentation |
This function sequentially assigns individual predictions using a nearest neighbor procedure to solve recoding problems in data fusion.
indiv_grp_closest( proxim, jointprobaA = NULL, jointprobaB = NULL, percent_closest = 1, which.DB = "BOTH" )
proxim |
a |
jointprobaA |
a matrix whose number of rows corresponds to the number of modalities of the target variable Y in database A, and whose number of columns corresponds to the number of modalities of Z in database B (the orientation used in the Examples, where the row sums give the distribution of Y). It gives an estimate of the joint probability of (Y,Z) in A. The cells of this matrix must sum to 1 |
jointprobaB |
a matrix whose number of rows corresponds to the number of modalities of the target variable Y in database A, and whose number of columns corresponds to the number of modalities of Z in database B (the orientation used in the Examples, where the row sums give the distribution of Y). It gives an estimate of the joint probability of (Y,Z) in B. The cells of this matrix must sum to 1 |
percent_closest |
a value between 0 and 1 (1 by default) corresponding to the fixed |
which.DB |
a character string (with quotes) that indicates which individual predictions need to be computed: only the individual predictions of Y in B ("B"), only those of Z in A ("A"), or both ("BOTH" by default) |
A. THE RECODING PROBLEM IN DATA FUSION
Assume that Y and Z are two variables that refer to the same target population in two separate databases A and B, respectively (no overlapping rows), so that Y and Z are never jointly observed. Assume also that A and B share a subset of common covariates X of any type (same encodings in A and B), complete or not. Integrating these two databases often requires solving the recoding problem by creating a unique database in which the missing information on Y and Z is fully completed.
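As a minimal sketch of this setup, the toy data below stacks two databases that share one covariate X, with Y observed only in A and Z observed only in B. All object names and values are hypothetical and are not taken from the package.

```r
# Toy illustration only: names and values are hypothetical.
A <- data.frame(DB = "A", X = c(1.2, 0.7, 3.1), Y = c(1, 2, 1), Z = NA)
B <- data.frame(DB = "B", X = c(0.9, 2.8), Y = NA, Z = c(3, 5))

# Stacked database: Y is missing on every row of B, Z on every row of A
AB <- rbind(A, B)

# Y and Z are never jointly observed on the same row
never_joint <- !any(!is.na(AB$Y) & !is.na(AB$Z))
```

Solving the recoding problem amounts to filling the NA cells of Y in B and of Z in A.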
B. DESCRIPTION OF THE FUNCTION
The function indiv_grp_closest is an intermediate function used in the implementation of an algorithm called OUTCOME (and its enrichment R-OUTCOME; see reference (2) for more details), dedicated to solving recoding problems in data fusion using Optimal Transportation theory.
The model is implemented in the function OT_outcome, which integrates the function indiv_grp_closest in its syntax as a possible second step of the algorithm.
The function indiv_grp_closest can also be used separately, provided that the argument proxim receives an output object of the function proxim_dist.
The latter is available in the package and can therefore be called beforehand.
The algorithms OUTCOME (and R-OUTCOME) are made of two independent parts. Assuming that the objective is the prediction of Z in the database A:
The first part of the algorithm solves the optimization problem by providing a solution called γ, which corresponds here to an estimate of the joint distribution of (Y,Z) in A.
From the first part, a nearest neighbor procedure is carried out as a second part to provide the individual predictions of Z in A: this procedure is implemented in the function indiv_grp_closest.
In other words, this function sequentially assigns to each individual of A the modality of Z that is closest.
Obviously, this algorithm runs in the same way for the prediction of Y in the database B.
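The idea behind the second part can be sketched in a few lines of base R. This is a simplified illustration that ignores the constraints imposed by γ and is not the package implementation: the distance matrix D, the vector zB and all values are hypothetical.

```r
# Hypothetical inputs (not the package's internal objects):
# D[i, j] = distance between individual i of A and individual j of B
D <- matrix(c(0.1, 0.9, 0.8,
              0.7, 0.2, 0.3,
              0.9, 0.4, 0.1),
            nrow = 3, byrow = TRUE)
zB <- c(1, 2, 2)  # observed modality of Z for each individual of B

# Average distance from each individual of A to each modality of Z
levels_Z <- sort(unique(zB))
avg_dist <- sapply(levels_Z, function(z) rowMeans(D[, zB == z, drop = FALSE]))

# Without any constraint, each individual of A takes the closest modality
pred_Z_in_A <- levels_Z[apply(avg_dist, 1, which.min)]
pred_Z_in_A  # 1 2 2
```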
The function indiv_grp_closest integrates in its syntax the function avg_dist_closest. Therefore, the related argument percent_closest is identical in the two functions.
Thus, when average distances must be computed between an individual i and a subset of individuals assigned to the same level of Y or Z, the user can decide whether all individuals of that subset take part in the computation (percent_closest = 1) or only a fixed fraction p (< 1) corresponding to the closest neighbors of i (in this case percent_closest = p).
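The effect of percent_closest can be illustrated with a small helper. The function avg_closest below is hypothetical (it is not avg_dist_closest), and the rounding rule used to turn the fraction p into a number of neighbors is an assumption.

```r
# Hypothetical helper (not the package's avg_dist_closest):
# mean distance from one individual to the closest fraction p of a group
avg_closest <- function(d, p = 1) {
  stopifnot(p > 0, p <= 1)
  k <- max(1, floor(p * length(d)))  # neighbors kept (assumed rounding rule)
  mean(sort(d)[seq_len(k)])
}

d_i <- c(0.2, 0.4, 0.6, 3.0)  # distances from individual i to a group of 4

avg_all  <- avg_closest(d_i, p = 1)     # all 4 neighbors: 1.05
avg_near <- avg_closest(d_i, p = 0.75)  # 3 closest neighbors only: 0.4
```

With p < 1 the distant individual (3.0) no longer inflates the average, which makes the group-level distance less sensitive to outliers.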
The arguments jointprobaA and jointprobaB correspond to the estimates of γ (the cells must sum to 1) in A and/or B respectively, according to the which.DB argument.
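Before passing a matrix to jointprobaA or jointprobaB, it may be worth checking it by hand. The sketch below uses a hypothetical 3 x 5 matrix gamma_hat and assumes, as in the Examples, that rows index the modalities of Y and columns those of Z.

```r
# Hypothetical estimate of gamma (rows: modalities of Y, columns: modalities of Z)
gamma_hat <- matrix(c(0.30, 0.10, 0,    0,    0,
                      0,    0.05, 0.15, 0.10, 0,
                      0,    0,    0,    0.10, 0.20),
                    nrow = 3, byrow = TRUE)

# The cells must sum to 1 (compared with a numerical tolerance)
sums_to_one <- isTRUE(all.equal(sum(gamma_hat), 1))

marg_Y <- rowSums(gamma_hat)  # implied marginal distribution of Y
marg_Z <- colSums(gamma_hat)  # implied marginal distribution of Z
```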
For example, assume that n_{Y_1} individuals are assigned to the first modality of Y in A and that the objective is the individual prediction of Z in A. Then, if jointprobaA[1,2] = 0.10, the number of these individuals that can be assigned to the second modality of Z in A cannot exceed 0.10 \times n_A.
If n_{Y_1} \leq 0.10 \times n_A, then all individuals assigned to the first modality of Y will be assigned to the second modality of Z.
At the end of the process, each individual still without an assignment receives the same modality of Z as their nearest neighbor in B.
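The arithmetic of this example can be checked with hypothetical sizes. The values of n_A and n_Y1 below are invented for illustration only.

```r
n_A      <- 300   # hypothetical number of individuals in A
n_Y1     <- 25    # hypothetical number with the first modality of Y in A
gamma_12 <- 0.10  # jointprobaA[1, 2]

# Upper bound on the individuals that can receive the second modality of Z
max_assigned <- gamma_12 * n_A  # 30

# TRUE here: all 25 individuals fit under the bound
all_fit <- n_Y1 <= max_assigned
```

Had n_Y1 exceeded the bound, the individuals left over would have to be assigned elsewhere, which is the situation the nearest neighbor fallback described above addresses.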
A list of two vectors of numeric values:
YAtrans |
a vector corresponding to the individual predictions of Y (numeric form) in the database B using the Optimal Transportation algorithm |
ZBtrans |
a vector corresponding to the individual predictions of Z (numeric form) in the database A using the Optimal Transportation algorithm |
Gregory Guernec, Valerie Gares, Jeremy Omer
(1) Gares V, Dimeglio C, Guernec G, Fantin F, Lepage B, Kosorok MR, Savy N (2019). On the use of optimal transportation theory to recode variables and application to database merging. The International Journal of Biostatistics. Volume 16, Issue 1, 20180106, eISSN 1557-4679. doi:10.1515/ijb-2018-0106
(2) Gares V, Omer J (2020). Regularized optimal transport of covariates and outcomes in data recoding. Journal of the American Statistical Association. doi:10.1080/01621459.2020.1775615
proxim_dist, avg_dist_closest, OT_outcome
data(simu_data)

### Example with the Manhattan distance
man1 <- transfo_dist(simu_data,
  quanti = c(3, 8), nominal = c(1, 4:5, 7),
  ordinal = c(2, 6), logic = NULL, prep_choice = "M"
)
mat_man1 <- proxim_dist(man1, norm = "M")

### Y(Yb1) and Z(Yb2) are the same information encoded in 2 different forms:
### (3 levels for Y and 5 levels for Z)
### ... Stored in two distinct databases, A and B, respectively
### The marginal distribution of Y in B is unknown,
### as is the marginal distribution of Z in A ...

# Empirical distribution of Y in database A:
freqY <- prop.table(table(man1$Y))
freqY

# Empirical distribution of Z in database B:
freqZ <- prop.table(table(man1$Z))
freqZ

# Suppose that the following matrix, called transport1, represents
# an estimate of the joint distribution L(Y,Z) ...
# Note that, in reality, this distribution is UNKNOWN and is
# estimated in the OT function by solving an optimisation problem.
transport1 <- matrix(c(
  0.3625, 0, 0, 0.07083333, 0.05666667, 0, 0, 0.0875, 0, 0,
  0.1075, 0, 0, 0.17166667, 0.1433333
), ncol = 5, byrow = FALSE)

# ... So that the marginal distributions of this object correspond to freqY and freqZ:
apply(transport1, 1, sum) # = freqY
apply(transport1, 2, sum) # = freqZ

# The assignments of the predicted values of Y in database B and Z in database A
# are stored in the following object:
pred_man1 <- indiv_grp_closest(mat_man1,
  jointprobaA = transport1, jointprobaB = transport1,
  percent_closest = 0.90
)
summary(pred_man1)

# For the prediction of Z in A only, add the corresponding argument:
pred_man1_A <- indiv_grp_closest(mat_man1,
  jointprobaA = transport1, jointprobaB = transport1,
  percent_closest = 0.90, which.DB = "A"
)