mixed.mtc | R Documentation |

This function implements some mixed methods to perform statistical matching between two data sources.

mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML", rho.yz=NULL, micro=FALSE, constr.alg="Hungarian")

`data.rec` |
A matrix or data frame that plays the role of |

`data.don` |
A matrix or data frame that plays the role of |

`match.vars` |
A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables ( |

`y.rec` |
A character vector with the name of the target variable Y that is observed only for units in |

`z.don` |
A character vector with the name of the target variable Z that is observed only for units in |

`method` |
A character vector that identifies the method that should be used to estimate the parameters of the regression models: Y vs. |

`rho.yz` |
A numeric value representing a guess for the correlation between the Y ( By default ( |

`micro` |
Logical. When |

`constr.alg` |
A string that has to be specified when |

This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:

(1) adoption of a parametric model for the joint distribution of *(\bold{X},Y,Z)* and estimation of its parameters;

(2) derivation of a complete “synthetic” data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.

In this case, as far as (1) is concerned, it is assumed that *(\bold{X},Y,Z)* follows a multivariate normal distribution. Please note that if some of the **X** are categorical, then they are recoded into dummies before starting with the estimation. In such a case, the assumption of multivariate normal distribution may be questionable.

The whole procedure is based on the imputation method known as *predictive mean matching*. The procedure consists of three steps:

**step 1a)** *Regression step*: the two linear regression models Y vs. **X** and Z vs. **X** are considered and their parameters are estimated.
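The regression step can be pictured with a minimal sketch. It fits each regression separately by ordinary least squares with `lm()`, which is an assumption made for illustration and may differ from what `mixed.mtc` does internally for `method="ML"` or `method="MS"`. The split of `iris` and the object names (`rec`, `don`, `fit.y`, `fit.z`) are hypothetical.

```r
# Illustrative sketch only: plain OLS fits, not the internals of mixed.mtc().
set.seed(1)
pos <- sample(1:150, 50)
# recipient file: Y (Sepal.Length) and the common X variables
rec <- iris[pos,  c("Sepal.Length", "Petal.Length", "Petal.Width")]
# donor file: Z (Sepal.Width) and the common X variables
don <- iris[-pos, c("Sepal.Width",  "Petal.Length", "Petal.Width")]

fit.y <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = rec)  # Y vs. X
fit.z <- lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = don)  # Z vs. X

coef(fit.y)  # intercept alpha_Y and slopes beta_YX
coef(fit.z)  # intercept alpha_Z and slopes beta_ZX

# residual variances (OLS estimates with degrees-of-freedom correction)
res.var.y <- summary(fit.y)$sigma^2
res.var.z <- summary(fit.z)$sigma^2
```

Neither fit says anything about the relationship between Y and Z given **X**, because the two variables are never observed together; that is the role of the `rho.yz` argument.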

**step 1b)** *Computation of intermediate values*. For the units in `data.rec` the following intermediate values are derived:

*z^_a = alpha_Z + beta_ZX * x_a + e_a*

for each *a=1,...,n_A*, where *n_A* is the number of units in `data.rec` (rows of `data.rec`). Note that *e_a* is a random draw from the multivariate normal distribution with zero mean and estimated residual variance *sigma_ZX*.

Similarly, for the units in `data.don` the following intermediate values are derived:

*y^_b = alpha_Y + beta_YX * x_b + e_b*

for each *b=1,...,n_B*, where *n_B* is the number of units in `data.don` (rows of `data.don`). Here *e_b* is a random draw from the multivariate normal distribution with zero mean and estimated residual variance *sigma_YX*.
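As a rough sketch of the two formulas above, the intermediate values can be built as a regression prediction plus a random normal residual. The code assumes OLS fits from `lm()` and a univariate residual for a single Y and a single Z; the data split and the names `z.hat.rec` and `y.hat.don` are hypothetical and do not come from `mixed.mtc`.

```r
# Illustrative sketch only: prediction plus random residual, following the
# formulas in step 1b; not the internals of mixed.mtc().
set.seed(2)
pos <- sample(1:150, 50)
rec <- iris[pos, ]   # plays the role of data.rec (Z treated as unobserved)
don <- iris[-pos, ]  # plays the role of data.don (Y treated as unobserved)

fit.y <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = rec)  # Y vs. X
fit.z <- lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = don)  # Z vs. X

# intermediate Z values for the recipients: alpha_Z + beta_ZX * x_a + e_a
z.hat.rec <- predict(fit.z, newdata = rec) +
  rnorm(nrow(rec), mean = 0, sd = summary(fit.z)$sigma)

# intermediate Y values for the donors: alpha_Y + beta_YX * x_b + e_b
y.hat.don <- predict(fit.y, newdata = don) +
  rnorm(nrow(don), mean = 0, sd = summary(fit.y)$sigma)
```

The random residual keeps the spread of the intermediate values close to that of observed values; without it they would all lie on the regression surface.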

**step 2)** *Matching step*. For each observation (row) in `data.rec` a donor is chosen in `data.don` through a nearest neighbor constrained distance hot deck procedure. The distances are computed between *(y_a, z^_a)* and *(y^_b, z_b)* using Mahalanobis distance.
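The distance computation in the matching step can be sketched with base R. This is a simplified, unconstrained nearest neighbour search: it picks the closest donor for each recipient and does not implement the constraint on donor reuse that the constrained procedure adds. The simulated points and the pooled covariance matrix `S` are assumptions made for illustration.

```r
# Illustrative sketch only: unconstrained nearest neighbour search with
# Mahalanobis distance. The donor-reuse constraint is NOT implemented here.
set.seed(3)
n.rec <- 5
n.don <- 8
# hypothetical (y_a, z^_a) pairs for recipients and (y^_b, z_b) pairs for donors
rec.pts <- cbind(y = rnorm(n.rec), z = rnorm(n.rec))
don.pts <- cbind(y = rnorm(n.don), z = rnorm(n.don))

# covariance matrix used to scale the distances (pooled here, an assumption)
S <- cov(rbind(rec.pts, don.pts))

# n.rec x n.don matrix of distances; mahalanobis() returns squared distances
d <- t(apply(rec.pts, 1, function(p) {
  sqrt(mahalanobis(don.pts, center = p, cov = S))
}))

nearest      <- apply(d, 1, which.min)             # closest donor per recipient
dist.nearest <- d[cbind(seq_len(n.rec), nearest)]  # the corresponding distances
```

In this unconstrained version the same donor may be selected for several recipients. A constrained search instead solves an assignment problem over the distance matrix `d`.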

For further details see Sections 2.5.1 and 3.6.1 in D'Orazio *et al.* (2006).

In step 1a) the parameters of the regression model can be estimated by means of the Maximum Likelihood method (`method="ML"`) (see D'Orazio *et al.*, 2006, pp. 19–23, 73–75) or using the Moriarity and Scheuren (2001 and 2003) approach (`method="MS"`) (see also D'Orazio *et al.*, 2006, pp. 75–76). The two estimation methods are compared in D'Orazio *et al.* (2005).

When `method="MS"`, if the value specified for the argument `rho.yz` is not compatible with the other correlation coefficients estimated from the data, then it is substituted with the closest value compatible with the other estimated coefficients.
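To see what "compatible" can mean, consider the simplest case of a single X variable. The 3x3 correlation matrix of (X, Y, Z) is positive semi-definite only if the correlation between Y and Z lies within a range determined by the other two correlations. The sketch below computes that range and moves an incompatible guess to the nearest bound. Both helper functions are hypothetical and are not part of the package; the adjustment that `mixed.mtc` actually applies may differ in detail.

```r
# Hypothetical helpers, not part of the package: admissible range for rho_YZ
# with a single X, from positive semi-definiteness of the correlation matrix.
rho.yz.bounds <- function(rho.xy, rho.xz) {
  half <- sqrt((1 - rho.xy^2) * (1 - rho.xz^2))
  c(lower = rho.xy * rho.xz - half, upper = rho.xy * rho.xz + half)
}

# Replace an incompatible guess with the closest admissible value.
closest.admissible <- function(rho.yz, rho.xy, rho.xz) {
  b <- rho.yz.bounds(rho.xy, rho.xz)
  min(max(rho.yz, b[["lower"]]), b[["upper"]])
}

# With rho_XY = 0.8 and rho_XZ = 0.6 the range is 0 to 0.96
# (up to floating-point rounding), so a guess of -0.15 is moved to the
# lower bound while a guess of 0.5 is left unchanged.
b        <- rho.yz.bounds(0.8, 0.6)
adjusted <- closest.admissible(-0.15, 0.8, 0.6)
kept     <- closest.admissible(0.5, 0.8, 0.6)
```

The stronger the correlations of Y and Z with X, the narrower the admissible range becomes.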

When `micro=FALSE` only the estimation of the parameters is performed (step 1a). Otherwise (`micro=TRUE`) the whole procedure is carried out.

A list with a varying number of components depending on the values of the arguments `method` and `rho.yz`.

`mu` |
The estimated mean vector. |

`vc` |
The estimated variance–covariance matrix. |

`cor` |
The estimated correlation matrix. |

`res.var` |
A vector with estimates of the residual variances |

`start.prho.yz` |
It is the initial guess for the partial correlation coefficient |

`rho.yz` |
Returned in output only when |

`phi` |
When |

`filled.rec` |
The |

`mtc.ids` |
when |

`dist.rd` |
A vector with the distances between each recipient unit and the corresponding donor, returned only in case |

`call` |
How the function has been called. |

Marcello D'Orazio mdo.statmatch@gmail.com

D'Orazio, M., Di Zio, M. and Scanu, M. (2005). “A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study”, *Contributi*, **2005/10**, Istituto Nazionale di Statistica, Rome.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). *Statistical Matching: Theory and Practice.* Wiley, Chichester.

Hornik, K. (2012). *clue: Cluster ensembles*. R package version 0.3-45. https://CRAN.R-project.org/package=clue.

Moriarity, C., and Scheuren, F. (2001). “Statistical matching: a paradigm for assessing the uncertainty in the procedure”, *Journal of Official Statistics*, **17**, 407–422.

Moriarity, C., and Scheuren, F. (2003). “A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation”, *Journal of Business and Economic Statistics*, **21**, 65–73.

`NND.hotdeck`, `mahalanobis.dist`

    library(StatMatch)  # package that provides mixed.mtc()

    # reproduce the statistical matching framework
    # starting from the iris data.frame
    suppressWarnings(RNGversion("3.5.0"))
    set.seed(98765)
    pos <- sample(1:150, 50, replace=FALSE)
    ir.A <- iris[pos, c(1,3:5)]
    ir.B <- iris[-pos, 2:5]
    xx <- intersect(colnames(ir.A), colnames(ir.B))
    xx  # common variables

    # ML estimation method under CIA ((rho_YZ|X=0));
    # only parameter estimates (micro=FALSE)
    # only continuous matching variables
    xx.mtc <- c("Petal.Length", "Petal.Width")
    mtc.1 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width")
    # estimated correlation matrix
    mtc.1$cor

    # ML estimation method under CIA ((rho_YZ|X=0));
    # only parameter estimates (micro=FALSE)
    # categorical variable 'Species' used as matching variable
    xx.mtc <- xx
    mtc.2 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width")
    # estimated correlation matrix
    mtc.2$cor

    # ML estimation method with partial correlation coefficient
    # set equal to 0.5 (rho_YZ|X=0.5)
    # only parameter estimates (micro=FALSE)
    mtc.3 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width",
                       rho.yz=0.5)
    # estimated correlation matrix
    mtc.3$cor

    # ML estimation method with partial correlation coefficient
    # set equal to 0.5 (rho_YZ|X=0.5)
    # with imputation step (micro=TRUE)
    mtc.4 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width",
                       rho.yz=0.5, micro=TRUE, constr.alg="Hungarian")
    # first rows of data.rec filled in with z
    head(mtc.4$filled.rec)

    # Moriarity and Scheuren estimation method under CIA;
    # only with parameter estimates (micro=FALSE)
    mtc.5 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width",
                       method="MS")
    # the starting value of rho.yz and the value used
    # in computations
    mtc.5$rho.yz
    # estimated correlation matrix
    mtc.5$cor

    # Moriarity and Scheuren estimation method
    # with correlation coefficient set equal to -0.15 (rho_YZ=-0.15)
    # with imputation step (micro=TRUE)
    mtc.6 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                       y.rec="Sepal.Length", z.don="Sepal.Width",
                       method="MS", rho.yz=-0.15,
                       micro=TRUE, constr.alg="lpSolve")
    # the starting value of rho.yz and the value used
    # in computations
    mtc.6$rho.yz
    # estimated correlation matrix
    mtc.6$cor
    # first rows of data.rec filled in with z imputed values
    head(mtc.6$filled.rec)
