selMtc.by.unc: Identifies the best combination of matching variables in reducing uncertainty when estimating the contingency table Y vs. Z

selMtc.by.unc    R Documentation

Identifies the best combination of matching variables in reducing uncertainty when estimating the contingency table Y vs. Z.

Description

This function identifies the “best” subset of matching variables in terms of reduction of uncertainty when estimating relative frequencies in the contingency table Y vs. Z. The sequential procedure presented in D'Orazio et al. (2017 and 2019) is implemented. This procedure avoids exploring all the possible combinations of the available X variables as in Fbwidths.by.x.

Usage

selMtc.by.unc(tab.x, tab.xy, tab.xz, corr.d=2, 
                    nA=NULL, nB=NULL, align.margins=FALSE) 

Arguments

tab.x

An R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g.
tab.x <- xtabs(~x1+x2+x3, data=data.all). At least three X variables are needed.

tab.xy

An R table of X vs. the Y variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xy <- xtabs(~x1+x2+x3+y, data=data.A).

A single categorical Y variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xy. Usually, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to tab.x (a warning appears if any absolute difference is greater than tol). Note that when the marginal distribution of X in tab.xy is not equal to that of tab.x it is possible to align them before the computations (see the argument align.margins).

tab.xz

An R table of X vs. the Z variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xz <- xtabs(~x1+x2+x3+z, data=data.B).

A single categorical Z variable is allowed. At least three categorical variables should be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Usually, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to tab.x (a warning appears if any absolute difference is greater than tol). Note that when the marginal distribution of X in tab.xz is not equal to that of tab.x it is possible to align them before computations (see argument align.margins).

corr.d

Integer, indicates the penalty to introduce when estimating the uncertainty by means of the average width of cell bounds. When corr.d=1 the penalty is the one introduced in D'Orazio et al. (2017) (i.e. “penalty1” in Fbwidths.by.x). When corr.d=2 (default) the penalty suggested in D'Orazio et al. (2019) is used (indicated as “penalty2” in Fbwidths.by.x). Finally, no penalty is applied when corr.d=0.

nA

Integer, sample size of file A used to estimate tab.xy. If NULL, it is obtained as the sum of the frequencies in tab.xy.

nB

Integer, sample size of file B used to estimate tab.xz. If NULL, it is obtained as the sum of the frequencies in tab.xz.

align.margins

Logical (default FALSE). When TRUE, the distribution of the X variables in tab.xy is aligned with the distribution resulting from tab.x, without affecting the marginal distribution of Y. Similarly, the distribution of the X variables in tab.xz is aligned with the distribution resulting from tab.x, without affecting the marginal distribution of Z. The alignment is performed by running the IPF algorithm as implemented in the function Estimate in the package mipfp. To avoid lack of convergence due to combinations of the Xs encountered in one table but not in the other (statistical zeros), a small constant (1e-06) is added to the empty cells of tab.xy and tab.xz before running the IPF.
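As an illustration of the alignment idea only, the minimal sketch below runs the IPF directly with Ipfp from the mipfp package (whereas selMtc.by.unc relies on the Estimate wrapper); all tables and dimensions are hypothetical, not the internal code of the function:

library(mipfp)
set.seed(1)
# hypothetical X1 x X2 x Y counts from file A and a reference X1 x X2 table
tab.xy <- array(rpois(2*3*2, lambda=5), dim=c(2,3,2))
tab.x  <- array(rpois(2*3, lambda=10), dim=c(2,3))
seed <- tab.xy + 1e-06                      # small constant avoids empty cells
tgt.list <- list(c(1,2), 3)                 # margins to reproduce: X1 x X2 and Y
tgt.data <- list(tab.x/sum(tab.x),                   # X margin taken from tab.x
                 apply(tab.xy, 3, sum)/sum(tab.xy))  # Y margin kept from tab.xy
out <- Ipfp(seed, tgt.list, tgt.data)
aligned.xy <- out$x.hat * sum(tab.xy)       # back to the counts scale of file A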

Details

This function follows the sequential procedure described in D'Orazio et al. (2017, 2019) to identify the combination of common variables most effective in reducing uncertainty when estimating the contingency table Y vs. Z. Initially, the available Xs are ordered according to the reduction of the average width of uncertainty bounds obtained when conditioning on each of them. Then, at each step, one of the remaining X variables is added until the table becomes too sparse; in practice, the procedure stops when:

min[ nA/(H_Dm*J), nB/(H_Dm*K) ] <= 1

where H_Dm is the number of cells in the table obtained by crossing the selected X variables, J is the number of categories of Y and K is the number of categories of Z.

For further details see also Fbwidths.by.x.
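A small numeric sketch of the stopping rule (the figures below are hypothetical and are not taken from the package internals):

nA <- 70; nB <- 75            # sizes of file A and file B
H.Dm <- 2 * 2 * 4             # cells obtained by crossing the selected Xs
J <- 2                        # categories of Y
K <- 7                        # categories of Z
min(nA/(H.Dm*J), nB/(H.Dm*K)) <= 1   # TRUE: the crossing is too sparse, stop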

Value

A list with the main outcomes of the procedure.

ini.ord

Average width of the uncertainty bounds when conditioning on each of the available X variables. The variable most effective in reducing uncertainty comes first. This ordering determines the order in which the X variables enter the sequential procedure.

list.xs

List with the various combinations of the matching variables being considered in each step.

av.df

Data.frame with all the relevant information for each combination of X variables considered. The last row corresponds to the combination of X variables identified as the best in reducing the average width of the uncertainty bounds (penalized or not, depending on the input argument corr.d). For each combination of X variables the following information is reported: the number of cells (name starts with “nc”); the number of empty cells (name starts with “nc0”); the average relative frequency (name starts with “av.crf”); the sparseness, measured as Cohen's effect size with respect to equiprobability (uniform distribution across cells). Finally, the value of the stopping criterion (“min.av”), the unconditioned average width of the uncertainty bounds (“avw”), the penalty term (“penalty”) and the penalized width (“avw.pen”; avw.pen = avw + penalty) are reported.

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433-440.

D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (forthcoming).

See Also

Fbwidths.by.x, Frechet.bounds.cat

Examples


data(quine, package="MASS") #loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
table(quine$c.Days)


# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(1111)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:4]
quine.B <- quine[-lab.A, c(1:3,6)]

# compute the tables required by Fbwidths.by.x()
freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)

freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

# apply Fbwidths.by.x()
bb <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, 
                           tab.xy=freq.xy,  tab.xz=freq.xz,
                           warn=FALSE)
bb$sum.unc
cc <- selMtc.by.unc(tab.x=freq.xA+freq.xB, 
                           tab.xy=freq.xy,  tab.xz=freq.xz, corr.d=0)
cc$ini.ord
cc$av.df
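
# The X margins of freq.xy and freq.xz (computed on files A and B
# separately) differ from those of freq.xA+freq.xB; setting
# align.margins=TRUE reconciles them via IPF before the search.
# Illustrative call; results may differ from the run above.
dd <- selMtc.by.unc(tab.x=freq.xA+freq.xB, 
                           tab.xy=freq.xy,  tab.xz=freq.xz, 
                           corr.d=0, align.margins=TRUE)
dd$av.df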


