selMtc.by.unc | R Documentation |

This function identifies the “best” subset of matching variables in terms of reduction of uncertainty when estimating relative frequencies in the contingency table Y vs. Z. It implements the sequential procedure presented in D'Orazio *et al.* (2017, 2019), which avoids exploring all the possible combinations of the available X variables, as `Fbwidths.by.x` does.

selMtc.by.unc(tab.x, tab.xy, tab.xz, corr.d=2, nA=NULL, nB=NULL, align.margins=FALSE)

`tab.x` |
A |

`tab.xy` |
A A single categorical Y variable is allowed. At least |

`tab.xz` |
A A single categorical Z variable is allowed. At least |

`corr.d` |
Integer, indicates the penalty that should be introduced in estimating the uncertainty by means of the average width of cell bounds. When |

`nA` |
Integer, sample size of file A used to estimate |

`nB` |
Integer, sample size of file B used to estimate |

`align.margins` |
Logical (default |

This function follows the sequential procedure described in D'Orazio *et al.* (2017, 2019) to identify the combination of common variables most effective in reducing uncertainty when estimating the contingency table Y vs. Z. Initially, the available X variables are ordered according to the reduction of the average width of the uncertainty bounds obtained when conditioning on each of them. Then, at each step, one of the remaining X variables is added until the table becomes too sparse; in practice the procedure stops when:

`min[ nA/(H_Dm*J), nB/(H_Dm*K) ] <= 1`

For further details see also `Fbwidths.by.x`.
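The stopping rule can be sketched in base R. This is only an illustration, not the package's internal code: the helper name `too_sparse` is hypothetical, and the sketch assumes that H_Dm is the number of cells obtained by crossing the X variables selected so far, and that J and K are the numbers of categories of Y and Z.

```r
# Hypothetical helper illustrating the sparseness check; not part of StatMatch.
# nA, nB: sample sizes of files A and B
# H:      number of cells from crossing the selected X variables
#         (assumed meaning of H_Dm)
# J, K:   number of categories of Y and Z (assumed meaning)
too_sparse <- function(nA, nB, H, J, K) {
  min(nA / (H * J), nB / (H * K)) <= 1
}

# Toy numbers, chosen only for illustration
too_sparse(nA = 100, nB = 150, H = 4,  J = 2, K = 5)  # FALSE: min(12.5, 7.5) > 1
too_sparse(nA = 100, nB = 150, H = 40, J = 2, K = 5)  # TRUE:  min(1.25, 0.75) <= 1
```

Under these assumptions, each X variable added multiplies H by its number of categories, so the ratio shrinks quickly and the procedure stops after a few steps in small samples.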

A list with the main outcomes of the procedure.

`ini.ord` |
Average width of uncertainty bounds when conditioning on each of the available X variables. The variable most effective in reducing uncertainty comes first. This ordering determines the order in which the variables enter the sequential procedure. |

`list.xs` |
List with the various combinations of the matching variables being considered in each step. |

`av.df` |
Data.frame with all the relevant information for each combination of X variables. The last row corresponds to the combination of the X variables identified as the best in reducing average width of uncertainty bounds (penalized or not depending on the input argument |
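The quantity behind `ini.ord`, the average width of the uncertainty bounds given a single X variable, can be illustrated with a small base R sketch. It is not the package's code: the function name `avg_width_given_x` is hypothetical, no penalty (`corr.d`) is applied, and the sketch assumes the usual Fréchet bounds conditional on X discussed in D'Orazio *et al.* (2006), with the marginal distribution of X and the conditional distributions of Y and Z given X estimated from the input tables.

```r
# Hypothetical illustration; not part of StatMatch and without any penalty.
# tab.x:  counts of a single X variable (vector)
# tab.xy: counts of X (rows) by Y (columns)
# tab.xz: counts of X (rows) by Z (columns)
avg_width_given_x <- function(tab.x, tab.xy, tab.xz) {
  p.x  <- prop.table(tab.x)               # estimated P(X = x)
  p.yx <- prop.table(tab.xy, margin = 1)  # estimated P(Y = y | X = x)
  p.zx <- prop.table(tab.xz, margin = 1)  # estimated P(Z = z | X = x)
  J <- ncol(p.yx)
  K <- ncol(p.zx)
  width <- matrix(0, J, K)
  for (j in seq_len(J)) {
    for (k in seq_len(K)) {
      # assumed form of the conditional Frechet bounds for P(Y = j, Z = k)
      low <- sum(p.x * pmax(0, p.yx[, j] + p.zx[, k] - 1))
      up  <- sum(p.x * pmin(p.yx[, j], p.zx[, k]))
      width[j, k] <- up - low
    }
  }
  mean(width)  # average width over the J x K cells
}

# Toy tables: an X that fully determines Y versus an uninformative X
tab.x    <- c(20, 20)
tab.xz   <- matrix(c(5, 5, 5, 5), nrow = 2)
xy.exact <- matrix(c(10, 0, 0, 10), nrow = 2)
xy.flat  <- matrix(c(5, 5, 5, 5), nrow = 2)

avg_width_given_x(tab.x, xy.exact, tab.xz)  # 0
avg_width_given_x(tab.x, xy.flat,  tab.xz)  # 0.5
```

In this toy setting the X variable that determines Y leaves no uncertainty, so it would be placed first in an ordering by average width.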

Marcello D'Orazio mdo.statmatch@gmail.com

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). *Statistical Matching: Theory and Practice.* Wiley, Chichester.

D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. *International Journal of Approximate Reasoning*, 90, pp. 433-440.

D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) *Analysis of Integrated Data*, Chapman & Hall/CRC (forthcoming).

`Fbwidths.by.x`, `Frechet.bounds.cat`

    library(StatMatch)  # provides Fbwidths.by.x() and selMtc.by.unc()

    data(quine, package="MASS") # loads quine from MASS
    str(quine)
    quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
    table(quine$c.Days)

    # split quine in two subsets
    suppressWarnings(RNGversion("3.5.0"))
    set.seed(1111)
    lab.A <- sample(nrow(quine), 70, replace=TRUE)
    quine.A <- quine[lab.A, 1:4]
    quine.B <- quine[-lab.A, c(1:3,6)]

    # compute the tables required by Fbwidths.by.x()
    freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
    freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
    freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
    freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

    # apply Fbwidths.by.x()
    bb <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                        tab.xz=freq.xz, warn=FALSE)
    bb$sum.unc

    # apply selMtc.by.unc() without penalty
    cc <- selMtc.by.unc(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
                        tab.xz=freq.xz, corr.d=0)
    cc$ini.ord
    cc$av.df
