Fbwidths.by.x | R Documentation |

This function computes bounds for the cell probabilities in the contingency table Y vs. Z, starting from the marginal tables (**X** vs. Y), (**X** vs. Z) and the joint distribution of the **X** variables, by considering all possible subsets of the **X** variables. This makes it possible to identify which subset of the **X** variables yields the largest reduction in the average width of the conditional bounds.

Fbwidths.by.x(tab.x, tab.xy, tab.xz, deal.sparse="discard", nA=NULL, nB=NULL, ...)

`tab.x` |
A |

`tab.xy` |
A A single categorical Y variable is allowed. One or more categorical variables can be considered as |

`tab.xz` |
A A single categorical Z variable is allowed. One or more categorical variables can be considered as |

`deal.sparse` |
Text, how to estimate the cell relative frequencies when dealing with too sparse tables. When |

`nA` |
Integer, sample size of file A used to estimate |

`nB` |
Integer, sample size of file B used to estimate |

`...` |
Additional arguments that may be required when deriving an estimate of uncertainty by calling |

This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|**X**) and P(Z|**X**) (for details see `Frechet.bounds.cat`), by considering all possible subsets of the **X** variables. This makes it possible to identify the subset of the **X** variables, with the highest association with both Y and Z, that reduces the uncertainty about the distribution of Y vs. Z.
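
As a rough illustration of what "all possible subsets of the **X** variables" means, the following base R sketch enumerates the non-empty subsets of three variable names taken from the example on this page. It is only a sketch of the enumeration, not the function's internal code, and whether the function also reports an unconditional case is not assumed here.

```r
# Names of the X variables (borrowed from the example on this page)
x.vars <- c("Eth", "Sex", "Age")

# All non-empty subsets, from single variables up to the full set
subsets <- unlist(
  lapply(seq_along(x.vars), function(k) combn(x.vars, k, simplify = FALSE)),
  recursive = FALSE
)

# One label per subset, e.g. "Eth+Sex"
labels <- vapply(subsets, paste, character(1), collapse = "+")
print(labels)

# With 3 variables there are 2^3 - 1 = 7 non-empty subsets
length(subsets)
```

The number of subsets grows as 2^Q - 1 with the number Q of **X** variables, so the search becomes expensive when many candidate variables are supplied.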

The uncertainty is measured by the average of the widths of the bounds for the cells in the table Y vs. Z:

*d=(1/(J*K))*sum_(j,k)(p^(up)_(Y=j,Z=k) - p^(low)_(Y=j,Z=k))*

For details see `Frechet.bounds.cat`.
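
The average width can be illustrated with a small base R sketch. It uses the simpler unconditional Frechet bounds (no **X** variables) and made-up marginal distributions, so it only shows how *d* is obtained once lower and upper bounds are available; the conditional bounds computed by the function are not reproduced here.

```r
# Hypothetical marginal distributions of Y (J = 2) and Z (K = 3)
p.y <- c(0.6, 0.4)
p.z <- c(0.5, 0.3, 0.2)

# Unconditional Frechet bounds for each cell (j, k):
#   low = max(0, p.y[j] + p.z[k] - 1),  up = min(p.y[j], p.z[k])
low <- outer(p.y, p.z, function(a, b) pmax(0, a + b - 1))
up  <- outer(p.y, p.z, pmin)

# Average width over the J * K cells
d <- mean(up - low)
d
```

With these illustrative margins the six cell widths sum to 1.8, so *d* equals 0.3.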

Because uncertainty, measured in terms of *av(d)*, tends to decrease when conditioning on a larger number of **X** variables, two penalties are introduced to account for the additional cells to be estimated when adding an X variable. The first penalty, introduced in D'Orazio et al. (2017), is:

*g1 = log(1 + H_Dm/H_DQ )*

where *H_Dm* is the number of cells in the table obtained by crossing the given subset of **X** variables, and *H_DQ* is the number of cells in the table obtained by crossing all the available **X** variables.
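
A minimal sketch of the first penalty, assuming three **X** variables with 2, 2 and 4 categories (the category counts are an assumption made for illustration):

```r
# Assumed numbers of categories for three X variables
n.cat <- c(Eth = 2, Sex = 2, Age = 4)

H.DQ <- prod(n.cat)                   # cells when crossing all X variables: 16
H.Dm <- prod(n.cat[c("Eth", "Sex")])  # cells for the subset {Eth, Sex}: 4

g1      <- log(1 + H.Dm / H.DQ)       # log(1.25) for the subset
g1.full <- log(1 + H.DQ / H.DQ)       # log(2) when all X variables are used
c(g1 = g1, g1.full = g1.full)
```

Since *H_Dm* never exceeds *H_DQ*, the penalty is bounded above by log(2) and grows with the size of the subset.
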
A second penalty takes into account the number of cells to estimate with respect to the sample size (D'Orazio et al., 2019). It is obtained as:

*g2 = max[1/(nA - H_Dm*J), 1/(nB - H_Dm*K)]*

with *nA > H_Dm*J* and *nB > H_Dm*K*. In practice, the penalty compares the number of cells to be estimated with the sample size. This criterion is also used to measure sparseness. In particular, for the purposes of this function, tables are NOT considered sparse when:

*min[nA/(H_Dm*J), nB/(H_Dm*K)] > 1*

This rule is applied when deciding how to proceed with estimation in the case of sparse tables (argument `deal.sparse`).
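
The second penalty and the sparseness rule can be sketched in base R as follows. All input values are hypothetical, and `rule.value` is a name chosen for this illustration rather than one taken from the package.

```r
# Hypothetical inputs
nA <- 70; nB <- 90   # sample sizes of files A and B
J  <- 2;  K  <- 6    # categories of Y and Z
H.Dm <- 4            # cells obtained by crossing the chosen X subset

# The second penalty is only defined when both denominators are positive
stopifnot(nA > H.Dm * J, nB > H.Dm * K)
g2 <- max(1 / (nA - H.Dm * J), 1 / (nB - H.Dm * K))

# Sparseness rule: tables are NOT considered sparse when this value exceeds 1
rule.value <- min(nA / (H.Dm * J), nB / (H.Dm * K))
not.sparse <- rule.value > 1

c(g2 = g2, rule.value = rule.value)
not.sparse
```

With these numbers the larger of 1/62 and 1/66 gives *g2* = 1/62, and the rule evaluates to min(8.75, 3.75) = 3.75, so the tables would not be treated as sparse.
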
Note that sparseness can be measured in different ways. The output also includes the number of empty cells in each table (due to statistical or structural zeros) and Cohen's effect size with respect to a uniform distribution of frequencies across cells (the value 1/no.of.cells in every cell):

*w_eq = (H*sum((p_h - 1/H)^2))^(1/2)*

Values of *w_eq>2* together with *n/H<=1* usually indicate severe sparseness.
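
The effect size above can be computed directly from a vector of cell counts. The helper `w.eq()` below is a hypothetical name introduced for this sketch, and the counts are made up; *H* is taken to be the number of cells and *p_h* the relative frequency of cell *h*.

```r
# Cohen's effect size against the uniform distribution (1/H in every cell)
w.eq <- function(n.h) {
  n.h <- as.vector(n.h)        # also accepts a table or array of counts
  H   <- length(n.h)           # number of cells
  p   <- n.h / sum(n.h)        # relative frequencies
  sqrt(H * sum((p - 1 / H)^2))
}

# Perfectly uniform counts give an effect size of 0
w.uniform <- w.eq(c(10, 10, 10, 10))

# All observations in one of four cells gives sqrt(3)
w.concentrated <- w.eq(c(40, 0, 0, 0))

c(uniform = w.uniform, concentrated = w.concentrated)
```

The ratio *n/H* mentioned above is simply `sum(n.h) / length(n.h)` for the same vector of counts.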

A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the **X** variables. The final component in the list, `sum.unc`, is a data.frame that summarizes the main results. In particular, it reports the number of **X** variables (`"x.vars"`), the number of cells in each of the input tables, and the number of cells with frequency equal to 0 (columns ending with `freq0`). It also reports the value (`"av.n"`) of the rule used to decide whether the table is sparse (see Details) and Cohen's effect size measured for the table crossing the considered combination of the **X** variables.
Finally, it provides the average width of the uncertainty intervals (`"av.width"`), the penalty terms g1 and g2 (`"penalty1"` and `"penalty2"`, respectively), and the penalized average widths (`"av.width.pen1"` and `"av.width.pen2"`, where av.width.pen1=av.width+pen1 and av.width.pen2=av.width+pen2).
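
One possible way to use the summary is to pick the subset with the smallest penalized average width. The sketch below works on a made-up data.frame that mimics only a few of the columns described above; the numbers and the row labels are purely illustrative and are not output of the function.

```r
# Made-up summary with a few sum.unc-like columns (illustrative values only;
# the row labels are hypothetical)
sum.unc <- data.frame(
  x.vars   = c(1, 1, 2),
  av.width = c(0.20, 0.16, 0.15),
  penalty1 = c(0.10, 0.12, 0.25),
  row.names = c("Sex", "Age", "Sex+Age")
)

# Penalized average width, as defined above
sum.unc$av.width.pen1 <- sum.unc$av.width + sum.unc$penalty1

# Subset with the smallest unpenalized and penalized average width
best.raw <- rownames(sum.unc)[which.min(sum.unc$av.width)]
best.pen <- rownames(sum.unc)[which.min(sum.unc$av.width.pen1)]

sum.unc[order(sum.unc$av.width.pen1), ]
c(best.raw = best.raw, best.pen = best.pen)
```

In this toy table the largest subset has the smallest raw width, but the penalty shifts the choice to a single variable, which is the trade-off the penalties are meant to capture.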

Marcello D'Orazio mdo.statmatch@gmail.com

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). *Statistical Matching: Theory and Practice.* Wiley, Chichester.

D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. *International Journal of Approximate Reasoning*, 90, pp. 433-440.

D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) *Analysis of Integrated Data*, Chapman & Hall/CRC (Forthcoming).

`Frechet.bounds.cat`, `harmonize.x`

    library(StatMatch)  # provides Fbwidths.by.x()

    data(quine, package="MASS") # loads quine from MASS
    str(quine)
    quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
    table(quine$c.Days)

    # split quine in two subsets
    suppressWarnings(RNGversion("3.5.0"))
    set.seed(4567)
    lab.A <- sample(nrow(quine), 70, replace=TRUE)
    quine.A <- quine[lab.A, 1:4]
    quine.B <- quine[-lab.A, c(1:3,6)]

    # compute the tables required by Fbwidths.by.x()
    freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
    freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
    freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
    freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)

    # apply Fbwidths.by.x()
    bounds.yz <- Fbwidths.by.x(tab.x=freq.xA+freq.xB,
                               tab.xy=freq.xy, tab.xz=freq.xz)

    # summary of the uncertainty for each subset of the X variables
    bounds.yz$sum.unc
