In StatMatch: Statistical Matching or Data Fusion

Fbwidths.by.x

R Documentation

Computes the Frechet bounds of cells in a contingency table by considering all the possible subsets of the common variables.

Description

This function computes the bounds for cell probabilities in the contingency table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables, by considering all the possible subsets of the X variables. This makes it possible to identify which subset of the X variables produces the largest reduction in the average width of the conditional bounds.

Usage

Fbwidths.by.x(tab.x, tab.xy, tab.xz, deal.sparse="discard", 
          nA=NULL, nB=NULL, ...)

Arguments

`tab.x`	An R table crossing the X variables. This table must be obtained by using the function `xtabs` or `table`, e.g. `tab.x <- xtabs(~x1+x2+x3, data=data.all)`.
`tab.xy`	An R table of the X variables vs. the Y variable. This table must be obtained by using the function `xtabs` or `table`, e.g. `tab.xy <- xtabs(~x1+x2+x3+y, data=data.A)`. Only a single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in `tab.x` must be available in `tab.xy`. Moreover, it is assumed that the joint distribution of the X variables computed from `tab.xy` is equal to `tab.x`; a warning is produced if this is not true.
`tab.xz`	An R table of the X variables vs. the Z variable. This table must be obtained by using the function `xtabs` or `table`, e.g. `tab.xz <- xtabs(~x1+x2+x3+z, data=data.B)`. Only a single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in `tab.x` must be available in `tab.xz`. Moreover, it is assumed that the joint distribution of the X variables computed from `tab.xz` is equal to `tab.x`; a warning is produced if this is not true.
`deal.sparse`	Text, how to estimate the cell relative frequencies when dealing with tables that are too sparse. When `deal.sparse="discard"` (default) no estimation is performed if `tab.xy` or `tab.xz` is too sparse. When `deal.sparse="relfreq"` the standard estimator (cell count divided by the sample size) is used. Note that here sparseness is measured by the number of cells relative to the sample size; sparse tables are those where the number of cells exceeds the sample size (see Details).
`nA`	Integer, sample size of file A used to estimate `tab.xy`. If `NULL`, it is obtained as the sum of the frequencies in `tab.xy`.
`nB`	Integer, sample size of file B used to estimate `tab.xz`. If `NULL`, it is obtained as the sum of the frequencies in `tab.xz`.
`...`	Additional arguments that may be required when deriving an estimate of uncertainty by calling `Frechet.bounds.cat`.

Details

This function computes the Frechet bounds for the frequencies in the contingency table of Y vs. Z, starting from the conditional distributions P(Y|X) and P(Z|X) (for details see Frechet.bounds.cat), by considering all the possible subsets of the X variables. This makes it possible to identify the subset of the X variables, with the highest association with both Y and Z, that reduces the uncertainty concerning the distribution of Y vs. Z.

The uncertainty is measured by the average width of the bounds for the cells in the table Y vs. Z, where J and K are the numbers of categories of Y and Z respectively:

\bar{d} = \frac{1}{J \times K} \sum_{j,k} ( p^{(up)}_{Y=j,Z=k} - p^{(low)}_{Y=j,Z=k} )

For details see Frechet.bounds.cat.
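
As a rough illustration of how \bar{d} can be obtained for one fixed set of X variables, the base R sketch below applies the textbook Frechet bounds conditional on X, i.e. lower bound sum_x p(x) max(0, p(j|x) + p(k|x) - 1) and upper bound sum_x p(x) min(p(j|x), p(k|x)). The probabilities and all object names (p.x, p.y.x, p.z.x, low, up) are hypothetical toy values chosen for this sketch; in the package the computation is carried out by Frechet.bounds.cat, whose internals may differ.

```r
# Toy inputs (hypothetical values, not taken from any data set)
p.x   <- c(0.4, 0.6)                 # P(X = x), two categories of X
p.y.x <- rbind(c(0.7, 0.3),          # P(Y = j | X = x), one row per x
               c(0.2, 0.8))
p.z.x <- rbind(c(0.5, 0.3, 0.2),     # P(Z = k | X = x), one row per x
               c(0.1, 0.3, 0.6))

J <- ncol(p.y.x)                     # number of categories of Y
K <- ncol(p.z.x)                     # number of categories of Z
low <- matrix(0, J, K)
up  <- matrix(0, J, K)

for (x in seq_along(p.x)) {
  s <- outer(p.y.x[x, ], p.z.x[x, ], "+")    # p(j|x) + p(k|x)
  m <- outer(p.y.x[x, ], p.z.x[x, ], pmin)   # min(p(j|x), p(k|x))
  low <- low + p.x[x] * pmax(s - 1, 0)       # conditional lower bound
  up  <- up  + p.x[x] * m                    # conditional upper bound
}

# Average width of the bounds over the J x K cells
av.width <- mean(up - low)
av.width
```

Repeating this for every subset of the X variables and comparing the resulting average widths is, in essence, what the function automates.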

Since uncertainty, measured in terms of \bar{d}, tends to decrease when conditioning on a larger number of X variables, two penalties are introduced to account for the additional cells to be estimated when adding an X variable. The first penalty, introduced in D'Orazio et al. (2017), is:

g_1 = \log\left( 1 + \frac{H_{D_m}}{H_{D_Q}} \right)

where H_{D_m} is the number of cells in the table obtained by crossing the given subset of X variables and H_{D_Q} is the number of cells in the table obtained by crossing all the available X variables. A second penalty takes into account the number of cells to estimate with respect to the sample size (D'Orazio et al., 2019). It is obtained as:
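
To make the first penalty concrete, the following base R sketch evaluates g_1 for subsets of three hypothetical X variables. The variable names, their numbers of categories and the helper penalty1() are assumptions made for this illustration only, and the natural logarithm is assumed.

```r
# Numbers of categories of three hypothetical X variables
n.cat <- c(x1 = 2, x2 = 3, x3 = 4)

# Cells in the table crossing all the available X variables (H_{D_Q})
H.DQ <- prod(n.cat)

# Hypothetical helper: first penalty for a given subset of X variables
penalty1 <- function(vars) {
  H.Dm <- prod(n.cat[vars])   # cells in the table crossing the subset
  log(1 + H.Dm / H.DQ)        # natural log assumed
}

g1.x1  <- penalty1("x1")
g1.all <- penalty1(c("x1", "x2", "x3"))
c(g1.x1 = g1.x1, g1.all = g1.all)
```

The penalty grows with the number of cells in the subset table and reaches log(2) when all the X variables are used.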

g_2 = \max \left[ \frac{1}{n_A - H_{D_m} \times J}, \frac{1}{n_B - H_{D_m} \times K} \right]

with n_A > H_{D_m} \times J and n_B > H_{D_m} \times K. In practice, this penalty compares the number of cells to estimate with the sample size. The same criterion is also used to measure sparseness. In particular, for the purposes of this function, tables are NOT considered sparse when:

\min\left[ \frac{n_A}{H_{D_m} \times J}, \frac{n_B}{H_{D_m} \times K} \right] > 1
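
The second penalty and the sparseness rule can be sketched in base R as follows. The helpers penalty2() and sparse.ratio() and the sample sizes are hypothetical; returning NA when the condition n > number of cells fails is a choice made for this sketch, not necessarily what the package does. The ratio computed by sparse.ratio() is presumably what the output reports as "av.n".

```r
# Hypothetical helper: second penalty, defined only when both sample sizes
# exceed the number of cells to estimate
penalty2 <- function(nA, nB, H.Dm, J, K) {
  cells.xy <- H.Dm * J                 # cells in the table X vs. Y
  cells.xz <- H.Dm * K                 # cells in the table X vs. Z
  if (nA <= cells.xy || nB <= cells.xz) return(NA_real_)
  max(1 / (nA - cells.xy), 1 / (nB - cells.xz))
}

# Hypothetical helper: tables are NOT considered sparse when this exceeds 1
sparse.ratio <- function(nA, nB, H.Dm, J, K) {
  min(nA / (H.Dm * J), nB / (H.Dm * K))
}

# Toy sample sizes and numbers of categories
g2.small  <- penalty2(nA = 70, nB = 80, H.Dm = 4, J = 2, K = 6)
r.small   <- sparse.ratio(nA = 70, nB = 80, H.Dm = 4, J = 2, K = 6)
g2.large  <- penalty2(nA = 70, nB = 80, H.Dm = 16, J = 2, K = 6)
r.large   <- sparse.ratio(nA = 70, nB = 80, H.Dm = 16, J = 2, K = 6)
c(g2.small = g2.small, r.small = r.small, g2.large = g2.large, r.large = r.large)
```

With 4 cells in the X table both input tables are well covered by the samples; with 16 cells the table X vs. Z has 96 cells for 80 observations, so the ratio drops below 1 and the case would be treated as sparse.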

This rule is applied when deciding how to proceed with estimation in the case of a sparse table (argument deal.sparse). Note that sparseness can be measured in different ways. The output also reports the number of empty cells in each table (due to statistical zeros or structural zeros) and Cohen's effect size with respect to the case of a uniform distribution of frequencies across cells (the value 1/no.of.cells in every cell):

\omega_{eq} = \sqrt{H \sum_{h=1}^{H} (\hat{p}_h - 1/H)^2 }

Values of \omega_{eq} jointly with n/H \leq 1 usually indicate severe sparseness.
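
The effect size above is straightforward to compute from a table of counts. The base R sketch below is an illustration only; the helper name cohen.w.unif() and the toy counts are hypothetical.

```r
# Hypothetical helper: Cohen's effect size with respect to the uniform
# distribution (1/H in each of the H cells)
cohen.w.unif <- function(tab) {
  p.hat <- as.vector(tab) / sum(tab)   # estimated relative frequencies
  H <- length(p.hat)                   # number of cells
  sqrt(H * sum((p.hat - 1 / H)^2))
}

w.flat  <- cohen.w.unif(c(10, 10, 10, 10))  # perfectly uniform counts
w.spike <- cohen.w.unif(c(40, 0, 0, 0))     # all counts in a single cell
c(w.flat = w.flat, w.spike = w.spike)
```

The measure is 0 for a perfectly uniform table and increases as the counts concentrate in fewer cells.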

Value

A list with the estimated bounds for the cells in the table of Y vs. Z for each possible subset of the X variables. The final component in the list, sum.unc, is a data.frame that summarizes the main results. In particular, it reports the number of X variables ("x.vars"), the number of cells in each of the input tables and the number of cells with frequency equal to 0 (columns ending with freq0). Moreover, it reports the value ("av.n") of the rule used to decide whether we are dealing with a sparse case (see Details) and Cohen's effect size measured for the table crossing the considered combination of the X variables. Finally, it provides the average width of the uncertainty intervals ("av.width"), the penalty terms g1 and g2 ("penalty1" and "penalty2" respectively), and the penalized average widths ("av.width.pen1" and "av.width.pen2", where av.width.pen1 = av.width + penalty1 and av.width.pen2 = av.width + penalty2).

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

D'Orazio, M., Di Zio, M. and Scanu, M. (2017). “The use of uncertainty to choose matching variables in statistical matching”. International Journal of Approximate Reasoning, 90, pp. 433-440.

D'Orazio, M., Di Zio, M. and Scanu, M. (2019). “Auxiliary variable selection in a statistical matching problem”. In Zhang, L.-C. and Chambers, R. L. (eds.) Analysis of Integrated Data, Chapman & Hall/CRC (Forthcoming).

Examples


# un-comment to run
#
# data(quine, package="MASS") #loads quine from MASS
# str(quine)
# quine$c.Days <- cut(quine$Days, c(-1, seq(0,50,10),100))
# table(quine$c.Days)
# 
# 
# # split quine in two subsets
# suppressWarnings(RNGversion("3.5.0"))
# set.seed(4567)
# lab.A <- sample(nrow(quine), 70, replace=TRUE)
# quine.A <- quine[lab.A, 1:4]
# quine.B <- quine[-lab.A, c(1:3,6)]
# 
# # compute the tables required by Fbwidths.by.x()
# freq.xA <- xtabs(~Eth+Sex+Age, data=quine.A)
# freq.xB <- xtabs(~Eth+Sex+Age, data=quine.B)
# 
# freq.xy <- xtabs(~Eth+Sex+Age+Lrn, data=quine.A)
# freq.xz <- xtabs(~Eth+Sex+Age+c.Days, data=quine.B)
# 
# # apply Fbwidths.by.x()
# bounds.yz <- Fbwidths.by.x(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
#                            tab.xz=freq.xz)
# 
# bounds.yz$sum.unc
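
Once the summary is available, one plausible way to use it is to pick the combination of X variables with the smallest penalized average width. The base R sketch below works on a stand-in data frame, sum.unc.toy, that contains only some of the columns named in the Value section; its row labels and numbers are invented placeholders, not output of Fbwidths.by.x().

```r
# Stand-in for bounds.yz$sum.unc (invented values, illustration only)
sum.unc.toy <- data.frame(
  x.vars   = c(1, 1, 2),
  av.width = c(0.20, 0.18, 0.15),
  penalty1 = log(1 + c(2, 3, 6) / 6),
  row.names = c("x1", "x2", "x1+x2")
)

# Penalized average width, as defined in the Value section
sum.unc.toy$av.width.pen1 <- sum.unc.toy$av.width + sum.unc.toy$penalty1

# Combination with the narrowest unpenalized and penalized bounds
best.raw <- rownames(sum.unc.toy)[which.min(sum.unc.toy$av.width)]
best.pen <- rownames(sum.unc.toy)[which.min(sum.unc.toy$av.width.pen1)]
c(best.raw = best.raw, best.pen = best.pen)
```

In this toy case the unpenalized width favours the largest subset, while the penalized width favours a single variable, which is the trade-off the penalties are meant to expose.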

StatMatch documentation built on April 3, 2025, 10:03 p.m.