# Frechet.bounds.cat: Frechet bounds of cells in a contingency table (StatMatch: Statistical Matching or Data Fusion)


## Frechet bounds of cells in a contingency table

### Description

This function derives bounds for the cell probabilities of the table Y vs. Z, starting from the marginal tables (X vs. Y) and (X vs. Z) and the joint distribution of the X variables.

### Usage

```
Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", align.margins = FALSE,
                   tol= 0.001, warn = TRUE)
```

### Arguments

`tab.x` An R table crossing the X variables. This table must be obtained with the function `xtabs` or `table`, e.g. `tab.x <- xtabs(~x1+x2+x3, data=data.all)`. When `tab.x = NULL`, only `tab.xy` and `tab.xz` must be supplied.

`tab.xy` An R table of the X variables vs. the Y variable. This table must be obtained with the function `xtabs` or `table`, e.g. `tab.xy <- xtabs(~x1+x2+x3+y, data=data.A)`. A single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in `tab.x` must be available in `tab.xy`. Usually, it is assumed that the joint distribution of the X variables computed from `tab.xy` is equal to `tab.x` (a warning appears if any absolute difference is greater than `tol`). When the marginal distribution of X in `tab.xy` is not equal to that of `tab.x`, the two can be aligned (see argument `align.margins`). When `tab.x = NULL`, `tab.xy` should be a one-dimensional table providing the marginal distribution of the Y variable.

`tab.xz` An R table of the X variables vs. the Z variable. This table must be obtained with the function `xtabs` or `table`, e.g. `tab.xz <- xtabs(~x1+x2+x3+z, data=data.B)`. A single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in `tab.x` must be available in `tab.xz`. Usually, it is assumed that the joint distribution of the X variables computed from `tab.xz` is equal to `tab.x` (a warning appears if any absolute difference is greater than `tol`). When the marginal distribution of X in `tab.xz` is not equal to that of `tab.x`, the two can be aligned (see argument `align.margins`). When `tab.x = NULL`, `tab.xz` should be a one-dimensional table providing the marginal distribution of the Z variable.

`print.f` A string: when `print.f="tables"` (default), all the cell estimates are saved as tables in a list. If instead `print.f="data.frame"`, they are saved as columns of a data.frame.

`align.margins` Logical (default `FALSE`). When `TRUE`, the distribution of the X variables in `tab.xy` is aligned with the distribution resulting from `tab.x`, without affecting the marginal distribution of Y. Similarly, the distribution of the X variables in `tab.xz` is aligned with the distribution resulting from `tab.x`, without affecting the marginal distribution of Z. The alignment is performed by running the IPF algorithm as implemented in the function `Estimate` in the package mipfp. To avoid lack of convergence due to combinations of Xs encountered in one table but not in the other (statistical 0s), a small constant (1e-06) is added to the empty cells in `tab.xy` and `tab.xz` before running IPF.

`tol` Tolerance used when comparing the joint distributions of the X variables (default `tol= 0.001`). Estimating the cell bounds requires the distribution of the X variables computed from `tab.xy` and `tab.xz` to be approximately equal to that in `tab.x`; otherwise the estimated cell bounds may be incoherent. When the marginal distributions are not coherent, it is suggested to align them by setting `align.margins=TRUE`.

`warn` Logical; when `TRUE` (default), warnings are returned when the marginal distributions of the X variables show differences greater than `tol`.
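The margin check and the alignment described for `tol` and `align.margins` can be illustrated in base R. The sketch below is only an approximation of the idea: it assumes the check is a maximum absolute difference between relative frequencies, and it uses a hand-written IPF loop rather than the `Estimate` function from mipfp that the package relies on. The toy counts and the object names (`p.x.xy`, `fit`, `needs.alignment`) are made up for illustration.

```r
# Toy counts (hypothetical): one X variable with 3 levels, Y with 2 levels
tab.x  <- as.table(c(a = 40, b = 35, c = 25))
tab.xy <- as.table(matrix(c(10,  6,
                             8,  9,
                             4, 13),
                          nrow = 3, byrow = TRUE,
                          dimnames = list(X = c("a", "b", "c"),
                                          Y = c("y1", "y2"))))

# Compare the X distribution implied by tab.xy with the one in tab.x
tol    <- 0.001
p.x    <- as.numeric(prop.table(tab.x))
p.x.xy <- as.numeric(prop.table(margin.table(tab.xy, 1)))
max.diff <- max(abs(p.x.xy - p.x))
needs.alignment <- max.diff > tol

# Hand-written IPF (assumption: not the mipfp implementation):
# rescale rows to match p.x and columns to keep the Y margin of tab.xy
p.y <- as.numeric(prop.table(margin.table(tab.xy, 2)))
fit <- unclass(prop.table(tab.xy + 1e-06 * (tab.xy == 0)))  # fill empty cells
for (iter in seq_len(200)) {
  fit <- sweep(fit, 1, p.x / rowSums(fit), "*")  # match the X margin
  fit <- sweep(fit, 2, p.y / colSums(fit), "*")  # restore the Y margin
}
round(fit, 4)
```

After the loop, the row sums of `fit` should reproduce the X distribution from `tab.x`, while the column sums keep the Y distribution of the original `tab.xy`.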

### Details

This function computes the expected conditional Frechet bounds for the relative frequencies in the contingency table of Y vs. Z, starting from the distributions P(Y|X), P(Z|X) and P(X). The expected conditional bounds for the relative frequencies p(Y=j,Z=k) in the table Y vs. Z are:

p(Y=j,Z=k) >= sum_i(p(X=i) * max(0; p(Y=j|X=i) + p(Z=k|X=i) - 1) )

p(Y=j,Z=k) <= sum_i(p(X=i) * min(p(Y=j|X=i),p(Z=k|X=i)))

The relative frequencies p(X=i)=n_i/n are computed from the frequencies in `tab.x`;
the relative frequencies p(Y=j|X=i)=n_ij/n_i. are derived from `tab.xy`;
finally, the relative frequencies p(Z=k|X=i)=n_ik/n_i. are derived from `tab.xz`.

Estimation requires that all the starting tables share the same marginal distribution of the X variables.
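As a worked illustration of the two conditional bound formulas, the base R sketch below applies them to small, made-up distributions. The objects `p.x`, `p.y.given.x` and `p.z.given.x` are hypothetical inputs, not objects produced by the package, and the loop is a direct transcription of the formulas rather than the package's internal code.

```r
# Hypothetical inputs: X, Y and Z each have 2 categories
p.x <- c(x1 = 0.6, x2 = 0.4)
# Rows index the X category; each row is a conditional distribution
p.y.given.x <- matrix(c(0.7, 0.3,
                        0.2, 0.8), nrow = 2, byrow = TRUE,
                      dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
p.z.given.x <- matrix(c(0.5, 0.5,
                        0.9, 0.1), nrow = 2, byrow = TRUE,
                      dimnames = list(X = c("x1", "x2"), Z = c("z1", "z2")))

low.cx <- up.cx <- matrix(0, 2, 2,
                          dimnames = list(Y = c("y1", "y2"), Z = c("z1", "z2")))
for (i in seq_along(p.x)) {
  py <- p.y.given.x[i, ]
  pz <- p.z.given.x[i, ]
  # lower bound term: p(X=i) * max(0, p(Y=j|X=i) + p(Z=k|X=i) - 1)
  low.cx <- low.cx + p.x[i] * pmax(outer(py, pz, "+") - 1, 0)
  # upper bound term: p(X=i) * min(p(Y=j|X=i), p(Z=k|X=i))
  up.cx  <- up.cx  + p.x[i] * outer(py, pz, pmin)
}
low.cx
up.cx
```

With these inputs the cell (y1, z1) is bounded between 0.16 and 0.38.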

This function also returns the unconditional bounds for the relative frequencies in the contingency table of Y vs. Z, i.e. the bounds computed without considering the X variables:

max(0;p(Y=j)+p(Z=k)-1) <= p(Y=j,Z=k) <= min(p(Y=j);p(Z=k))

These are the only bounds returned when `tab.x = NULL`.
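The unconditional bounds need only the two marginal distributions. A minimal base R sketch, using made-up marginals `p.y` and `p.z` (hypothetical values, not package output):

```r
# Hypothetical marginal distributions of Y and Z
p.y <- c(y1 = 0.5, y2 = 0.5)
p.z <- c(z1 = 0.66, z2 = 0.34)

# max(0, p(Y=j) + p(Z=k) - 1) <= p(Y=j,Z=k) <= min(p(Y=j), p(Z=k))
low.u <- pmax(outer(p.y, p.z, "+") - 1, 0)
up.u  <- outer(p.y, p.z, pmin)
low.u
up.u
```

Here the cell (y1, z1) lies between 0.16 and 0.5, and the cell (y1, z2) between 0 and 0.34.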

Finally, the contingency table of Y vs. Z estimated under the Conditional Independence Assumption (CIA) is obtained by considering:

p(Y=j,Z=k) = sum_i(p(Y=j|X=i) * p(Z=k|X=i) * p(X=i))

When `tab.x = NULL`, the expected table under the assumption of independence between Y and Z is also provided:

p(Y=j,Z=k) = p(Y=j)*p(Z=k)
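The two point estimates can be sketched in base R as follows. The inputs are made-up distributions, and the sketch assumes that the Y vs. Z table under the CIA is obtained by summing the products over the X categories; it is an illustration, not the package's code.

```r
# Hypothetical inputs: X, Y and Z each have 2 categories
p.x <- c(x1 = 0.6, x2 = 0.4)
p.y.given.x <- matrix(c(0.7, 0.3,
                        0.2, 0.8), nrow = 2, byrow = TRUE,
                      dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
p.z.given.x <- matrix(c(0.5, 0.5,
                        0.9, 0.1), nrow = 2, byrow = TRUE,
                      dimnames = list(X = c("x1", "x2"), Z = c("z1", "z2")))

# Conditional independence: sum over i of p(Y=j|X=i) * p(Z=k|X=i) * p(X=i)
cia <- matrix(0, 2, 2, dimnames = list(Y = c("y1", "y2"), Z = c("z1", "z2")))
for (i in seq_along(p.x)) {
  cia <- cia + p.x[i] * outer(p.y.given.x[i, ], p.z.given.x[i, ])
}

# Plain independence: product of the marginal distributions of Y and Z
p.y <- colSums(p.x * p.y.given.x)  # row i of the matrix is weighted by p.x[i]
p.z <- colSums(p.x * p.z.given.x)
indep <- outer(p.y, p.z)

cia
indep
```

Both tables sum to 1 and share the same margins; they differ in the cells because the CIA table uses the information carried by X.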

Many cells with 0s in the input contingency tables indicate sparseness. This is problematic when estimating the cell relative frequencies needed to derive the bounds, and in such cases the results may be unreliable. A possible alternative is to estimate the required parameters with a pseudo-Bayes estimator (see `pBayes`); in practice, the inputs `tab.x`, `tab.xy` and `tab.xz` should then be the ones provided by the `pBayes` function.

### Value

When `print.f="tables"` (default), a list with the following components:

`low.u` The estimated lower bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

`up.u` The estimated upper bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

`CIA` The estimated relative frequencies in the table Y vs. Z under the Conditional Independence Assumption (CIA).

`low.cx` The estimated lower bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

`up.cx` The estimated upper bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

`uncertainty` The uncertainty associated with the input data, measured as the average width of the uncertainty bounds with and without conditioning on the X variables.

When `print.f="data.frame"`, the output list contains just two components:

`bounds` A data.frame whose columns report the estimated uncertainty bounds.

`uncertainty` The uncertainty associated with the input data, measured as the average width of the uncertainty bounds with and without conditioning on the X variables.
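To make the "average width" idea behind `uncertainty` concrete, the base R sketch below averages the cell-wise differences between upper and lower bounds. It assumes a simple unweighted mean over the cells; the function itself may report additional or differently weighted summaries. The bound matrices are made-up values, and the names `av.u` and `av.cx` are chosen here for illustration.

```r
# Hypothetical bound tables for a 2 x 2 table of Y vs. Z
dn <- list(Y = c("y1", "y2"), Z = c("z1", "z2"))
low.u  <- matrix(c(0.16, 0.00,
                   0.16, 0.00), nrow = 2, byrow = TRUE, dimnames = dn)
up.u   <- matrix(c(0.50, 0.34,
                   0.50, 0.34), nrow = 2, byrow = TRUE, dimnames = dn)
low.cx <- matrix(c(0.16, 0.12,
                   0.28, 0.00), nrow = 2, byrow = TRUE, dimnames = dn)
up.cx  <- matrix(c(0.38, 0.34,
                   0.50, 0.22), nrow = 2, byrow = TRUE, dimnames = dn)

# Average width of the bounds, without and with conditioning on X
av.u  <- mean(up.u - low.u)
av.cx <- mean(up.cx - low.cx)
c(av.u = av.u, av.cx = av.cx)
```

In this toy case conditioning on X narrows the average width from 0.34 to 0.22.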

### Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

### References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006) “Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints”, Journal of Official Statistics, 22, pp. 137–157.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

### See Also

`Fbwidths.by.x`, `harmonize.x`

### Examples

```
library(StatMatch) # provides Frechet.bounds.cat() and comp.prop()
data(quine, package="MASS") #loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:3]
quine.B <- quine[-lab.A, 2:4]

# compute the tables required by Frechet.bounds.cat()
freq.xA <- xtabs(~Sex+Age, data=quine.A)
freq.xB <- xtabs(~Sex+Age, data=quine.B)

freq.xy <- xtabs(~Sex+Age+Eth, data=quine.A)
freq.xz <- xtabs(~Sex+Age+Lrn, data=quine.B)

# apply Frechet.bounds.cat()
bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
tab.xz=freq.xz, print.f="data.frame")
bounds.yz

# harmonize distr. of Sex vs. Age during computations
# in Frechet.bounds.cat()

#compare marg. distribution of Xs in A and B vs. pooled estimate
comp.prop(p1=margin.table(freq.xy,c(1,2)), p2=freq.xA+freq.xB,
n1=nrow(quine.A), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)

# freq.xz comes from quine.B, so n1 is the size of quine.B
comp.prop(p1=margin.table(freq.xz,c(1,2)), p2=freq.xA+freq.xB,
n1=nrow(quine.B), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)

bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
tab.xz=freq.xz, print.f="data.frame", align.margins=TRUE)
bounds.yz

```

StatMatch documentation built on March 18, 2022, 6:55 p.m.