Frechet.bounds.cat: Frechet bounds of cells in a contingency table

Frechet.bounds.catR Documentation

Frechet bounds of cells in a contingency table

Description

This function permits to derive the bounds for cell probabilities of the table Y vs. Z starting from the marginal tables (X vs. Y), (X vs. Z) and the joint distribution of the X variables.

Usage

Frechet.bounds.cat(tab.x, tab.xy, tab.xz, print.f="tables", align.margins = FALSE,
                            tol= 0.001, warn = TRUE) 

Arguments

tab.x

A R table crossing the X variables. This table must be obtained by using the function xtabs or table, e.g.
tab.x <- xtabs(~x1+x2+x3, data=data.all).
When tab.x = NULL then only tab.xy and tab.xz must be supplied.

tab.xy

A R table of X vs. Y variable. This table must be obtained by using the function xtabs or table, e.g.
table.xy <- xtabs(~x1+x2+x3+y, data=data.A).

A single categorical Y variable is allowed. One or more categorical variables can be considered as X variables (common variables). Obviously, the same X variables in tab.x must be available in tab.xy. Usually, it is assumed that the joint distribution of the X variables computed from tab.xy is equal to tab.x (a warning appears if any absolute difference is greater than tol). Note that when marginal distribution of X in tab.xy is not equal to that of tab.x it is possible to ask their alignment (see argument align.margins).

When tab.x = NULL then tab.xy should be a one–dimensional table providing the marginal distribution of the Y variable.

tab.xz

A R table of X vs. Z variable. This table must be obtained by using the function xtabs or table, e.g.
tab.xz <- xtabs(~x1+x2+x3+z, data=data.B).

A single categorical Z variable is allowed. One or more categorical variables can be considered as X variables (common variables). The same X variables in tab.x must be available in tab.xz. Usually, it is assumed that the joint distribution of the X variables computed from tab.xz is equal to tab.x (a warning appears if any absolute difference is greater than tol). Note that when marginal distribution of X in tab.xz is not equal to that of tab.x it is possible to ask their alignment (see argument align.margins).

When tab.x = NULL then tab.xz should be a one–dimensional table providing the marginal distribution of the Z variable.

print.f

A string: when print.f="tables" (default) all the cells' estimates will be saved as tables in a list. On the contrary, if print.f="data.frame", they will be saved as columns of a data.frame.

align.margins

Logical (default FALSE). When when TRUE the distribution of X variables in tab.xy is aligned with the distribution resulting from tab.x, without affecting the marginal distribution of Y. Similarly, the distribution of X variables in tab.xz is aligned with the distribution resulting from tab.x without affecting the marginal distribution of Z. The alignment is performed by running IPF algorithm as implemented in the function Estimate in the package mipfp. Note that to avoid lack of convergence due to combinations of Xs encountered in one table but not in the other (statistical 0s), before running IPF a small constant (1e-06) is added to empty cells in tab.xy and tab.xz.

tol

Tolerance used in comparing joint distributions as far as X variables are considered (default tol= 0.001); estimation of cells bounds would require that distribution of X variables computed from tab.xy and tab.xz should be approximately equal to that in tab.x, on contrary incoherences in estimated cells' bounds could happen. In case of not-coherent marginal distributions it is suggested to get them aligned by setting align.margins=TRUE.

warn

Logical, when TRUE (default) return warnings when marginal distributions of X variables show differences grater than tol.

Details

This function permits to compute the expected conditional Frechet bounds for the relative frequencies in the contingency table of Y vs. Z, starting from the distributions P(Y|X), P(Z|X) and P(X). The expected conditional bounds for the relative frequencies p_{j,k} in the table Y vs. Z are:

p^{(low)}_{Y=j,Z=k} = \sum_{i} p_{X=i}\max (0; p_{Y=j|X=i} + p_{Z=k|X=i} - 1 )

p^{(up)}_{Y=j,Z=k} = \sum_{i} p_{X=i} \min ( p_{Y=j|X=i}; p_{Z=k|X=i})

The relative frequencies p_{X=i}=n_i/n are computed from the frequencies in tab.x;
the relative frequencies p_{Y=j|X=i}=n_{ij}/n_{i+} are derived from tab.xy,
finally, p_{Z=k|X=i}=n_{ik}/n_{i+} are derived from tab.xz.

Estimation requires that all the starting tables share the same marginal distribution of the X variables.

This function returns also the unconditional bounds for the relative frequencies in the contingency table of Y vs. Z, i.e. computed also without considering the X variables:

\max\{0; p_{Y=j} + p_{Z=k} - 1\} \leq p_{Y=j,Z=k} \leq \min \{ p_{Y=j}; p_{Z=k}\}

These bounds represent the unique output when tab.x = NULL.

Finally, the contingency table of Y vs. Z estimated under the Conditional Independence Assumption (CIA) is obtained by considering:

p_{Y=j,Z=k} = p_{Y=j|X=i} \times p_{Z=k|X=i} \times p_{X=i}.

When tab.x = NULL then it is also provided the expected table under the assumption of independence between Y and Z:

p_{Y=j,Z=k} = p_{Y=j} \times p_{Z=k}.

The presence of too many cells with 0s in the input contingency tables is an indication of sparseness; this is an unappealing situation when estimating the cells' relative frequencies needed to derive the bounds; in such cases the corresponding results may be unreliable. A possible alternative way of working consists in estimating the required parameters by considering a pseudo-Bayes estimator (see pBayes); in practice the input tab.x, tab.xy and tab.xz should be the ones provided by the pBayes function.

Value

When print.f="tables" (default) a list with the following components:

low.u

The estimated lower bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

up.u

The estimated upper bounds for the relative frequencies in the table Y vs. Z without conditioning on the X variables.

CIA

The estimated relative frequencies in the table Y vs. Z under the Conditional Independence Assumption (CIA).

low.cx

The estimated lower bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

up.cx

The estimated upper bounds for the relative frequencies in the table Y vs. Z when conditioning on the X variables.

uncertainty

The uncertainty associated to input data, measured in terms of average width of uncertainty bounds with and without conditioning on the X variables.

When print.f="data.frame" the output list contains just two components:

bounds

A data.frame whose columns reports the estimated uncertainty bounds.

uncertainty

The uncertainty associated to input data, measured in terms of average width of uncertainty bounds with and without conditioning on the X variables.

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006) “Statistical Matching for Categorical Data: Displaying Uncertainty and Using Logical Constraints”, Journal of Official Statistics, 22, pp. 137–157.

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

See Also

Fbwidths.by.x, harmonize.x

Examples


data(quine, package="MASS") #loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, 1:3]
quine.B <- quine[-lab.A, 2:4]

# compute the tables required by Frechet.bounds.cat()
freq.xA <- xtabs(~Sex+Age, data=quine.A)
freq.xB <- xtabs(~Sex+Age, data=quine.B)

freq.xy <- xtabs(~Sex+Age+Eth, data=quine.A)
freq.xz <- xtabs(~Sex+Age+Lrn, data=quine.B)

# apply Frechet.bounds.cat()
bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
        tab.xz=freq.xz, print.f="data.frame")
bounds.yz

# harmonize distr. of Sex vs. Age during computations
# in Frechet.bounds.cat()

#compare marg. distribution of Xs in A and B vs. pooled estimate
comp.prop(p1=margin.table(freq.xy,c(1,2)), p2=freq.xA+freq.xB, 
          n1=nrow(quine.A), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)

comp.prop(p1=margin.table(freq.xz,c(1,2)), p2=freq.xA+freq.xB, 
          n1=nrow(quine.A), n2=nrow(quine.A)+nrow(quine.B), ref=TRUE)

bounds.yz <- Frechet.bounds.cat(tab.x=freq.xA+freq.xB, tab.xy=freq.xy,
        tab.xz=freq.xz, print.f="data.frame", align.margins=TRUE)
bounds.yz


StatMatch documentation built on May 29, 2024, 2:15 a.m.