# comp.prop: Compares two distributions of the same categorical variable

In StatMatch: Statistical Matching or Data Fusion

 comp.prop R Documentation

## Compares two distributions of the same categorical variable

### Description

This function compares two (estimated) distributions of the same categorical variable(s).

### Usage

```r
comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)
```

### Arguments

| Argument | Description |
|---|---|
| `p1` | A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function `xtabs` or `table`. |
| `p2` | A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function `xtabs` or `table`. If `ref = FALSE`, `p2` is a further estimate of the distribution of the categorical variable(s) being considered. Otherwise (`ref = TRUE`), it is the 'reference' distribution (the distribution considered true or a reliable estimate). |
| `n1` | The size of the sample on which `p1` has been estimated. |
| `n2` | The size of the sample on which `p2` has been estimated; required only when `ref = FALSE` (`p2` is estimated on another sample and is not the reference distribution). |
| `ref` | Logical. When `ref = TRUE`, `p2` is the reference distribution (the true distribution or a reliable estimate of it); when `ref = FALSE`, it is an estimate of the distribution derived from another sample of size `n2`. |

### Details

This function computes some similarity or dissimilarity measures between the marginal (or joint) distributions of categorical variable(s). The following measures are considered:

Dissimilarity index or total variation distance:

D = (1/2) * sum_j |p_1,j - p_2,j|

where p_s,j are the relative frequencies (0 <= p_s,j <= 1). The dissimilarity index ranges from 0 (minimum dissimilarity) to 1 (maximum dissimilarity). It can be interpreted as the smallest fraction of units that need to be reclassified in order to make the distributions equal. When `p2` is the reference distribution (true or expected distribution under a given hypothesis) then, following Agresti's rule of thumb (Agresti 2002, pp. 329–330), values of D <= 0.03 denote that the estimated distribution `p1` follows the true or expected pattern quite closely.
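The dissimilarity index can be sketched directly in base R. The two vectors below are hypothetical relative frequencies made up for illustration; counts from `table` or `xtabs` would first need to be turned into proportions, for example with `prop.table`.

```r
# Hypothetical relative frequencies for a 4-category variable
p1 <- c(0.10, 0.20, 0.30, 0.40)
p2 <- c(0.15, 0.25, 0.25, 0.35)

# Dissimilarity index (total variation distance):
# half the sum of the absolute differences between the two distributions
D <- 0.5 * sum(abs(p1 - p2))
D

# Agresti's rule of thumb applies when p2 is a reference distribution
D <= 0.03
```

Here each category differs by 0.05, so the index is well above the 0.03 threshold.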

Overlap between two distributions:

O = sum_j min(p_1,j , p_2,j)

It is a measure of similarity which ranges from 0 (no overlap) to 1 (the distributions are equal). It is worth noting that O = 1 - D.
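As a quick numerical check of the relation O = 1 - D, the sketch below computes both quantities in base R on hypothetical relative frequencies (the vectors are illustrative, not taken from the package).

```r
# Hypothetical relative frequencies for a 4-category variable
p1 <- c(0.10, 0.20, 0.30, 0.40)
p2 <- c(0.15, 0.25, 0.25, 0.35)

# Overlap: sum of the category-wise minima
O <- sum(pmin(p1, p2))

# Dissimilarity index, for comparison
D <- 0.5 * sum(abs(p1 - p2))

# The two measures are complementary (up to floating-point error)
isTRUE(all.equal(O, 1 - D))
```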

Bhattacharyya coefficient:

B = sum_j sqrt(p_1,j * p_2,j)

It is a measure of similarity and ranges from 0 (the distributions have no category in common) to 1 (the distributions are equal).

Hellinger's distance:

d_H = sqrt(1 - B)

It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure (0 <= d_H <= 1; symmetry and triangle inequality). Hellinger's distance is related to the dissimilarity index, and it is possible to show that:

d_H^2 <= D <= d_H * sqrt(2)
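The Bhattacharyya coefficient, Hellinger's distance and the bounds above can be verified numerically. This is a minimal base R sketch on hypothetical relative frequencies; the `max(0, ...)` guard is an addition of this example to protect against tiny negative values caused by rounding when the two distributions coincide.

```r
# Hypothetical relative frequencies for a 4-category variable
p1 <- c(0.10, 0.20, 0.30, 0.40)
p2 <- c(0.15, 0.25, 0.25, 0.35)

# Bhattacharyya coefficient
B <- sum(sqrt(p1 * p2))

# Hellinger's distance (guarded against negative rounding error)
d_H <- sqrt(max(0, 1 - B))

# Dissimilarity index, needed for the bounds
D <- 0.5 * sum(abs(p1 - p2))

# Check d_H^2 <= D <= d_H * sqrt(2)
c(lower = d_H^2 <= D, upper = D <= d_H * sqrt(2))
```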

Alongside these similarity/dissimilarity measures, Pearson's Chi-squared statistic is computed. Two formulas are considered. When `p2` is the reference distribution (true or expected under some hypothesis, `ref=TRUE`):

Chi_P = n_1 * sum_j (p_1,j - p_2,j)^2/(p_2,j)
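The reference-distribution formula can be reproduced in base R as follows. The frequencies and the sample size are hypothetical; the cross-check relies on the goodness-of-fit form of `chisq.test` from the `stats` package, which should return the same statistic when given the counts `n1 * p1` and the reference probabilities.

```r
# Hypothetical estimated distribution, reference distribution and sample size
p1 <- c(0.10, 0.20, 0.30, 0.40)  # estimated on a sample of size n1
p2 <- c(0.15, 0.25, 0.25, 0.35)  # reference distribution
n1 <- 200

# Pearson's Chi-squared against a reference distribution
chi_ref <- n1 * sum((p1 - p2)^2 / p2)

# Cross-check with the goodness-of-fit test on the implied counts
chi_check <- unname(chisq.test(x = n1 * p1, p = p2)$statistic)
isTRUE(all.equal(chi_ref, chi_check))
```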

When `p2` is a distribution estimated on a second sample then:

Chi_P = sum_i sum_j n_i * (p_i,j - p_+,j)^2/(p_+,j)

where p_+,j is the expected frequency for category j, obtained as follows:

p_+,j = (n_1*p_1,j + n_2*p_2,j)/(n_1+n_2)

with n_1 and n_2 being the sizes of the two samples.
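The two-sample formula can be sketched in base R as shown below. The frequencies and sample sizes are hypothetical; the comparison with `chisq.test` on the 2 x J table of implied counts is a sanity check added for this example, since the pooled proportions play the role of the expected column proportions in the usual test of homogeneity.

```r
# Hypothetical distributions estimated on two independent samples
p1 <- c(0.10, 0.20, 0.30, 0.40); n1 <- 200
p2 <- c(0.15, 0.25, 0.25, 0.35); n2 <- 300

# Pooled (expected) relative frequencies
p_pool <- (n1 * p1 + n2 * p2) / (n1 + n2)

# Pearson's Chi-squared for two samples
chi_2s <- n1 * sum((p1 - p_pool)^2 / p_pool) +
  n2 * sum((p2 - p_pool)^2 / p_pool)

# Cross-check with the test of homogeneity on the implied counts
counts <- rbind(n1 * p1, n2 * p2)
chi_check <- unname(chisq.test(counts, correct = FALSE)$statistic)
isTRUE(all.equal(chi_2s, chi_check))
```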

The Chi-square value can be used to test the hypothesis that the two distributions are equal (df = J-1). Unfortunately, such a test is not useful when the distributions are estimated from samples selected from a finite population using complex selection schemes (stratification, clustering, etc.). In such cases, alternative corrected Chi-square tests are available (cf. Sarndal et al., 1992, Sec. 13.5). One possibility consists in dividing Pearson's Chi-square statistic by the generalised design effect of both surveys. Its estimation is not straightforward (the sampling design variables need to be available). Generally speaking, the generalised design effect is smaller than 1 in the presence of stratified random sampling designs, while it exceeds 1 in the presence of a two-stage cluster sampling design. For the purposes of analysis, the function reports the value of the generalised design effect g that would determine the acceptance of the null hypothesis (equality of distributions) in the case of alpha = 0.05 (df = J-1), i.e. values of g such that

Chi_P/g <= Chi_(J-1,0.05)
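Solving the inequality for the design effect gives a boundary value equal to the Chi-square statistic divided by the critical value. The sketch below illustrates this in base R; the statistic `chi_p` and the number of categories `J` are hypothetical, and `g_bound` is a name chosen for this example rather than the label used in the function's output.

```r
# Hypothetical Pearson's Chi-squared value and number of categories
chi_p <- 8.76
J <- 4

# Critical value of the Chi-squared distribution for alpha = 0.05
q <- qchisq(0.95, df = J - 1)

# Boundary value of the generalised design effect:
# any design effect at or above it satisfies chi_p / g <= q
g_bound <- chi_p / q
g_bound

# A design effect below the boundary leads to rejection, one above to acceptance
c(below = chi_p / (0.9 * g_bound) <= q, above = chi_p / (1.1 * g_bound) <= q)
```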

### Value

A `list` object with two or three components depending on the argument `ref`.

| Component | Description |
|---|---|
| `meas` | A vector with the measures of similarity/dissimilarity between the distributions: dissimilarity index (`"tvd"`), overlap (`"overlap"`), Bhattacharyya coefficient (`"Bhatt"`) and Hellinger's distance (`"Hell"`). |
| `chi.sq` | A vector with the following values: Pearson's Chi-square (`"Pearson"`), the degrees of freedom (`"df"`), the percentile of a Chi-squared distribution (`"q0.05"`) and the largest admissible value of the generalised design effect that would determine the acceptance of H0 (equality of distributions). |
| `p.exp` | When `ref=FALSE`, the value of the reference distribution p_+,j used in deriving the Chi-square statistic and also the dissimilarity index. Otherwise (`ref=TRUE`), it is set equal to the argument `p2`. |

### Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

### References

Agresti A (2002) Categorical Data Analysis. Second Edition. Wiley, New York.

Sarndal CE, Swensson B, Wretman JH (1992) Model Assisted Survey Sampling. Springer–Verlag, New York.

### Examples

```r
library(StatMatch) # provides comp.prop
data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age")]

# compare est. distributions from 2 samples
# 1 variable
tt.A <- xtabs(~Age, data=quine.A)
tt.B <- xtabs(~Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# joint distr. of more variables
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.B <- xtabs(~Eth+Sex+Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# compare est. distr. with one considered as reference
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.all <- xtabs(~Eth+Sex+Age, data=quine)
comp.prop(p1=tt.A, p2=tt.all, n1=nrow(quine.A), n2=NULL, ref=TRUE)

```

StatMatch documentation built on March 18, 2022, 6:55 p.m.