comp.prop (R Documentation)

This function compares two (estimated) distributions of the same categorical variable(s).

comp.prop(p1, p2, n1, n2=NULL, ref=FALSE)

`p1`
A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function `xtabs` (as in the Examples below).

`p2`
A vector or an array containing relative or absolute frequencies for one or more categorical variables. Usually it is the output of the function `xtabs` (as in the Examples below).

`n1`
The size of the sample on which the distribution `p1` has been estimated.

`n2`
The size of the sample on which the distribution `p2` has been estimated. Not required (`NULL`, the default) when `p2` is a reference distribution (`ref=TRUE`).

`ref`
Logical. When `ref=TRUE`, the distribution `p2` is taken as the reference (true or expected) distribution; when `ref=FALSE` (the default), `p2` is treated as a distribution estimated on a second sample.

This function computes some similarity and dissimilarity measures between the marginal (joint) distributions of categorical variable(s). The following measures are considered:

*Dissimilarity index* or *total variation distance*:

*D = (1/2) * sum_j |p_1,j - p_2,j| *

where *p_s,j* (*s = 1, 2*) are the relative frequencies (*0 <= p_s,j <= 1*). The dissimilarity index ranges from 0 (minimum dissimilarity) to 1. It can be interpreted as the smallest fraction of units that need to be reclassified in order to make the distributions equal. When `p2` is the reference distribution (true or expected distribution under a given hypothesis), then, following Agresti's rule of thumb (Agresti 2002, pp. 329-330), values of *D <= 0.03* denote that the estimated distribution `p1` follows the true or expected pattern quite closely.
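As a quick sketch with hypothetical frequencies (not taken from any real data), the index is a one-liner in R:

```r
# Hypothetical relative frequencies over J = 3 categories
p1 <- c(0.20, 0.50, 0.30)   # estimated distribution
p2 <- c(0.25, 0.45, 0.30)   # reference distribution

# Dissimilarity index (total variation distance)
D <- 0.5 * sum(abs(p1 - p2))
D   # 0.05: above Agresti's 0.03 rule of thumb
```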

*Overlap* between two distributions:

*O = sum_j min(p_1,j , p_2,j) *

It is a measure of similarity which ranges from 0 to 1 (the distributions are equal). It is worth noting that *O = 1 - D*.

*Bhattacharyya coefficient*:

*B = sum_j sqrt(p_1,j * p_2,j) *

It is a measure of similarity and ranges from 0 to 1 (the distributions are equal).

*Hellinger's distance*:

*d_H = sqrt(1 - B) *

It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (max dissimilarity). It satisfies all the properties of a distance measure (*0 <= d_H <= 1*; symmetry and triangle inequality).
Hellinger's distance is related to the dissimilarity index, and it is possible to show that:

* d_H^2 <= D <= d_H * sqrt(2) *
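All four measures, and the two relations just stated, can be checked directly on a small hypothetical example (illustrative values only, not package output):

```r
# Hypothetical relative frequencies over J = 3 categories
p1 <- c(0.20, 0.50, 0.30)
p2 <- c(0.25, 0.45, 0.30)

D  <- 0.5 * sum(abs(p1 - p2))   # dissimilarity index
O  <- sum(pmin(p1, p2))         # overlap
B  <- sum(sqrt(p1 * p2))        # Bhattacharyya coefficient
dH <- sqrt(1 - B)               # Hellinger's distance

isTRUE(all.equal(O, 1 - D))     # TRUE: O = 1 - D
dH^2 <= D && D <= dH * sqrt(2)  # TRUE: the bounds relating d_H and D
```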

Alongside these similarity/dissimilarity measures, the Pearson's Chi-squared statistic is computed. Two formulas are considered. When `p2` is the reference distribution (true or expected under some hypothesis, `ref=TRUE`):

* Chi_P = n_1 * sum_j (p_1,j - p_2,j)^2/(p_2,j) *
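For instance, with hypothetical values (`p2` playing the role of the reference distribution):

```r
# Hypothetical: p1 estimated on n1 units, p2 taken as the reference
p1 <- c(0.20, 0.50, 0.30)
p2 <- c(0.25, 0.45, 0.30)
n1 <- 70

chi.p <- n1 * sum((p1 - p2)^2 / p2)
chi.p   # about 1.089
```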

When `p2` is a distribution estimated on a second sample (`ref=FALSE`), then:

* Chi_P = sum_i sum_j n_i * (p_i,j - p_+,j)^2/(p_+,j) *

where *p_+,j* is the expected frequency for category *j*, obtained as follows:

* p_+,j = (n_1*p_1,j + n_2*p_2,j)/(n_1+n_2) *

where *n_1* and *n_2* are the sizes of the two samples.
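This two-sample form coincides with the usual Pearson Chi-squared statistic computed on the 2 x J contingency table of counts, as a small sketch with hypothetical counts shows:

```r
# Hypothetical counts for J = 3 categories in two samples
cnt1 <- c(14, 35, 21); n1 <- sum(cnt1)   # n1 = 70
cnt2 <- c(23, 30, 23); n2 <- sum(cnt2)   # n2 = 76
p1 <- cnt1 / n1
p2 <- cnt2 / n2

# Expected frequencies p_+,j pooled across the two samples
p.pool <- (n1 * p1 + n2 * p2) / (n1 + n2)
chi.p  <- n1 * sum((p1 - p.pool)^2 / p.pool) +
          n2 * sum((p2 - p.pool)^2 / p.pool)

# Same value as the textbook Chi-squared on the 2 x J table
chisq.test(rbind(cnt1, cnt2), correct = FALSE)$statistic
```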

The Chi-squared value can be used to test the hypothesis that the two distributions are equal (*df = J-1*). Unfortunately, such a test would not be useful when the distributions are estimated from samples selected from a finite population using complex selection schemes (stratification, clustering, etc.). In such cases, different corrected Chi-squared tests are available (cf. Sarndal et al., 1992, Sec. 13.5). One possibility consists in dividing the Pearson's Chi-squared statistic by the *generalised design effect* of both surveys. Its estimation is not straightforward (sampling design variables need to be available). Generally speaking, the generalised design effect is smaller than 1 in the presence of stratified random sampling designs, while it exceeds 1 in the presence of a two-stage cluster sampling design. For the purposes of analysis, the value of the generalised design effect *g* that would determine the acceptance of the null hypothesis (equality of distributions) at *alpha=0.05* (*df = J-1*) is reported, i.e. values of *g* such that

* Chi_P/g <= Chi_(J-1, 0.05) *
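Under this convention, the reported *g* is just the ratio of the Chi-squared statistic to the 0.05-level critical value; a sketch with a hypothetical Chi-squared value:

```r
# Hypothetical Pearson Chi-squared value over J = 4 categories
chi.p <- 8.2
J <- 4

# Smallest g leading to acceptance of equality at alpha = 0.05
g <- chi.p / qchisq(0.95, df = J - 1)
g   # about 1.05: any design effect >= g would lead to acceptance
```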

A `list` object with two or three components, depending on the argument `ref`.

`meas`
A vector with the measures of similarity/dissimilarity between the distributions: the dissimilarity index (total variation distance), the overlap, the Bhattacharyya coefficient and Hellinger's distance.

`chi.sq`
A vector with the following values: the Pearson's Chi-squared statistic, its degrees of freedom (*J-1*) and the value of the generalised design effect *g* described in the Details.

`p.exp`
When `ref=FALSE`, the expected (pooled) frequencies *p_+,j* used in the Chi-squared computation (see Details).

Marcello D'Orazio mdo.statmatch@gmail.com

Agresti A (2002) *Categorical Data Analysis. Second Edition*. Wiley, New York.

Sarndal CE, Swensson B, Wretman JH (1992) *Model Assisted Survey Sampling*. Springer-Verlag, New York.

data(quine, package="MASS") # loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age")]

# compare est. distributions from 2 samples
# 1 variable
tt.A <- xtabs(~Age, data=quine.A)
tt.B <- xtabs(~Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# joint distr. of more variables
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.B <- xtabs(~Eth+Sex+Age, data=quine.B)
comp.prop(p1=tt.A, p2=tt.B, n1=nrow(quine.A), n2=nrow(quine.B), ref=FALSE)

# compare an est. distribution with one considered as reference
tt.A <- xtabs(~Eth+Sex+Age, data=quine.A)
tt.all <- xtabs(~Eth+Sex+Age, data=quine)
comp.prop(p1=tt.A, p2=tt.all, n1=nrow(quine.A), n2=NULL, ref=TRUE)
