comp.cont: Compares two distributions of the same continuous variable
In StatMatch: Statistical Matching or Data Fusion

comp.cont

R Documentation

Compares two distributions of the same continuous variable

Description

This function estimates the “closeness” of the distributions of the same continuous variable(s) estimated from different data sources.

Usage

comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, 
          w.B = NULL, ref = FALSE)

Arguments

`data.A`	A data frame or matrix containing the variable of interest `xlab.A` and, if available, the associated survey weights `w.A`.
`data.B`	A data frame or matrix containing the variable of interest `xlab.B` and, if available, the associated survey weights `w.B`.
`xlab.A`	Character string providing the name of the variable in `data.A` whose estimated distribution should be compared with that estimated from `data.B`.
`xlab.B`	Character string providing the name of the variable in `data.B` whose distribution should be compared with that estimated from `data.A`. If `xlab.B=NULL` (default), it is assumed that `xlab.B=xlab.A`.
`w.A`	Character string providing the name of the optional weighting variable in `data.A` that, if given, is used to estimate the distribution of `xlab.A`.
`w.B`	Character string providing the name of the optional weighting variable in `data.B` that, if given, is used to estimate the distribution of `xlab.B`.
`ref`	Logical. When `ref = TRUE`, the distribution of `xlab.B` estimated from `data.B` is considered the reference distribution (true or reliable estimate of the distribution).
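As an illustration of how the arguments fit together, the following is a minimal sketch of an unweighted call on simulated data. The data frames `dat.A` and `dat.B` and the variable name `income` are hypothetical and chosen only for this example; the call is guarded so that the snippet still runs when StatMatch (or Hmisc, which the Details mention for weighted quantiles) is not installed.

```r
# Simulated data: two samples of the same (hypothetical) variable "income"
set.seed(123)
dat.A <- data.frame(income = rlnorm(200, meanlog = 10.0, sdlog = 0.5))
dat.B <- data.frame(income = rlnorm(400, meanlog = 10.1, sdlog = 0.5))

# Run the comparison only if the required packages are available
if (requireNamespace("StatMatch", quietly = TRUE) &&
    requireNamespace("Hmisc", quietly = TRUE)) {
  # xlab.B is left at its default (NULL), so it is taken equal to xlab.A
  res <- StatMatch::comp.cont(data.A = dat.A, data.B = dat.B,
                              xlab.A = "income")
  print(names(res))  # names of the components of the returned list
}
```

To use survey weights, one would add a weight column to each data frame and pass its name through `w.A` and `w.B`.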

Details

This function calculates well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available data. It also compares the quantiles estimated from data.A with those estimated from data.B and returns the average of the absolute differences and the average of the squared differences. Note that the number of estimated percentiles depends on the smaller of the two sample sizes, n.A and n.B. Only quartiles are calculated if min(n.A, n.B)<=50; quintiles are estimated if min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated if min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally, quantiles for probs=seq(from = 0.05,to = 0.95,by = 0.05) are estimated when min(n.A, n.B)>250. If survey weights are available (indicated by w.A and/or w.B), they are used to estimate the quantiles by calling the function wtd.quantile in the package Hmisc.
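The sample-size rule and the two quantile summaries described above can be sketched in base R as follows. This is an unweighted illustration only, not the package's code: the helper names `pick_probs` and `quantile_diffs` are hypothetical, and the default `quantile()` type is an assumption that may differ from what comp.cont uses internally.

```r
# Choose the probability levels from the smaller of the two sample sizes
pick_probs <- function(n.A, n.B) {
  n.min <- min(n.A, n.B)
  if (n.min <= 50) {
    c(0.25, 0.50, 0.75)               # quartiles
  } else if (n.min <= 150) {
    seq(0.2, 0.8, by = 0.2)           # quintiles
  } else if (n.min <= 250) {
    seq(0.1, 0.9, by = 0.1)           # deciles
  } else {
    seq(0.05, 0.95, by = 0.05)        # every 5th percentile
  }
}

# Average absolute and average squared difference between the quantiles
quantile_diffs <- function(x.A, x.B) {
  probs <- pick_probs(length(x.A), length(x.B))
  d <- unname(quantile(x.A, probs) - quantile(x.B, probs))
  c(avg.abs = mean(abs(d)), avg.sq = mean(d^2))
}

set.seed(1)
quantile_diffs(rnorm(120, mean = 0), rnorm(300, mean = 0.3))
```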

The dissimilarities between the estimated empirical cumulative distribution functions are also calculated. The measures considered are the maximum value of the differences, the sum of the absolute values of the minimum and maximum differences, and the average of the absolute differences. If weights are given, they are used in the estimation of the empirical cumulative distribution function. Note that when ref=TRUE, the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.
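A rough, unweighted sketch of these three measures in base R is given below. It rests on assumptions that are not confirmed by this page: both empirical distribution functions are evaluated on the pooled set of observed values, and the "maximum value of the differences" is read as the maximum absolute difference. The helper name `ecdf_dissimilarity` is hypothetical.

```r
# Dissimilarities between two empirical cumulative distribution functions
ecdf_dissimilarity <- function(x.A, x.B) {
  F.A <- ecdf(x.A)
  F.B <- ecdf(x.B)
  # Assumption: evaluate both step functions on the pooled observed values
  grid <- sort(unique(c(x.A, x.B)))
  d <- F.A(grid) - F.B(grid)
  c(max.abs.diff         = max(abs(d)),
    abs.min.plus.abs.max = abs(min(d)) + abs(max(d)),
    avg.abs.diff         = mean(abs(d)))
}

set.seed(2)
ecdf_dissimilarity(rnorm(200), rnorm(300, mean = 0.5))
```

All three values are 0 when the two samples are identical and grow as the two distribution functions move apart.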

Finally, the total variation distance, the overlap and the Hellinger distance are calculated on the categorised version of the variable. The breaks used to categorise the variable are decided according to the Freedman-Diaconis rule (nclass); with ref=TRUE the IQR is estimated on data.B alone, whereas with ref=FALSE it is estimated by combining the two data sources. If present, the weights are used to estimate the relative frequencies of the categorised variable.

The total variation distance:

\Delta_{AB} = \frac{1}{2} \sum_{j=1}^J \left| p_{A,j} - p_{B,j} \right|

where p_{s,j} are the relative frequencies of category j estimated from data source s (s = A, B), with 0 \leq p_{s,j} \leq 1. The total variation distance ranges from 0 (minimum dissimilarity) to 1 (maximum dissimilarity). Its complement to 1 is called the “overlap” between the distributions.
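The categorisation step and the total variation distance can be sketched in base R as below. This is an unweighted illustration under stated assumptions, not the package's implementation: the number of classes comes from `grDevices::nclass.FD`, equal-width breaks are spread over the pooled range of the two samples, and the helper names `bin_freqs` and `tvd` are hypothetical.

```r
# Relative frequencies of both samples on a common set of classes
bin_freqs <- function(x.A, x.B, ref = FALSE) {
  # ref = TRUE: number of classes driven by x.B alone; otherwise by pooled data
  base <- if (ref) x.B else c(x.A, x.B)
  k <- grDevices::nclass.FD(base)          # Freedman-Diaconis number of classes
  rng <- range(c(x.A, x.B))
  brks <- seq(rng[1], rng[2], length.out = k + 1)
  f.A <- table(cut(x.A, brks, include.lowest = TRUE))
  f.B <- table(cut(x.B, brks, include.lowest = TRUE))
  list(p.A = as.numeric(f.A) / sum(f.A),
       p.B = as.numeric(f.B) / sum(f.B))
}

# Total variation distance between two vectors of relative frequencies
tvd <- function(p.A, p.B) 0.5 * sum(abs(p.A - p.B))

set.seed(3)
p <- bin_freqs(rnorm(300, mean = 10, sd = 2), rnorm(500, mean = 10.5, sd = 2))
delta   <- tvd(p$p.A, p$p.B)
overlap <- 1 - delta
c(tvd = delta, overlap = overlap)
```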

The Hellinger distance:

d_{H,AB} = \sqrt{ \frac{1}{2} \sum_{j=1}^J \left( \sqrt{p_{A,j}} - \sqrt{p_{B,j}} \right)^2 }

It is a dissimilarity measure ranging from 0 (distributions are equal) to 1 (maximum dissimilarity). It satisfies all the properties of a distance measure (0 \leq d_{H,AB} \leq 1; symmetry and triangle inequality). The Hellinger distance is related to the total variation distance, and it is possible to show that:

d_{H,AB}^2 \leq \Delta_{AB} \leq d_{H,AB}\sqrt{2}
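As a quick numerical check of the two formulas and of the inequality above, the following sketch computes both distances for two small vectors of relative frequencies. The vectors `p.A` and `p.B` are made-up values for illustration, and the helper names `tvd` and `hellinger` are hypothetical.

```r
# Total variation distance and Hellinger distance for relative frequencies
tvd       <- function(p, q) 0.5 * sum(abs(p - q))
hellinger <- function(p, q) sqrt(0.5 * sum((sqrt(p) - sqrt(q))^2))

# Made-up relative frequencies over J = 4 categories (each vector sums to 1)
p.A <- c(0.10, 0.20, 0.40, 0.30)
p.B <- c(0.25, 0.25, 0.25, 0.25)

delta <- tvd(p.A, p.B)
d.H   <- hellinger(p.A, p.B)

# The bounds d.H^2 <= delta <= d.H * sqrt(2) should both hold
c(lower = d.H^2, tvd = delta, upper = d.H * sqrt(2))
```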

Value

A list object with four components.

`summary`	A matrix with summaries of `xlab.A` estimated on `data.A` and summaries of `xlab.B` estimated on `data.B`.
`diff.Qs`	Averages of the absolute and of the squared differences between the quantiles of `xlab.A` estimated on `data.A` and the corresponding ones of `xlab.B` estimated on `data.B`.
`dist.ecdf`	Dissimilarity measures between the estimated empirical cumulative distribution functions.
`dist.discr`	Distance between the distributions after discretization of the target variable.

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407–424.
