# comp.cont: Empirical comparison of two estimated distributions of the... In StatMatch: Statistical Matching or Data Fusion

 comp.cont R Documentation

## Empirical comparison of two estimated distributions of the same continuous variable

### Description

This function assesses the “closeness” of the distributions of the same continuous variable when they are estimated from different data sources.

### Usage

```r
comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, w.B = NULL, ref = FALSE)
```

### Arguments

| Argument | Description |
|---|---|
| `data.A` | A data frame or matrix containing the variable of interest `xlab.A` and, optionally, the survey weights `w.A`. |
| `data.B` | A data frame or matrix containing the variable of interest `xlab.B` and, optionally, the associated survey weights `w.B`. |
| `xlab.A` | Character string providing the name of the variable in `data.A` whose estimated distribution should be compared with that estimated from `data.B`. |
| `xlab.B` | Character string providing the name of the variable in `data.B` whose distribution should be compared with that estimated from `data.A`. If `xlab.B=NULL` (default), it is assumed that `xlab.B=xlab.A`. |
| `w.A` | Character string providing the name of the optional weighting variable in `data.A` that, if supplied, is used to estimate the distribution of `xlab.A`. |
| `w.B` | Character string providing the name of the optional weighting variable in `data.B` that, if supplied, is used to estimate the distribution of `xlab.B`. |
| `ref` | Logical. When `ref = TRUE`, the distribution of `xlab.B` estimated from `data.B` is considered the reference distribution (the true distribution or a reliable estimate of it). This affects some estimation procedures, as explained in the Details. |

### Details

As a first output, the function returns some well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available input data sources.

Second, the function compares the quantiles estimated from `data.A` and `data.B`; in particular, it returns the average of the absolute differences and the average of the squared differences. The number of estimated percentiles depends on the smaller of the two sample sizes: only quartiles are calculated when min(n.A, n.B)<20; deciles are estimated when min(n.A, n.B)>=20 and min(n.A, n.B)<=30; quantiles for `probs=seq(from = 0.05,to = 0.95,by = 0.05)` are estimated when min(n.A, n.B)>30. When survey weights are available (indicated with the arguments `w.A` and/or `w.B`), they are used in estimating the quantiles by calling the function `wtd.quantile` in the package Hmisc.
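The quantile comparison can be sketched in base R as follows. This is only an illustration of the rule described above for the unweighted case, not the package's own code: the helper names `choose_probs` and `compare_quantiles`, the output names `avg.abs` and `avg.sq`, and the simulated data are all assumptions made for the example.

```r
# Hypothetical helper: pick the probabilities according to the smaller sample size
choose_probs <- function(n_min) {
  if (n_min < 20) {
    c(0.25, 0.50, 0.75)                     # quartiles only
  } else if (n_min <= 30) {
    seq(from = 0.1, to = 0.9, by = 0.1)     # deciles
  } else {
    seq(from = 0.05, to = 0.95, by = 0.05)  # every 5th percentile
  }
}

# Hypothetical helper: average absolute and squared differences between quantiles
compare_quantiles <- function(x_a, x_b) {
  probs <- choose_probs(min(length(x_a), length(x_b)))
  q_a <- quantile(x_a, probs = probs, names = FALSE)
  q_b <- quantile(x_b, probs = probs, names = FALSE)
  d <- q_a - q_b
  c(avg.abs = mean(abs(d)), avg.sq = mean(d^2))
}

# Simulated data, for illustration only
set.seed(1)
x_a <- rnorm(200, mean = 40, sd = 10)
x_b <- rnorm(150, mean = 42, sd = 10)
res_q <- compare_quantiles(x_a, x_b)
print(res_q)
```

With survey weights, the calls to `quantile` would be replaced by a weighted quantile estimator, which the text above says is `wtd.quantile` from Hmisc.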

The function also estimates the dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum of the absolute differences, the sum of the two maximum differences obtained by inverting the order of the terms in the difference (i.e. the maximum of F.A - F.B plus the maximum of F.B - F.A), and the average of the absolute differences. When weights are provided, they are used in estimating the empirical cumulative distribution function. Note that when `ref=TRUE` the estimation of the density and of the empirical cumulative distribution function is guided by the data in `data.B`.
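A minimal unweighted sketch of these three measures in base R is given below. It assumes that both empirical distribution functions are evaluated at the pooled distinct observed values, which may differ from the evaluation points the package actually uses (in particular when `ref=TRUE`); the function name `ecdf_distances` and the output names are hypothetical.

```r
# Hypothetical helper: dissimilarities between two unweighted ECDFs
ecdf_distances <- function(x_a, x_b) {
  f_a <- ecdf(x_a)
  f_b <- ecdf(x_b)
  # assumption: evaluate both ECDFs at the pooled distinct values
  grid <- sort(unique(c(x_a, x_b)))
  d <- f_a(grid) - f_b(grid)
  c(max.abs = max(abs(d)),       # largest absolute difference
    sum.max = max(d) + max(-d),  # max(F.A - F.B) plus max(F.B - F.A)
    avg.abs = mean(abs(d)))      # average absolute difference
}

# Two small samples with no overlap, for illustration only
res_e <- ecdf_distances(c(1, 2, 3), c(4, 5, 6))
print(res_e)
```

The first measure has the form of the two-sample Kolmogorov-Smirnov statistic and the second that of the Kuiper statistic; here they are used only as descriptive distances, not as test statistics.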

The final output consists of the total variation distance, the overlap and the Hellinger distance, calculated on the categorized version of the variable. The breaks used to categorize the variable are chosen according to the Freedman-Diaconis rule (`nclass`); in this case, when `ref=TRUE` the IQR is estimated solely on `data.B`, whereas with `ref=FALSE` it is estimated by joining the two data sources. When present, the weights are used in estimating the relative frequencies of the categorized variable. For additional details on these distances, see `comp.prop`.
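The discretization step and the three distances can be sketched as follows for the unweighted case with the two sources pooled (the `ref=FALSE` situation). This is an assumption-laden illustration rather than the package's implementation: the helper `discr_distances`, the equal-width breaks over the pooled range, and the output names are hypothetical; the Hellinger distance is written here as the square root of one minus the Bhattacharyya coefficient.

```r
# Hypothetical helper: distances between two samples after categorization
discr_distances <- function(x_a, x_b) {
  pooled <- c(x_a, x_b)
  k <- grDevices::nclass.FD(pooled)  # Freedman-Diaconis number of classes
  # assumption: equal-width classes spanning the pooled range
  brks <- seq(min(pooled), max(pooled), length.out = k + 1)
  p_a <- as.numeric(prop.table(table(cut(x_a, brks, include.lowest = TRUE))))
  p_b <- as.numeric(prop.table(table(cut(x_b, brks, include.lowest = TRUE))))
  tvd <- 0.5 * sum(abs(p_a - p_b))   # total variation distance
  bc <- sum(sqrt(p_a * p_b))         # Bhattacharyya coefficient
  c(tvd = tvd,
    overlap = 1 - tvd,               # equals sum(pmin(p_a, p_b))
    Hell = sqrt(max(0, 1 - bc)))     # guard against tiny negative rounding error
}

# Two well-separated samples, for illustration only
res_d <- discr_distances(c(1, 2, 3, 4), c(101, 102, 103, 104))
print(res_d)
```

With survey weights, the relative frequencies `p_a` and `p_b` would be replaced by weighted shares of each class.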

### Value

A `list` object with four components.

| Component | Description |
|---|---|
| `summary` | A matrix with summaries of `xlab.A` estimated on `data.A` and summaries of `xlab.B` estimated on `data.B`. |
| `diff.Qs` | Average of absolute and squared differences between the quantiles of `xlab.A` estimated on `data.A` and the corresponding ones of `xlab.B` estimated on `data.B`. |
| `dist.ecdf` | Dissimilarity measures between the estimated empirical cumulative distribution functions. |
| `dist.discr` | Distance between the distributions after discretization of the target variable. |

### Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

### References

Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407–424.

### See Also

`plotCont`, `comp.prop`

### Examples

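```r
library(StatMatch)

# load the example data sets shipped with StatMatch
data(samp.A)
data(samp.B)

# unweighted comparison of the distribution of "age"
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age")

# weighted comparison using the survey weights "ww"
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age",
          w.A = "ww", w.B = "ww")
```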

StatMatch documentation built on March 18, 2022, 6:55 p.m.