comp.cont | R Documentation |
This function measures the “closeness” of the distributions of the same continuous variable when they are estimated from different data sources.
comp.cont(data.A, data.B, xlab.A, xlab.B = NULL, w.A = NULL, w.B = NULL, ref = FALSE)
data.A |
A dataframe or matrix containing the variable of interest |
data.B |
A dataframe or matrix containing the variable of interest |
xlab.A |
Character string providing the name of the variable in data.A |
xlab.B |
Character string providing the name of the variable in data.B |
w.A |
Character string providing the name of the optional weighting variable in data.A |
w.B |
Character string providing the name of the optional weighting variable in data.B |
ref |
Logical. When |
As a first output, the function returns some well-known summary measures (min, Q1, median, mean, Q3, max and sd) estimated from the available input data sources.
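The summary measures listed above can be reproduced by hand with base R. The following is a minimal sketch on simulated data, not the package's own code; the helper name summ and the vectors x.A and x.B are hypothetical, and no weights are used.

```r
# Simulated stand-ins for the variable of interest in the two sources
set.seed(4)
x.A <- rnorm(80, mean = 40, sd = 10)
x.B <- rnorm(120, mean = 42, sd = 12)

# Hypothetical helper: the seven unweighted summaries named in the text
summ <- function(x) {
  c(min = min(x),
    Q1 = quantile(x, 0.25, names = FALSE),
    median = median(x),
    mean = mean(x),
    Q3 = quantile(x, 0.75, names = FALSE),
    max = max(x),
    sd = sd(x))
}

# One row per data source, one column per summary
out <- rbind(A = summ(x.A), B = summ(x.B))
print(round(out, 2))
```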
Secondly, the function compares the quantiles estimated from data.A and data.B; in particular, it returns the average of the absolute differences and the average of the squared differences. The number of estimated quantiles depends on the smaller of the two sample sizes: only quartiles are calculated when min(n.A, n.B)<=50; quintiles are estimated when min(n.A, n.B)>50 and min(n.A, n.B)<=150; deciles are estimated when min(n.A, n.B)>150 and min(n.A, n.B)<=250; finally, quantiles for probs=seq(from = 0.05, to = 0.95, by = 0.05) are estimated when min(n.A, n.B)>250. When survey weights are available (indicated with the arguments w.A and/or w.B), they are used in estimating the quantiles by calling the function wtd.quantile in the package Hmisc.
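The threshold rule for the quantile comparison can be sketched in base R as follows. This is an illustration under assumptions, not the package's source: the helper choose_probs is hypothetical, only interior probabilities are assumed (e.g. 0.25, 0.50, 0.75 for quartiles), and the unweighted quantile function stands in for the weighted estimator.

```r
# Hypothetical helper: pick the probabilities from the smaller sample size,
# following the thresholds described in the text
choose_probs <- function(n.A, n.B) {
  n.min <- min(n.A, n.B)
  if (n.min <= 50) {
    seq(0.25, 0.75, by = 0.25)   # quartiles
  } else if (n.min <= 150) {
    seq(0.20, 0.80, by = 0.20)   # quintiles
  } else if (n.min <= 250) {
    seq(0.10, 0.90, by = 0.10)   # deciles
  } else {
    seq(0.05, 0.95, by = 0.05)   # every 5th percentile
  }
}

# Simulated stand-ins for the two data sources
set.seed(1)
x.A <- rnorm(120, mean = 40, sd = 10)
x.B <- rnorm(300, mean = 42, sd = 12)

probs <- choose_probs(length(x.A), length(x.B))
q.A <- quantile(x.A, probs = probs)
q.B <- quantile(x.B, probs = probs)

# Average absolute and average squared difference between the quantiles
diff.Qs <- c(avg.abs = mean(abs(q.A - q.B)),
             avg.sq  = mean((q.A - q.B)^2))
print(diff.Qs)
```

Here min(n.A, n.B) is 120, so quintiles are compared.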
The function also estimates the dissimilarities between the estimated empirical cumulative distribution functions. The measures considered are the maximum of the absolute differences, the sum of the two maximum differences obtained by inverting the terms in the difference, and the average of the absolute differences. When weights are provided, they are used in estimating the empirical cumulative distribution functions. Note that when ref=TRUE the estimation of the density and of the empirical cumulative distribution function is guided by the data in data.B.
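The three dissimilarities between empirical cumulative distribution functions can be illustrated with base R. The sketch below is unweighted and evaluates both functions on the pooled observed values, which is an assumption about the evaluation grid rather than a statement of what the package does; the names F.A, F.B, grid and the labels of the result are hypothetical.

```r
# Simulated stand-ins for the two data sources
set.seed(2)
x.A <- rnorm(100)
x.B <- rnorm(150, mean = 0.3)

# Unweighted empirical cumulative distribution functions
F.A <- ecdf(x.A)
F.B <- ecdf(x.B)

# Assumed evaluation grid: all distinct values observed in either source
grid <- sort(unique(c(x.A, x.B)))
d <- F.A(grid) - F.B(grid)

dist.ecdf <- c(
  max.abs  = max(abs(d)),        # maximum absolute difference
  sum.max  = max(d) + max(-d),   # sum of the maxima taken in both directions
  avg.abs  = mean(abs(d))        # average absolute difference
)
print(dist.ecdf)
```

The first measure has the form of the Kolmogorov-Smirnov statistic and the second that of the Kuiper statistic.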
The final output is the total variation distance, the overlap and the Hellinger distance, calculated on the categorized version of the variable. The breaks used to categorize the variable are chosen according to the Freedman-Diaconis rule (nclass); in this case, when ref=TRUE the IQR is estimated solely on data.B, whereas with ref=FALSE it is estimated by joining the two data sources.
When present, the weights are used in estimating the relative frequencies of the categorized variable.
For additional details on these distances, please see comp.prop.
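The distances computed after discretization can be sketched in base R. This is an unweighted illustration of the ref=FALSE case under assumptions: the number of classes comes from nclass.FD applied to the pooled data, the breaks are taken as equally spaced over the pooled range, and the Hellinger distance is written as the square root of one minus the Bhattacharyya coefficient. The package may build its breaks differently.

```r
# Simulated stand-ins for the two data sources
set.seed(3)
x.A <- rnorm(200, mean = 50, sd = 10)
x.B <- rnorm(250, mean = 52, sd = 12)

# ref = FALSE: pool the two sources before applying the Freedman-Diaconis rule
pooled <- c(x.A, x.B)
k <- nclass.FD(pooled)
breaks <- seq(min(pooled), max(pooled), length.out = k + 1)

# Relative frequencies of the categorized variable in each source
p.A <- as.numeric(prop.table(table(cut(x.A, breaks, include.lowest = TRUE))))
p.B <- as.numeric(prop.table(table(cut(x.B, breaks, include.lowest = TRUE))))

tvd       <- 0.5 * sum(abs(p.A - p.B))                # total variation distance
overlap   <- sum(pmin(p.A, p.B))                      # equals 1 - tvd
hellinger <- sqrt(max(0, 1 - sum(sqrt(p.A * p.B))))   # Hellinger distance

print(c(tvd = tvd, overlap = overlap, Hell = hellinger))
```

With weights, the relative frequencies would be weighted shares of each class instead of simple counts.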
A list object with four components.
summary |
A matrix with summaries of |
diff.Qs |
Average of absolute and squared differences between the quantiles of |
dist.ecdf |
Dissimilarity measures between the estimated empirical cumulative distribution functions. |
dist.discr |
Distance between the distributions after discretization of the target variable. |
Marcello D'Orazio mdo.statmatch@gmail.com
Bellhouse D.R. and J. E. Stafford (1999). “Density Estimation from Complex Surveys”. Statistica Sinica, 9, 407–424.
plotCont, comp.prop
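# comp.cont and the example data sets are provided by the StatMatch package
library(StatMatch)
data(samp.A)
data(samp.B)
# unweighted comparison of the distributions of "age"
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age")
# weighted comparison, using the survey weights "ww" in both samples
comp.cont(data.A = samp.A, data.B = samp.B, xlab.A = "age",
          w.A = "ww", w.B = "ww")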