comb.samples | R Documentation |

This function cross-tabulates two categorical variables, Y and Z, observed separately in two independent surveys carried out on the same target population: Y is collected in survey A and Z in survey B. The two surveys share a number of common variables **X**. When a third survey C is available, carried out on the same population and collecting both Y and Z, its data are used as a source of auxiliary information.

The statistical matching is performed by carrying out calibration of the survey weights, as suggested in Renssen (1998).

The function can also be used to estimate the probability that a unit falls in each category of the target variable (the estimates are based on Linear Probability Models and are obtained as a by-product of Renssen's method).

comb.samples(svy.A, svy.B, svy.C=NULL, y.lab, z.lab, form.x, estimation=NULL, micro=FALSE, ...)

`svy.A` |
A |

`svy.B` |
A |

`svy.C` |
A When When |

`y.lab` |
A string providing the name of the Y variable, available in survey A and in survey C (if available). The Y variable can be a categorical variable ( |

`z.lab` |
A string providing the name of the Z variable available in survey B and in survey C (if available). The Z variable can be a categorical variable ( |

`form.x` |
A When dealing with categorical variables, To better understand the usage of Due to weights calibration features, it is preferable to work with categorical |

`estimation` |
A character string that identifies the method to be used to estimate the table of Y vs. Z when data from survey C are available. As suggested in Renssen (1998), two alternative methods are available: (i) Incomplete Two-Way Stratification ( |

`micro` |
Logical, when |

`...` |
Further arguments that may be necessary for calibration. In particular, the argument Note that when The number of iterations used in calibration can be modified by using the argument See |

This function estimates the contingency table of Y vs. Z by performing a series of calibrations of the survey weights. In practice the estimation is carried out on data in survey C by exploiting all the information from surveys A and B. When survey C is not available, the table of Y vs. Z is estimated under the Conditional Independence Assumption (CIA), i.e. *p(Y,Z,X) = p(Y|X) × p(Z|X) × p(X)*.
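The CIA factorization can be illustrated with a minimal base R sketch. It is not the calibration-based procedure used by `comb.samples`: the data frames `A` and `B`, the equal weights, and the pooling of both samples to estimate *p(X)* are all assumptions made only for illustration.

```r
# Toy data (hypothetical): survey A observes (X, Y), survey B observes (X, Z).
# Units are treated as equally weighted to keep the sketch short.
A <- data.frame(
  x = c("a", "a", "a", "b", "b", "b", "b", "b"),
  y = c("y1", "y1", "y2", "y1", "y2", "y2", "y2", "y2")
)
B <- data.frame(
  x = c("a", "a", "b", "b", "b", "b"),
  z = c("z1", "z2", "z1", "z1", "z1", "z2")
)

# p(X): estimated by pooling both samples (an assumption of this sketch)
p.x <- prop.table(table(c(A$x, B$x)))

# p(Y|X) from A and p(Z|X) from B; rows index the categories of X
p.y.x <- prop.table(table(A$x, A$y), margin = 1)
p.z.x <- prop.table(table(B$x, B$z), margin = 1)

# p(Y,Z) = sum over x of p(Y|x) * p(Z|x) * p(x)
yz.cia <- matrix(0, nrow = ncol(p.y.x), ncol = ncol(p.z.x),
                 dimnames = list(colnames(p.y.x), colnames(p.z.x)))
for (x in names(p.x)) {
  yz.cia <- yz.cia +
    p.x[[x]] * (as.numeric(p.y.x[x, ]) %o% as.numeric(p.z.x[x, ]))
}

yz.cia              # estimated joint distribution of Y and Z under the CIA
addmargins(yz.cia)  # the cells sum to 1
```

Summing over the categories of **X** is what turns the three-way factorization into a two-way table of Y vs. Z.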

When data from survey C are available (Renssen, 1998), the table of Y vs. Z can be estimated by Incomplete Two-Way Stratification (ITWS) or by Synthetic Two-Way Stratification (STWS). In the first case (ITWS), the weights of the units in survey C are calibrated so that the new weights reproduce the marginal distribution of Y estimated on survey A and that of Z estimated on survey B. Note that the distributions of the **X** variables in survey A and in survey B must be harmonized before performing ITWS (see `harmonize.x`).
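The idea behind ITWS can be sketched by writing out linear calibration by hand in base R. This is only an illustration under stated assumptions: the data frame `C`, the starting weights `d`, and the target totals `tot` are hypothetical, and the sketch ignores the sampling design that the function handles through the `survey` package.

```r
# Hypothetical survey C observing both Y and Z
C <- data.frame(
  y = factor(rep(c("y1", "y2"), times = c(12, 18))),
  z = factor(rep(c("z1", "z2", "z3"), length.out = 30))
)
d <- rep(100 / nrow(C), nrow(C))   # starting weights, summing to N = 100

# Design matrix: intercept plus dummies for y2, z2 and z3
M <- model.matrix(~ y + z, data = C)

# Hypothetical target totals, as if estimated from surveys A (for Y) and B (for Z):
# N = 100, y2 = 55 (so y1 = 45), z2 = 40, z3 = 30 (so z1 = 30)
tot <- c(100, 55, 40, 30)

# Linear calibration: w = d * (1 + M %*% lambda), with lambda chosen so that
# the calibrated weights reproduce the target totals exactly
lambda <- solve(crossprod(M, d * M), tot - colSums(d * M))
w <- d * as.vector(1 + M %*% lambda)

# Weighted table of Y vs. Z whose margins match the targets
yz.itws <- xtabs(w ~ y + z, data = C)
addmargins(yz.itws)
```

The interior cells of the table come from the joint observation of Y and Z in C, while the margins are forced to agree with the estimates from A and B.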

The Synthetic Two-Way Stratification estimates the table of Y vs. Z by also considering the **X** variables observed in C. This method corrects the table of Y vs. Z estimated under the CIA according to the relationship between Y and Z observed in survey C (for further details see Renssen, 1998).

When the argument `micro` is set to `TRUE`, the function also provides `Z.A` and `Y.B`. The first data.frame has the same rows as `svy.A`, and its number of columns equals the number of categories of the Z variable specified via `z.lab`. Each row provides the estimated probabilities that the unit falls in each category of Z. Likewise, `Y.B` provides the estimated probabilities of each category of `y.lab` for each unit in B. The estimated probabilities are obtained by applying linear probability models (for further details see Renssen, 1998). Unfortunately, such models may yield estimated probabilities below 0 or above 1, so these predictions should be used with great caution for practical purposes.
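The warning about linear probability models can be reproduced with a small base R sketch. The data are hypothetical, and a numeric predictor with an unweighted `lm()` fit is used only to make the effect easy to see; it is not the model fitted internally by `comb.samples`.

```r
# Hypothetical data: indicator of one category of Z and a numeric predictor
x <- 1:10
z.ind <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Linear probability model: ordinary least squares on the 0/1 indicator
fit <- lm(z.ind ~ x)
p.hat <- fitted(fit)

range(p.hat)                          # fitted "probabilities" leave [0, 1]
out.of.range <- p.hat < 0 | p.hat > 1
data.frame(x, p.hat, out.of.range)    # flag the units with unusable predictions
```

Checking the range of the columns of `Z.A` and `Y.B` in the same way is a cheap safeguard before using the micro-level predictions.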

An **R** list with the results of the calibration procedure according to the input arguments.

`yz.CIA` |
The table of Y ( |

`cal.C` |
The survey object |

`yz.est` |
The table of Y ( |

`Z.A` |
Only when |

`Y.B` |
Only when |

`call` |
Stores the call to this function with all the values specified for the various arguments ( |

Marcello D'Orazio mdo.statmatch@gmail.com

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). *Statistical Matching: Theory and Practice*. Wiley, Chichester.

Renssen, R.H. (1998) “Use of Statistical Matching Techniques in Calibration Estimation”. *Survey Methodology*, **24**, pp. 171–183.

`calibrate`, `svydesign`, `harmonize.x`

```r
library(StatMatch)
library(survey)

data(quine, package="MASS") # loads quine from MASS
str(quine)
quine$c.Days <- cut(quine$Days, c(-1, seq(0,20,10),100))
table(quine$c.Days)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(124)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age","Lrn")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age","c.Days")]

# create svydesign objects
quine.A$f <- 70/nrow(quine) # sampling fraction
quine.B$f <- (nrow(quine)-70)/nrow(quine)
svy.qA <- svydesign(~1, fpc=~f, data=quine.A)
svy.qB <- svydesign(~1, fpc=~f, data=quine.B)

# Harmonization wrt the joint distribution
# of ('Sex' x 'Age' x 'Eth')

# vector of population totals known
# estimated from the full data set
# note the formula!
tot.m <- colSums(model.matrix(~Eth:Sex:Age-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
                      form.x=~Eth:Sex:Age-1, cal.method="linear")

# estimation of 'Lrn' vs. 'c.Days' under the CIA
svy.qA.h <- out.hz$cal.A
svy.qB.h <- out.hz$cal.B

out.1 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
                      svy.C=NULL, y.lab="Lrn", z.lab="c.Days",
                      form.x=~Eth:Sex:Age-1)

out.1$yz.CIA
addmargins(out.1$yz.CIA)

#
# incomplete two-way stratification

# select a sample C from quine
# and define a survey object
suppressWarnings(RNGversion("3.5.0"))
set.seed(4321)
lab.C <- sample(nrow(quine), 50, replace=TRUE)
quine.C <- quine[lab.C, c("Lrn","c.Days")]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)

# call comb.samples
out.2 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
                      svy.C=svy.qC, y.lab="Lrn", z.lab="c.Days",
                      form.x=~Eth:Sex:Age-1, estimation="incomplete",
                      calfun="linear", maxit=100)

summary(weights(out.2$cal.C))
out.2$yz.est # estimated table of 'Lrn' vs. 'c.Days'

# difference wrt the table 'Lrn' vs. 'c.Days' under CIA
addmargins(out.2$yz.est)-addmargins(out.2$yz.CIA)

# synthetic two-way stratification
# only macro estimation
quine.C <- quine[lab.C, ]
quine.C$f <- 50/nrow(quine) # sampling fraction
svy.qC <- svydesign(~1, fpc=~f, data=quine.C)

out.3 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
                      svy.C=svy.qC, y.lab="Lrn", z.lab="c.Days",
                      form.x=~Eth:Sex:Age-1, estimation="synthetic",
                      calfun="linear", bounds=c(.5,Inf), maxit=100)

summary(weights(out.3$cal.C))
out.3$yz.est # estimated table of 'Lrn' vs. 'c.Days'

# difference wrt the table of 'Lrn' vs. 'c.Days' under CIA
addmargins(out.3$yz.est)-addmargins(out.3$yz.CIA)

# diff wrt the table of 'Lrn' vs. 'c.Days' under incomplete 2ws
addmargins(out.3$yz.est)-addmargins(out.2$yz.est)

# synthetic two-way stratification
# with micro predictions
out.4 <- comb.samples(svy.A=svy.qA.h, svy.B=svy.qB.h,
                      svy.C=svy.qC, y.lab="Lrn", z.lab="c.Days",
                      form.x=~Eth:Sex:Age-1, estimation="synthetic",
                      micro=TRUE, calfun="linear", bounds=c(.5,Inf),
                      maxit=100)

head(out.4$Z.A)
head(out.4$Y.B)
```
