# harmonize.x: Harmonizes the marginal (joint) distribution of a set of... In StatMatch: Statistical Matching or Data Fusion

 harmonize.x R Documentation

## Harmonizes the marginal (joint) distribution of a set of variables observed independently in two sample surveys referred to the same target population

### Description

This function harmonizes the marginal or the joint distribution of a set of variables observed independently in two sample surveys carried out on the same target population. The harmonization is achieved by calibrating the survey weights of the sample units in both surveys, following the procedure suggested by Renssen (1998).

### Usage

```
harmonize.x(svy.A, svy.B, form.x, x.tot=NULL,
            cal.method="linear", ...)
```

### Arguments

`svy.A` A `svydesign` R object that stores the data collected in survey A and all the information concerning the corresponding sampling design. This object can be created with the function `svydesign` in the package survey.

`svy.B` A `svydesign` R object that stores the data collected in survey B and all the information concerning the corresponding sampling design. This object can be created with the function `svydesign` in the package survey.

`form.x` An R formula specifying which of the variables common to both surveys have to be considered, and how. For instance, `form.x=~x1+x2` means that the marginal distributions of the variables x1 and x2 have to be harmonized, and an ‘Intercept’ is also included. To drop the intercept, write the formula as `form.x=~x1+x2-1`. When dealing with categorical variables, the formula `form.x=~x1:x2-1` means that the harmonization is carried out on the joint distribution of the two variables (x1 vs. x2). To better understand how `form.x` works see `model.matrix` (see also `formula`). Because of the way weight calibration works, it is preferable to use categorical X variables. In some cases the procedure may succeed when a single continuous variable is considered jointly with one or more categorical variables. When dealing with several continuous variables it may be preferable to categorize them.

`x.tot` A vector or table with known population totals for the X variables. A vector is required when `cal.method="linear"` or `cal.method="raking"`. The names and the length of the vector depend on how the argument `form.x` is specified (see `model.matrix`). A contingency table is required when `cal.method="poststratify"` (for details see `postStratify`). When `x.tot` is not provided (i.e. `x.tot=NULL`), the vector of totals is estimated as a weighted average of the totals estimated from the two surveys. The weight assigned to the totals estimated from A is lambda = n_A/(n_A+n_B), and 1-lambda is the weight assigned to the X totals estimated from survey B (n_A and n_B are the numbers of units in A and B, respectively).

`cal.method` A string that specifies how the weights in `svy.A` and `svy.B` have to be calibrated. By default linear calibration is performed (`cal.method="linear"`) by means of the function `calibrate` in the package survey. Alternatively, the original survey weights can be raked by specifying `cal.method="raking"`. Finally, a simple post-stratification can be performed by setting `cal.method="poststratify"`; in this case the weights are adjusted with the function `postStratify` in the package survey.

`...` Further arguments that may be necessary for calibration or post-stratification. The number of iterations used in calibration can also be changed through the argument `maxit` (by default `maxit=50`). See `calibrate` for further details.
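As an illustration of how `form.x` determines the names and the length of a totals vector passed as `x.tot`, the following sketch inspects the columns that `model.matrix` produces for a marginal and for a joint specification. It uses only base R and the `quine` data from MASS; the object names (`X.marg`, `X.joint`, `tot.joint`) are illustrative and not part of StatMatch.

```r
# Illustrative only: base R plus the 'quine' data shipped with MASS.
data(quine, package = "MASS")

# Marginal distributions of Sex and Age, no intercept:
# one column per level of Sex, plus contrast columns for Age.
X.marg <- model.matrix(~ Sex + Age - 1, data = quine)
colnames(X.marg)

# Joint distribution of Sex and Age: one column per Sex-by-Age cell.
X.joint <- model.matrix(~ Sex:Age - 1, data = quine)
colnames(X.joint)

# A totals vector built this way carries the names that the
# corresponding formula generates.
tot.joint <- colSums(X.joint)
tot.joint
```

The column sums of the model matrix built on the full population give one total per column, which is why the vector of totals has to follow the same naming as the formula.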

### Details

This function harmonizes the totals of the X variables, observed in both survey A and survey B, so that they equal the known totals specified via `x.tot`. When these totals are not known (`x.tot=NULL`) they are estimated by combining the estimates derived from the two separate surveys. The harmonization follows a procedure suggested by Renssen (1998), based on the calibration of survey weights (for more details on calibration see Sarndal and Lundstrom, 2005). The procedure is particularly suited to categorical X variables; in this case it harmonizes the joint or the marginal distribution of the categorical variables being considered. Note that an incomplete crossing of the X variables can also be considered (e.g. harmonization with respect to the joint distribution of X_1 x X_2 and the marginal distribution of X_3).
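The idea can be sketched in base R for the case of unknown totals and linear calibration. This is only an illustration and not the code used inside `harmonize.x` (which relies on `calibrate` from the survey package). It assumes simple random sampling weights N/n in both samples, a deterministic split of `quine`, and a hypothetical helper `lin.cal`.

```r
# Sketch only: base R, not the code used inside harmonize.x.
data(quine, package = "MASS")

# Deterministic split of quine into two "surveys" (assumption for illustration).
in.B <- seq_len(nrow(quine)) %% 3 == 0
qA <- quine[!in.B, ]
qB <- quine[in.B, ]
N <- nrow(quine); nA <- nrow(qA); nB <- nrow(qB)

# Design matrices for the marginal distributions of Sex and Age.
XA <- model.matrix(~ Sex + Age - 1, data = qA)
XB <- model.matrix(~ Sex + Age - 1, data = qB)

# Starting weights, assuming simple random sampling in both surveys.
dA <- rep(N / nA, nA)
dB <- rep(N / nB, nB)

# Pooled totals: lambda * (totals from A) + (1 - lambda) * (totals from B).
lambda <- nA / (nA + nB)
tot.pool <- lambda * colSums(dA * XA) + (1 - lambda) * colSums(dB * XB)

# Linear calibration: w = d * (1 + X %*% beta), with beta chosen so that
# the weighted column totals of X equal 'tot'. 'lin.cal' is a hypothetical helper.
lin.cal <- function(X, d, tot) {
  beta <- solve(crossprod(X, d * X), tot - colSums(d * X))
  as.vector(d * (1 + X %*% beta))
}

wA <- lin.cal(XA, dA, tot.pool)
wB <- lin.cal(XB, dB, tot.pool)

# Both samples now reproduce the same totals for Sex and Age.
rbind(A = colSums(wA * XA), B = colSums(wB * XB), pooled = tot.pool)
```

Linear calibration has a closed-form solution but can return negative weights, which is why it is worth inspecting the calibrated weights afterwards.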

The calibration procedure may fail to produce a final result because of convergence problems; in this case an error message is issued. To reach convergence it may be necessary to run the procedure with fewer constraints (i.e. a reduced number of population totals), either by joining adjacent categories or by discarding some variables.
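Joining adjacent categories can be done before building the survey design objects. The sketch below merges two age groups of `quine` and shows how the number of constraints for a joint Sex-by-Age specification shrinks; the choice of which levels to merge and the name `quine2` are purely illustrative.

```r
# Illustrative only: reduce the number of calibration constraints
# by merging two adjacent levels of a factor.
data(quine, package = "MASS")
quine2 <- quine

# Assigning the same label to two levels merges them into one.
levels(quine2$Age)[levels(quine2$Age) %in% c("F2", "F3")] <- "F2-F3"
table(quine2$Age)

# Number of totals (constraints) before and after the merge.
n.before <- ncol(model.matrix(~ Sex:Age - 1, data = quine))
n.after  <- ncol(model.matrix(~ Sex:Age - 1, data = quine2))
c(before = n.before, after = n.after)
```

The same recoding has to be applied to the data of both surveys, and to the known totals if these are supplied, so that the categories stay comparable.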

In a limited number of cases it may be possible to consider both categorical and continuous variables. In this situation the calibration may not succeed, and it may be necessary to categorize the continuous X variables in order to reach convergence.
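A continuous variable can be categorized with `cut` before it enters `form.x`. In the sketch below `Days` from `quine` is used purely as a stand-in for a continuous variable, and the break points are arbitrary assumptions chosen for illustration.

```r
# Illustrative only: turn a continuous variable into a factor.
data(quine, package = "MASS")

# Arbitrary break points (assumption); include.lowest keeps Days == 0.
brk <- c(0, 5, 15, 30, Inf)
quine$Days.cat <- cut(quine$Days, breaks = brk, include.lowest = TRUE)

# Resulting classes and their frequencies.
table(quine$Days.cat)
```

Classes with very few units are likely to cause the same convergence problems as sparse categorical cells, so the break points are worth checking against the sample sizes of both surveys.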

Post-stratification is a special case of calibration; all the weights of the units in a given post-stratum are modified so as to reproduce the known total for that post-stratum. Post-stratification avoids problems of convergence but, on the other hand, it may produce final weights with a higher variability than those derived from the calibration.
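The weight adjustment described above can be written out in a few lines of base R: every unit in post-stratum h has its weight multiplied by the known total of h divided by the estimated total of h. The sketch assumes simple random sampling weights and a deterministic half sample of `quine`; it does not call `postStratify` and is not the code used by `harmonize.x`.

```r
# Sketch only: post-stratification of the weights by Sex x Age in base R.
data(quine, package = "MASS")

# Deterministic half sample (assumption for illustration).
smp <- quine[seq(1, nrow(quine), by = 2), ]
d <- rep(nrow(quine) / nrow(smp), nrow(smp))   # starting weights N/n

# Post-strata: the cells of Sex x Age.
cell.pop <- interaction(quine$Sex, quine$Age)
cell.smp <- interaction(smp$Sex, smp$Age)

# Known totals and totals estimated with the starting weights.
N.h   <- tapply(rep(1, nrow(quine)), cell.pop, sum)
est.h <- tapply(d, cell.smp, sum)

# Each unit's weight is scaled by the ratio for its own post-stratum.
ratio <- N.h / est.h
w <- d * as.vector(ratio[as.character(cell.smp)])

# The adjusted weights reproduce the known total in every post-stratum.
cbind(known = N.h, poststratified = tapply(w, cell.smp, sum))
```

Because the adjustment is a single ratio per post-stratum, no iteration is involved, but every post-stratum must contain at least one sample unit.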

### Value

An R list with the results of the calibration procedures carried out on survey A and survey B, respectively. In particular, the following components are provided:

`cal.A` The survey object `svy.A` after the calibration; in particular, the weights are now calibrated with respect to the totals of the X variables.

`cal.B` The survey object `svy.B` after the calibration; in particular, the weights are now calibrated with respect to the totals of the X variables.

`weights.A` The new calibrated weights associated with the units in `svy.A`.

`weights.B` The new calibrated weights associated with the units in `svy.B`.

`call` Stores the call to this function with all the values specified for the various arguments (`call=match.call()`).

### Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

### References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Renssen, R.H. (1998) “Use of Statistical Matching Techniques in Calibration Estimation”. Survey Methodology, N. 24, pp. 171–183.

Sarndal, C.E. and Lundstrom, S. (2005) Estimation in Surveys with Nonresponse. Wiley, Chichester.

### See Also

`comb.samples`, `calibrate`, `svydesign`, `postStratify`

### Examples

```
data(quine, package="MASS") #loads quine from MASS
str(quine)

# split quine in two subsets
suppressWarnings(RNGversion("3.5.0"))
set.seed(7654)
lab.A <- sample(nrow(quine), 70, replace=TRUE)
quine.A <- quine[lab.A, c("Eth","Sex","Age","Lrn")]
quine.B <- quine[-lab.A, c("Eth","Sex","Age","Days")]

# create svydesign objects
library(survey)
quine.A$f <- 70/nrow(quine) # sampling fraction
quine.B$f <- (nrow(quine)-70)/nrow(quine)
svy.qA <- svydesign(~1, fpc=~f, data=quine.A)
svy.qB <- svydesign(~1, fpc=~f, data=quine.B)

#------------------------------------------------------
# example (1)
# Harmonization of the distr. of Sex vs. Age
# using poststratification

# (1.a) known population totals
# the population totals are computed on the full data frame
tot.sex.age <- xtabs(~Sex+Age, data=quine)
tot.sex.age

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, form.x=~Sex+Age,
x.tot=tot.sex.age, cal.method="poststratify")

tot.A <- xtabs(out.hz$weights.A~Sex+Age, data=quine.A)
tot.B <- xtabs(out.hz$weights.B~Sex+Age, data=quine.B)

tot.sex.age-tot.A
tot.sex.age-tot.B

# (1.b) unknown population totals (x.tot=NULL)
# the population total is estimated by combining totals from the
# two surveys

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, form.x=~Sex+Age,
x.tot=NULL, cal.method="poststratify")

tot.A <- xtabs(out.hz$weights.A~Sex+Age, data=quine.A)
tot.B <- xtabs(out.hz$weights.B~Sex+Age, data=quine.B)

tot.A
tot.A-tot.B

#-----------------------------------------------------
# example (2)
# Harmonization wrt the marginal distribution
# of 'Eth', 'Sex' and 'Age'
# using linear calibration

# (2.a) vector of population total known
# estimated from the full data set
# note the formula! only marginal distribution of the
# variables are considered
tot.m <- colSums(model.matrix(~Eth+Sex+Age-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
form.x=~Eth+Sex+Age-1, cal.method="linear")

summary(out.hz$weights.A) #check for negative weights
summary(out.hz$weights.B) #check for negative weights

tot.m
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)

svytable(formula=~Sex, design=out.hz$cal.A)
svytable(formula=~Sex, design=out.hz$cal.B)

# Note: margins are equal but joint distributions are not!
svytable(formula=~Sex+Age, design=out.hz$cal.A)
svytable(formula=~Sex+Age, design=out.hz$cal.B)

# (2.b) vector of population total unknown
out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=NULL,
form.x=~Eth+Sex+Age-1, cal.method="linear")
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)

svytable(formula=~Sex, design=out.hz$cal.A)
svytable(formula=~Sex, design=out.hz$cal.B)

#-----------------------------------------------------
# example (3)
# Harmonization wrt the joint distribution of 'Sex' vs. 'Age'
# and the marginal distribution of 'Eth'
# using raking

# vector of population total known
# estimated from the full data set
# note the formula!
tot.m <- colSums(model.matrix(~Eth+(Sex:Age-1)-1, data=quine))
tot.m

out.hz <- harmonize.x(svy.A=svy.qA, svy.B=svy.qB, x.tot=tot.m,
form.x=~Eth+(Sex:Age)-1, cal.method="raking")

summary(out.hz$weights.A) #check for negative weights
summary(out.hz$weights.B) #check for negative weights

tot.m
svytable(formula=~Eth, design=out.hz$cal.A)
svytable(formula=~Eth, design=out.hz$cal.B)

svytable(formula=~Sex+Age, design=out.hz$cal.A)
svytable(formula=~Sex+Age, design=out.hz$cal.B)

```

StatMatch documentation built on March 18, 2022, 6:55 p.m.