rankNND.hotdeck: Rank distance hot deck method.
In StatMatch: Statistical Matching or Data Fusion

Description

This function implements the rank distance hot deck method. For each recipient record, the closest donor is chosen by considering the distance between the percentage points of the empirical cumulative distribution function.

Usage

rankNND.hotdeck(data.rec, data.don, var.rec, var.don=var.rec, 
                 don.class=NULL,  weight.rec=NULL, weight.don=NULL,
                 constrained=FALSE, constr.alg="Hungarian",
                 keep.t=FALSE)

Arguments

`data.rec`	A numeric matrix or data frame that plays the role of recipient. This data frame must contain the variable `var.rec` to be used in computing the percentage points of the empirical cumulative distribution function and, optionally, the variables that identify the donation classes (see argument `don.class`) and the case weights (see argument `weight.rec`). Missing values (`NA`) are not allowed.
`data.don`	A matrix or data frame that plays the role of donor. This data frame must contain the variable `var.don` to be used in computing the percentage points of the empirical cumulative distribution function and, optionally, the variables that identify the donation classes (see argument `don.class`) and the case weights (see argument `weight.don`).
`var.rec`	A character vector with the name of the variable in `data.rec` that should be ranked.
`var.don`	A character vector with the name of the variable in `data.don` that should be ranked. If not specified, by default `var.don=var.rec`.
`don.class`	A character vector with the names of the variables (columns in both data frames) that identify donation classes. In each donation class the computation of percentage points is carried out independently, and distances are computed only between percentage points of units in the same donation class. Empty donation classes should be avoided. Preferably, the variables used to form donation classes should be defined as `factor`. When not specified (default), no donation classes are used.
`weight.rec`	Optional name of the variable in `data.rec` that provides the weights to be used in computing the empirical cumulative distribution function for `var.rec` (see Details).
`weight.don`	Optional name of the variable in `data.don` that provides the weights to be used in computing the empirical cumulative distribution function for `var.don` (see Details).
`constrained`	Logical. When `constrained=FALSE` (default) each record in `data.don` can be used as a donor more than once. Conversely, when `constrained=TRUE` each record in `data.don` can be used as a donor only once. In this case, the set of donors is selected by solving a transportation problem that minimizes the overall matching distance. See the description of the argument `constr.alg` for details.
`constr.alg`	A string that has to be specified when `constrained=TRUE`. Two choices are available: “lpSolve” and “Hungarian”. In the first case, `constr.alg="lpSolve"`, the transportation problem is solved by means of the function `lp.transport` available in the package lpSolve. When `constr.alg="Hungarian"` (default) the transportation problem is solved using the Hungarian method, implemented in function `solve_LSAP` available in the package clue. Note that `constr.alg="Hungarian"` is faster and more efficient.
`keep.t`	Logical. When donation classes are used, setting `keep.t=TRUE` prints information on the donation classes being processed (by default `keep.t=FALSE`).

Details

This function finds a donor record for each record in the recipient data set. The chosen donor is the one at the closest distance in terms of the empirical cumulative distribution function (Singh et al., 1993). In practice, the distance is the absolute difference between the estimated empirical cumulative distribution function values of the reference variable (var.rec in data.rec and var.don in data.don). The empirical cumulative distribution function is estimated by:

\hat{F}(y) = \frac{1}{n} \sum_{i=1}^{n} I(y_i\leq y)

where I(y_i\leq y)=1 if y_i\leq y and 0 otherwise.

In the presence of weights w_i, the empirical cumulative distribution function is estimated by:

\hat{F}(y) = \frac{\sum_{i=1}^{n} w_i I(y_i\leq y)}{\sum_{i=1}^{n} w_i}
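
The two estimators above can be reproduced in a few lines of base R. This is a minimal sketch for illustration only: the helper `ecdf_at_obs` and the toy vectors `y` and `w` are hypothetical and are not part of StatMatch, whose internal code may differ.

```r
# Hypothetical helper: evaluate the ECDF at each observed value.
# w = NULL gives the unweighted estimator; otherwise the weighted one.
ecdf_at_obs <- function(y, w = NULL) {
  if (is.null(w)) w <- rep(1, length(y))
  # For each observed value v, add up the weights of all units with y_i <= v
  sapply(y, function(v) sum(w[y <= v])) / sum(w)
}

y <- c(30, 20, 50, 20)   # toy values of the ranked variable
w <- c(1, 2, 3, 4)       # toy case weights

p.unw <- ecdf_at_obs(y)      # 0.75 0.50 1.00 0.50
p.wgt <- ecdf_at_obs(y, w)   # 0.70 0.60 1.00 0.60
```

With unit weights the helper agrees with `stats::ecdf(y)(y)`; with unequal weights, units carrying larger weights shift the percentage points accordingly.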

In the unconstrained case, when several donors are at the same minimum distance from a recipient, one of them is chosen at random.
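
The unconstrained search with random tie-breaking can be sketched as follows. The function `rank_nnd_sketch` and its inputs are hypothetical names used only for illustration; the sketch assumes the percentage points have already been computed and does not claim to mirror the package's internal code.

```r
set.seed(1)  # only to make the random tie-breaking reproducible

# Hypothetical helper, not part of StatMatch.
# p.rec, p.don: ECDF values (percentage points) of recipients and donors.
rank_nnd_sketch <- function(p.rec, p.don) {
  d <- abs(outer(p.rec, p.don, "-"))   # recipients in rows, donors in columns
  t(apply(d, 1, function(row) {
    at.min <- which(row == min(row))   # all donors at the minimum distance
    # sample() on a length-one vector samples from 1:n, so draw an index instead
    pick <- at.min[sample.int(length(at.min), 1)]
    c(don.id = pick, dist = row[pick], noad = length(at.min))
  }))
}

p.rec <- c(0.25, 0.50, 1.00)
p.don <- c(0.50, 0.50, 1.00, 0.75)
res <- rank_nnd_sketch(p.rec, p.don)
res
```

Here the first two recipients each have two donors at the minimum distance (donors 1 and 2), so the chosen donor depends on the random draw, while the third recipient has a single closest donor (donor 3). The column `noad` in this sketch plays the same role as the `noad` component described under Value.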

When donation classes are introduced, the empirical cumulative distribution function is estimated independently in each donation class, and the search for a donor is restricted to donors in the same donation class as the recipient.
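
A small base R sketch of the donation-class logic is given below. The data frames `rec` and `don` and their columns are made up for illustration, and ties are resolved here by taking the first closest donor rather than a random one.

```r
# Toy recipient and donor data (hypothetical values)
rec <- data.frame(area = c("N", "N", "S", "S"), age = c(20, 40, 30, 60),
                  stringsAsFactors = FALSE)
don <- data.frame(area = c("S", "N", "N"), age = c(35, 25, 70),
                  stringsAsFactors = FALSE)

# ECDF of age estimated separately within each donation class
rec$p <- ave(rec$age, rec$area, FUN = function(y) ecdf(y)(y))
don$p <- ave(don$age, don$area, FUN = function(y) ecdf(y)(y))

# Distances between percentage points; pairs in different classes are excluded
d <- abs(outer(rec$p, don$p, "-"))
d[outer(rec$area, don$area, "!=")] <- Inf

# Index of the closest donor for each recipient (first one in case of ties)
nearest <- apply(d, 1, which.min)
nearest
```

Because the percentage points are computed within classes, a recipient in class "S" is compared only with the ranks of donors in class "S", even if a donor in class "N" has a closer age.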

By default a donor can be chosen more than once. To avoid this, set constrained=TRUE. In that case each donor can be chosen only once, and the donors are selected by solving a transportation problem whose objective is to minimize the overall matching distance (the sum of the recipient-donor distances).
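
The difference between the two modes can be illustrated on a tiny example. The sketch below enumerates all one-to-one assignments by brute force, which is feasible only for a handful of records; it is a stand-in for the transportation-problem solvers named under `constr.alg`, not the package's method. The helper `perms` and the toy vectors are hypothetical.

```r
# Hypothetical helper: all permutations of a vector (tiny inputs only)
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

p.rec <- c(0.50, 0.75)         # percentage points of two recipients
p.don <- c(0.50, 0.25, 1.00)   # percentage points of three donors
d <- abs(outer(p.rec, p.don, "-"))

# Unconstrained: every recipient takes its own closest donor (reuse allowed)
unconstr <- apply(d, 1, which.min)

# Constrained: each donor used at most once, minimizing the total distance
n.rec <- nrow(d)
cand <- unique(lapply(perms(seq_len(ncol(d))), function(p) p[seq_len(n.rec)]))
cost <- sapply(cand, function(a) sum(d[cbind(seq_len(n.rec), a)]))
constr <- cand[[which.min(cost)]]

unconstr   # 1 1: donor 1 is used twice
constr     # 1 3: each donor used at most once
```

In this toy case the constrained solution has the same total distance as the unconstrained one (0.25); in general the constraint can only leave the total distance unchanged or increase it.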

Value

An R list with the following components:

`mtc.ids`	A matrix with the same number of rows as `data.rec` and two columns. The first column contains the row names of `data.rec` and the second column contains the row names of the corresponding donors selected from `data.don`. When the input matrices do not contain row names, a numeric matrix with the row indexes is provided.
`dist.rd`	A vector with the distances between each recipient unit and the corresponding donor.
`noad`	The number of available donors at the minimum distance for each recipient unit (only in the unconstrained case).
`call`	How the function has been called.

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.

Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59–79.

Examples


library(StatMatch) # provides rankNND.hotdeck() and create.fused()
data(samp.A, samp.B, package="StatMatch") # loads the data sets

# samp.A plays the role of recipient
?samp.A

# samp.B plays the role of donor
?samp.B


# rankNND.hotdeck()
# donation classes formed using "area5"
# ecdf computed on "age"
# UNCONSTRAINED case
out.1 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age",
                         don.class="area5")
fused.1 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.1$mtc.ids, z.vars="labour5")
head(fused.1)

# as before but ecdf estimated using weights
# UNCONSTRAINED case
out.2 <- rankNND.hotdeck(data.rec=samp.A, data.don=samp.B, var.rec="age",
                         don.class="area5",
                         weight.rec="ww", weight.don="ww")
fused.2 <- create.fused(data.rec=samp.A, data.don=samp.B,
                        mtc.ids=out.2$mtc.ids, z.vars="labour5")
head(fused.2)

StatMatch documentation built on April 3, 2025, 10:03 p.m.