NND.hotdeck | R Documentation |
This function implements the distance hot deck method to match the records of two data sources that share some variables.
NND.hotdeck(data.rec, data.don, match.vars,
don.class=NULL, dist.fun="Manhattan",
constrained=FALSE, constr.alg="Hungarian",
k=1, keep.t=FALSE, ...)
data.rec |
A matrix or data frame that plays the role of recipient. This data frame must contain the variables (columns) that should be used, directly or indirectly, in the matching application (specified via Missing values ( |
data.don |
A matrix or data frame that plays the role of donor. The variables (columns) involved, directly or indirectly, in the computation of distance must be the same and of the same type as those in |
match.vars |
A character vector with the names of the matching variables (the columns in both the data frames) that have to be used to compute distances between records (rows) in |
don.class |
A character vector with the names of the variables (columns in both the data frames) that have to be used to identify the donation classes. In this case the computation of distances is limited to those units of The variables chosen for the creation of the donation classes should NOT contain missing values (NAs). When not specified (default), no donation classes are used. This choice may require more memory to store a larger distance matrix and a higher computational effort. |
dist.fun |
A string with the name of the distance function that has to be used. The following distances are allowed: “Manhattan” (aka “City block”; default), “Euclidean”, “Mahalanobis”, “exact” or “exact matching”, “Gower”, “minimax” or one of the distance functions available in the package proxy. Note that the distance is computed using the function When |
constrained |
Logical. When |
constr.alg |
A string that has to be specified when |
k |
The number of times that a unit in |
keep.t |
Logical, when donation classes are used by setting |
... |
Additional arguments that may be required by |
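The effect of donation classes can be sketched in base R. This is only an illustration on made-up data, not the package's implementation: the data frames rec and don, the column grp (standing in for a don.class variable) and the absolute difference on x (standing in for the chosen distance) are all assumptions.

```r
# Toy data (made up); `grp` plays the role of a donation class variable.
rec <- data.frame(x = c(1.0, 9.0), grp = c("a", "b"))
don <- data.frame(x = c(8.5, 1.2, 9.4, 1.1), grp = c("a", "a", "b", "b"))

# For each recipient, search only among donors in the same class.
don.row <- vapply(seq_len(nrow(rec)), function(i) {
  same <- which(don$grp == rec$grp[i])               # donors in the class
  same[which.min(abs(don$x[same] - rec$x[i]))]       # closest one among them
}, integer(1))

don.row
```

Without classes the first recipient would be matched to donor 4 (x = 1.1), which is closer but belongs to a different class; restricting the search to class "a" selects donor 2 instead.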
This function finds a donor record in data.don for each record in data.rec. In the unconstrained case, it searches for the closest donor record according to the chosen distance function. When several donors are available at the minimum distance for a given recipient record, one of them is picked at random.
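A minimal base R sketch of the unconstrained search is given here, assuming numeric matching variables and the Manhattan distance. The helper nn_donor_sketch is hypothetical and is not part of the package; it only mirrors the logic described above, including the random choice among tied donors.

```r
# Hypothetical helper, not part of StatMatch: unconstrained nearest
# neighbour donor search with Manhattan distance and random tie-breaking.
nn_donor_sketch <- function(rec, don) {
  rec <- as.matrix(rec)
  don <- as.matrix(don)
  don.idx <- integer(nrow(rec))
  min.dist <- numeric(nrow(rec))
  for (i in seq_len(nrow(rec))) {
    # Manhattan distance between recipient i and every donor
    d <- rowSums(abs(sweep(don, 2, rec[i, ])))
    closest <- which(d == min(d))
    # pick one donor at random when several share the minimum distance
    don.idx[i] <- closest[sample.int(length(closest), 1)]
    min.dist[i] <- d[don.idx[i]]
  }
  list(don.idx = don.idx, min.dist = min.dist)
}

# Toy data (made up): two recipients, three donors, two matching variables
rec <- rbind(c(0, 0), c(5, 5))
don <- rbind(c(1, 0), c(4, 5), c(10, 10))
res <- nn_donor_sketch(rec, don)
res$don.idx
res$min.dist
```

Note that nothing prevents the same donor from being selected for several recipients in this setting.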
In the constrained case a donor can be used only a fixed number of times, as specified by the k argument, and the whole set of donors is chosen so as to minimize the overall matching distance. When k=1 the number of units (rows) in the donor data set has to be greater than or equal to the number of units in the recipient data set; when donation classes are used, this condition must be satisfied in each donation class. For further details on nearest neighbor distance hot deck refer to Chapter 2 in D'Orazio et al. (2006).
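The difference between the two cases can be seen on a toy distance matrix. The sketch below uses made-up distances and a brute-force enumeration in base R, which is only feasible for tiny problems; it is not the algorithm used by the function, and is meant only to show what minimizing the overall matching distance with k=1 implies.

```r
# Toy distance matrix (made up): rows = recipients, columns = donors
d <- rbind(c(1, 2, 9),
           c(1, 5, 9))

# Unconstrained choice: each recipient takes its own closest donor,
# so donor 1 ends up being used twice.
greedy <- apply(d, 1, which.min)

# Constrained choice with k = 1: enumerate every assignment of distinct
# donors to the two recipients and keep the one with the smallest total.
cand <- expand.grid(rec1 = seq_len(ncol(d)), rec2 = seq_len(ncol(d)))
cand <- cand[cand$rec1 != cand$rec2, ]
cand$total <- d[cbind(1, cand$rec1)] + d[cbind(2, cand$rec2)]
best <- cand[which.min(cand$total), ]

greedy
best
```

Here the constrained solution gives donor 2 to the first recipient and donor 1 to the second, for a total distance of 3. Individual recipients may therefore receive a donor that is not their nearest one.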
This function can also be used to impute missing values in a data set using the nearest neighbor distance hot deck. In this case data.rec is the part of the initial data set that contains missing values on the target variable, while data.don is the part of the data set without missing values on it. See the R code in the Examples for details.
Please note that only the “Gower” and “minimax” distance functions allow for the presence of missing values (NAs) in the variables used in computing distances. In both cases, when one of the two observations has an NA on a variable, that variable is excluded from the computation of the distance between them.
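The exclusion rule can be illustrated with a simplified Gower-type distance for numeric variables. The helper gower_na_sketch and its ranges argument are hypothetical; the actual computation performed by the package may differ in its details.

```r
# Hypothetical helper: Gower-type distance between two numeric records,
# i.e. the mean of range-scaled absolute differences, skipping every
# variable that is NA in at least one of the two records.
gower_na_sketch <- function(x, y, ranges) {
  ok <- !is.na(x) & !is.na(y)
  if (!any(ok)) return(NA_real_)   # no variable observed in both records
  mean(abs(x[ok] - y[ok]) / ranges[ok])
}

x <- c(5.0, NA, 1.5)   # second variable is missing in x
y <- c(6.0, 3.0, 2.5)
rng <- c(4, 2, 2)      # assumed ranges of the three variables
gower_na_sketch(x, y, rng)
```

Only the first and third variables contribute here, so the result is the mean of 1/4 and 1/2, i.e. 0.375.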
An R list with the following components:
mtc.ids |
A matrix with the same number of rows of |
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor. |
noad |
When |
call |
How the function has been called. |
Marcello D'Orazio mdo.statmatch@gmail.com
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Hornik K. (2012). clue: Cluster ensembles. R package version 0.3-45. https://CRAN.R-project.org/package=clue.
Rodgers, W.L. (1984). “An evaluation of statistical matching”. Journal of Business and Economic Statistics, 2, 91–102.
Singh, A.C., Mantel, H., Kinack, M. and Rowe, G. (1993). “Statistical matching: use of auxiliary information as an alternative to the conditional independence assumption”. Survey Methodology, 19, 59–79.
RANDwNND.hotdeck
# create the classical matching framework
lab <- c(1:15, 51:65, 101:115)
iris.rec <- iris[lab, c(1:3,5)] # recipient data.frame
iris.don <- iris[-lab, c(1:2,4:5)] #donor data.frame
# Now iris.rec and iris.don have the variables
# "Sepal.Length", "Sepal.Width" and "Species"
# in common.
# "Petal.Length" is available only in iris.rec
# "Petal.Width" is available only in iris.don
# Find the closest donors computing distance
# on "Sepal.Length" and "Sepal.Width"
# unconstrained case, Manhattan distance (the default for dist.fun)
out.NND.1 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
match.vars=c("Sepal.Length", "Sepal.Width") )
# create the synthetic data.set:
# fill in "Petal.Width" in iris.rec
fused.1 <- create.fused(data.rec=iris.rec, data.don=iris.don,
mtc.ids=out.NND.1$mtc.ids, z.vars="Petal.Width")
head(fused.1)
# Find the closest donors computing distance
# on "Sepal.Length", "Sepal.Width" and Species;
# unconstrained case, Gower's distance
out.NND.2 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
match.vars=c("Sepal.Length", "Sepal.Width", "Species"),
dist.fun="Gower")
# find the closest donors using "Species" to form donation classes
# and "Sepal.Length" and "Sepal.Width" to compute distance;
# unconstrained case.
out.NND.3 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
match.vars=c("Sepal.Length", "Sepal.Width"),
don.class="Species")
# find the donors using "Species" to form donation classes
# and "Sepal.Length" and "Sepal.Width" to compute distance;
# constrained case, "Hungarian" algorithm
library(clue)
out.NND.4 <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
match.vars=c("Sepal.Length", "Sepal.Width"),
don.class="Species", constrained=TRUE,
constr.alg="Hungarian")
# Example of Imputation of missing values.
# Introducing missing values in iris
ir.mat <- iris
miss <- rbinom(nrow(iris), 1, 0.3)
ir.mat[miss==1,"Sepal.Length"] <- NA
iris.rec <- ir.mat[miss==1,-1]
iris.don <- ir.mat[miss==0,]
#search for NND donors
imp.NND <- NND.hotdeck(data.rec=iris.rec, data.don=iris.don,
match.vars=c("Sepal.Width","Petal.Length", "Petal.Width"),
don.class="Species")
# imputing missing values
iris.rec.imp <- create.fused(data.rec=iris.rec, data.don=iris.don,
mtc.ids=imp.NND$mtc.ids, z.vars="Sepal.Length")
# rebuild the imputed data.frame
final <- rbind(iris.rec.imp, iris.don)
head(final)