impute.NN_HD: The Nearest Neighbor Hot Deck Algorithms


View source: R/HotDeckImputation.R

Description

A comprehensive function that performs nearest neighbor hot deck imputation. Aspects such as variable weighting, distance types, and donor limiting are implemented. New concepts such as the optimal distribution of donors are also available.
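
As a rough illustration of the idea, the following base R sketch fills each incomplete row with values from its closest complete row. It is not the package's implementation: the helper name nn_hot_deck_sketch is hypothetical, and the use of range-scaled Manhattan distances over the recipient's observed columns is an assumption made for this illustration. Donor limits and the other options are left out.

```r
# Hypothetical helper, NOT the package implementation: a bare-bones
# nearest neighbor hot deck using range-scaled Manhattan distances.
nn_hot_deck_sketch <- function(X) {
  X <- as.matrix(X)
  incomplete <- !complete.cases(X)
  donors <- which(!incomplete)      # fully observed rows
  recipients <- which(incomplete)   # rows with at least one NA
  stopifnot(length(donors) > 0)

  # Scale each column by its observed range so no variable dominates.
  ranges <- apply(X, 2, function(v) diff(range(v, na.rm = TRUE)))
  ranges[ranges == 0] <- 1
  Z <- sweep(X, 2, ranges, "/")

  for (r in recipients) {
    observed <- !is.na(Z[r, ])
    # Manhattan distance to every donor over the recipient's observed columns.
    d <- rowSums(abs(sweep(Z[donors, observed, drop = FALSE], 2, Z[r, observed])))
    nearest <- donors[which.min(d)]
    # Copy the donor's values into the recipient's missing cells.
    X[r, !observed] <- X[nearest, !observed]
  }
  X
}

# Toy data: row 4 is the only recipient; row 3 is its closest donor.
X <- rbind(c(1, 10, 100),
           c(2, 20, 200),
           c(9, 90, 900),
           c(8, NA,  NA))
result <- nn_hot_deck_sketch(X)
result
```

In this toy matrix the recipient is matched on its single observed column, so the missing cells are taken from the donor whose first value lies closest to 8.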

Usage

impute.NN_HD(DATA = NULL, distance = "man", weights = "range", attributes = "sim",
 comp = "rw_dist", donor_limit = Inf, optimal_donor = "no",
 list_donors_recipients = NULL, diagnose = NULL)

Arguments

DATA

Data containing missing values. If DATA is a data.frame, factors and strings are recoded using model.matrix; any other input is coerced by data.matrix.

distance

Distance type to use when searching for the nearest neighbor. See the details section for options.

weights

Weights by which the variables should be scaled. See the details section for options.

attributes

Determines how attributes should be handled. Currently only "sim", meaning donor and recipient pools are disjoint, is implemented.

comp

Defines the compensation of missing values for distance calculation. See the details section for options.

donor_limit

Limits how often a single donor may be used for imputation. See the details section for options.

optimal_donor

Defines how the optimal donor is found when a donor limit is used. See the details section for options.

list_donors_recipients

Option for manually specifying the donor and recipient pools via a list with components "donors" and "recipients".

diagnose

Option to recover the generated distances and donor-recipient matches. See the details section for usage.
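
For list_donors_recipients, a list of the expected shape might be built as follows. This is only a sketch: that the two components hold row indices of DATA is an assumption not stated on this page, and the toy matrix Y and the name pools are made up for illustration. The call to impute.NN_HD is left commented out so that the snippet runs without the package.

```r
# Toy data: rows 1-3 are complete, rows 4-5 contain missing values.
Y <- rbind(c(1, 10, 100),
           c(2, 20, 200),
           c(9, 90, 900),
           c(8, NA, 800),
           c(3, 30,  NA))

# Assumption: each component holds the row indices of the respective pool.
pools <- list(donors     = which(complete.cases(Y)),
              recipients = which(!complete.cases(Y)))
str(pools)

# With the HotDeckImputation package loaded, the list would then be passed as:
# impute.NN_HD(DATA = Y, list_donors_recipients = pools)
```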

Details

argument: distance can be defined as:

argument: weights can be defined as:

argument: comp can be defined as:

argument: donor_limit is a single number interpreted by its range:

argument: optimal_donor is a single string interpreted by its value:

argument: diagnose should be a character string giving the name of the variable that will be created in .GlobalEnv.

Value

An imputed data matrix the same size as the input DATA. If the diagnose option is used correctly, a list containing the following components will be created in the workspace:

distances

the donor-recipient distance matrix used for matching

list_donors_recipients

the resultant recipient-donor matches

Author(s)

Dieter William Joenssen Dieter.Joenssen@googlemail.com

References

Andridge, R.R. and Little, R.J.A. (2010) A Review of Hot Deck Imputation for Survey Non-response. International Statistical Review. 78, 40–64.

Bankhofer, U. and Joenssen, D.W. (2014) On Limiting Donor Usage for Imputation of Missing Data via Hot Deck Methods. In: M. Spiliopoulou, L. Schmidt-Thieme, and R. Jannings (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis and Knowledge Organization, 3–11. Berlin/Heidelberg: Springer.

Domschke, W. (1995) Logistik: Transport. Munich: Oldenbourg. [in German]

Ford, B. (1983) An Overview of Hot Deck Procedures. In: W. Madow, H. Nisselson and I. Olkin (Eds.): Incomplete Data in Sample Surveys. New York: Academic Press, 185–207.

Joenssen, D.W. (2015) Donor Limited Hot Deck Imputation: A Constrained Optimization Problem. In: B. Lausen, S. Krolak-Schwerdt, and M. Böhmer (Eds.): Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis and Knowledge Organization, 319–328. Berlin/Heidelberg: Springer.

Joenssen, D.W. (2015) Hot-Deck-Verfahren zur Imputation fehlender Daten – Auswirkungen des Donor-Limits. Ilmenau: Ilmedia. [in German, Dissertation]

Joenssen, D.W. and Bankhofer, U. (2012) Donor Limited Hot Deck Imputation: Effects on Parameter Estimation. Journal of Theoretical and Applied Computer Science. 6, 58–70.

Kalton, G. and Kasprzyk, D. (1986) The Treatment of Missing Survey Data. Survey Methodology. 12, 1–16.

Sande, I. (1983) Hot-Deck Imputation Procedures. In: W. Madow, H. Nisselson and I. Olkin (Eds.): Incomplete Data in Sample Surveys. New York: Academic Press, 339–349.

See Also

impute.mean, match.d_r_vam, reweight.data

Examples

# Set the random seed to an arbitrary number
set.seed(421)

# Generate a random 10x4 integer matrix
Y <- matrix(sample(x = 1:100, size = 10 * 4), nrow = 10)

# Remove 5 values, keeping column 1 complete and rows 1 to 5 as complete donors
Y[-c(1:5), -1][sample(1:15, size = 5)] <- NA

# Impute using various (arbitrarily chosen) settings
impute.NN_HD(DATA = Y, distance = "man", weights = "var")

impute.NN_HD(DATA = Y, distance = 2, weights = rep(.5, 4), donor_limit = 2,
             optimal_donor = "mmin")

impute.NN_HD(DATA = Y, distance = "eukl", weights = .25, comp = "mean",
             donor_limit = 1, optimal_donor = "odd")

# Recover some diagnostics
impute.NN_HD(DATA = Y, distance = "eukl", weights = .25, comp = "mean",
             donor_limit = 1, optimal_donor = "odd", diagnose = "diagnostics")

# Look at the diagnostics
diagnostics

Example output

      [,1] [,2] [,3] [,4]
 [1,]   79   29   73   87
 [2,]   15   32   96   54
 [3,]   71   26   25    1
 [4,]   31   65   78   57
 [5,]   82   70   94   24
 [6,]   67   26    8    1
 [7,]   85   27   95   87
 [8,]   64   48   83   87
 [9,]   49   12   36   81
[10,]   20   43   96   97
      [,1] [,2] [,3] [,4]
 [1,]   79   29   73   87
 [2,]   15   32   96   54
 [3,]   71   26   25    1
 [4,]   31   65   78   57
 [5,]   82   70   94   24
 [6,]   67   26    8    1
 [7,]   85   27   95   87
 [8,]   64   48   83   87
 [9,]   49   12   36   81
[10,]   20   43   96   97
      [,1] [,2] [,3] [,4]
 [1,]   79   29   73   87
 [2,]   15   32   96   54
 [3,]   71   26   25    1
 [4,]   31   65   78   57
 [5,]   82   70   94   24
 [6,]   67   12    8   81
 [7,]   85   27   95   87
 [8,]   64   48   83   57
 [9,]   49   12   36   81
[10,]   20   43   96   97
      [,1] [,2] [,3] [,4]
 [1,]   79   29   73   87
 [2,]   15   32   96   54
 [3,]   71   26   25    1
 [4,]   31   65   78   57
 [5,]   82   70   94   24
 [6,]   67   12    8   81
 [7,]   85   27   95   87
 [8,]   64   48   83   57
 [9,]   49   12   36   81
[10,]   20   43   96   97
$arguements
$arguements$distance
[1] "eukl"

$arguements$weights
[1] 0.25

$arguements$attributes
[1] "sim"

$arguements$comp
[1] "mean"

$arguements$donor_limit
[1] 1

$arguements$optimal_donor
[1] "odd"

$arguements$list_donors_recipients
NULL

$arguements$diagnose
[1] "diagnostics"


$distances
          6         7         8        10
1 1338.5433  351.7347  392.2347  958.9444
2 2627.3410 1234.1990  709.1990  733.8611
3  911.2457 2066.2704 1766.2704 3433.1944
4 1716.5791 1162.2704  350.7704  591.3611
5 2420.7656  741.7347  509.2347 2680.9444
9  601.3449 1391.0918 1073.0918  729.6111

$list_donors_recipients
     recipient donor
[1,]         6     9
[2,]         7     1
[3,]         8     4
[4,]        10     2
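
The relationship between the two diagnostic components can be checked by hand. The sketch below copies the printed distance matrix and, assuming that optimal_donor = "odd" with donor_limit = 1 selects the matching with the smallest total distance in which no donor is used twice (an interpretation, not something this page states), searches all such matchings by brute force. The object names are made up for the illustration, and the brute-force search is not the algorithm used by the package.

```r
# Distance matrix copied from the diagnostics output above
# (rows = donors, columns = recipients).
distances <- matrix(
  c(1338.5433,  351.7347,  392.2347,  958.9444,
    2627.3410, 1234.1990,  709.1990,  733.8611,
     911.2457, 2066.2704, 1766.2704, 3433.1944,
    1716.5791, 1162.2704,  350.7704,  591.3611,
    2420.7656,  741.7347,  509.2347, 2680.9444,
     601.3449, 1391.0918, 1073.0918,  729.6111),
  nrow = 6, byrow = TRUE,
  dimnames = list(c("1", "2", "3", "4", "5", "9"), c("6", "7", "8", "10"))
)

# Enumerate every way of picking one donor row per recipient column.
candidates <- as.matrix(
  expand.grid(rep(list(seq_len(nrow(distances))), ncol(distances)))
)

# Donor limit of 1: keep only assignments in which no donor is reused.
feasible <- candidates[apply(candidates, 1, function(r) !anyDuplicated(r)), ,
                       drop = FALSE]

# Total distance of each feasible assignment.
total_distance <- apply(feasible, 1, function(r) {
  sum(distances[cbind(r, seq_along(r))])
})

# Translate the best assignment back to the original row labels.
best <- feasible[which.min(total_distance), ]
matches <- cbind(recipient = as.integer(colnames(distances)),
                 donor     = as.integer(rownames(distances)[best]))
matches
```

Under that assumption the search arrives at the same recipient-donor pairs as the printed list_donors_recipients. Recipient 10 is not given its nearest donor (4) because that donor is already needed for recipient 8.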

HotDeckImputation documentation built on May 2, 2019, 6:41 a.m.