A comprehensive function that performs nearest neighbor hot deck imputation. Aspects such as variable weighting, distance types, and donor limiting are implemented. New concepts such as the optimal distribution of donors are also available.
impute.NN_HD(DATA = NULL, distance = "man", weights = "range", attributes = "sim",
             comp = "rw_dist", donor_limit = Inf, optimal_donor = "no",
             list_donors_recipients = NULL, diagnose = NULL)
DATA |
Data containing missing values. Must either be |
distance |
Distance type to use when searching for the nearest neighbor. See the details section for options. |
weights |
Weights by which the variables should be scaled. See the details section for options. |
attributes |
Determines how attributes should be handled. Currently only "sim", meaning donor and recipient pools are disjoint, is implemented. |
comp |
Defines the compensation of missing values for distance calculation. See the details section for options. |
donor_limit |
Limits how often a donor may function as such. See the details section for options. |
optimal_donor |
Defines how the optimal donor is found when a donor limit is used. See the details section for options. |
list_donors_recipients |
Option for manually specifying the donor and recipient pools via a list with components "donors" and "recipients". |
diagnose |
Option to recover the generated distances and donor-recipient matches. See details section for usage. |
argument: distance
can be defined as:
numeric matrix, donors x recipients distance matrix
numeric length = 1, Minkowski parameter
string length = 1, distance metric to be used:
"man", Manhattan distance
"eukl", Euclidean distance
"tscheb", Chebyshev distance
"mahal", Mahalanobis distance (covariance matrix calculated after missing data compensation, incompatible with comp="rw_dist")
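As a rough illustration of how the first three metrics relate to the Minkowski parameter, the following base R sketch computes them for one donor-recipient pair. The helpers minkowski() and chebyshev() are hypothetical names introduced here, not functions of the package, and the package may additionally apply weights or use a monotone transform of these values.

```r
# Hypothetical helpers, not part of the package.
# Minkowski distance with parameter p: p = 1 is Manhattan, p = 2 is Euclidean.
minkowski <- function(x, y, p) sum(abs(x - y)^p)^(1 / p)

# Chebyshev distance: the limiting case of Minkowski as p grows.
chebyshev <- function(x, y) max(abs(x - y))

# Two complete rows used purely as example values
donor     <- c(79, 29, 73, 87)
recipient <- c(85, 27, 95, 87)

d_man    <- minkowski(donor, recipient, p = 1)  # sum of absolute differences
d_eukl   <- minkowski(donor, recipient, p = 2)  # root of summed squares
d_tscheb <- chebyshev(donor, recipient)         # largest absolute difference
```

The absolute differences here are 6, 2, 22, and 0, so the Manhattan value is 30 and the Chebyshev value is 22.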
argument: weights
can be defined as:
string length = 1, reweighting type "range", "sd", "var", or "none"
numeric length = 1, one numeric weight for all variables
string vector, like option 1, only different type for each variable (not implemented)
numeric vector, like option 2, only different numeric weight for each variable
list, mixture of options 3 and 4 (not implemented)
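One plausible reading of the string options is that every variable is divided by a measure of its own spread before distances are computed. The sketch below works under that assumption only; rescale_columns() is a hypothetical helper, and the package's own reweight.data may scale differently.

```r
# Hypothetical helper: divide each column by its range, sd, or variance.
# This is an assumed interpretation of the weights argument, not the package code.
rescale_columns <- function(X, type = c("range", "sd", "var", "none")) {
  type <- match.arg(type)
  spread <- switch(type,
    range = apply(X, 2, function(v) diff(range(v, na.rm = TRUE))),
    sd    = apply(X, 2, sd, na.rm = TRUE),
    var   = apply(X, 2, var, na.rm = TRUE),
    none  = rep(1, ncol(X))
  )
  # sweep() divides column j by spread[j]
  sweep(X, 2, spread, "/")
}

X <- cbind(a = c(1, 2, 3), b = c(10, 30, 50))
X_range <- rescale_columns(X, "range")  # every column now has range 1
X_sd    <- rescale_columns(X, "sd")     # every column now has sd 1
```

Under this reading, variables measured on very different scales contribute comparably to the donor-recipient distance.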
argument: comp
can be defined as:
"rw_dist", total distance is reweighted by number of distances that may be computed
"mean", mean imputation is performed before distance calculation
"rseq", random hot deck imputation, each variable sequentially (uses impute.SEQ_HD)
"rsim", random hot deck imputation, each variable simultaneously (not implemented)
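To make the "rw_dist" idea concrete, the sketch below computes a Manhattan distance over only those variables observed in both rows and scales it up by the share of variables that could be used. The function rw_manhattan() is hypothetical, and the exact reweighting formula used by the package is an assumption here.

```r
# Hypothetical helper illustrating distance reweighting under missingness.
# Assumption: the partial distance is scaled by (all variables / usable variables).
rw_manhattan <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)          # variables observed in both rows
  if (!any(ok)) return(NA_real_)       # no comparable variable at all
  sum(abs(x[ok] - y[ok])) * length(x) / sum(ok)
}

recipient <- c(1, 2, NA, 4)
donor     <- c(2, 2, 5, 6)

# Three of four variables are usable; their absolute differences sum to 3.
d_rw <- rw_manhattan(recipient, donor)
```

With three usable variables out of four, the partial distance of 3 is scaled by 4/3, so rows with many missing values are not favoured merely because fewer differences enter the sum.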
argument: donor_limit
is a single number interpreted by its range:
(0,1), dynamic donor limit, e.g., 0.5 means any one donor may serve up to 50% of all recipients, rounded up if fractional
[1,Inf), static donor limit, e.g., 2 means any one donor may serve up to 2 recipients, fractional parts discarded
Inf, no donor limit
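The three ranges above can be sketched as a small helper that turns donor_limit into the maximum number of recipients per donor. The name resolve_donor_limit() and its n_recipients parameter are hypothetical and only mirror the rules stated above; the package's internal handling may differ in detail.

```r
# Hypothetical helper mirroring the documented interpretation of donor_limit.
resolve_donor_limit <- function(donor_limit, n_recipients) {
  stopifnot(length(donor_limit) == 1, donor_limit > 0)
  if (is.infinite(donor_limit)) return(Inf)       # no donor limit
  if (donor_limit < 1) {
    # dynamic limit: share of all recipients, rounded up
    return(ceiling(donor_limit * n_recipients))
  }
  floor(donor_limit)                              # static limit, fraction discarded
}

lim_dynamic <- resolve_donor_limit(0.5, n_recipients = 5)  # ceiling(2.5)
lim_static  <- resolve_donor_limit(2.7, n_recipients = 5)  # floor(2.7)
lim_none    <- resolve_donor_limit(Inf, n_recipients = 5)
```

With five recipients, a dynamic limit of 0.5 allows each donor up to 3 uses, while a static limit of 2.7 allows 2.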
argument: optimal_donor
is a single string interpreted by its value:
"no", donor-recipient matching is performed in the order in which the recipients appear in the data (fastest)
"rand", donor-recipient matching is performed in a random recipient-order
"mmin", donor-recipient matching is performed by the matrix minimum method (sequence independent)
"modifvam", donor-recipient matching is performed by a modified (only columns considered) Vogel's approximation method (sequence independent)
"vam", donor-recipient matching is performed by Vogel's approximation method (sequence independent)
"odd", donor-recipient matching is performed via the optimal donor distribution method (sequence independent, best results)
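The difference between sequence-dependent and sequence-independent matching can be seen in a small base R sketch. Both functions below are hypothetical illustrations, not the package's implementation: match_in_order() follows the idea of "no", and match_matrix_min() follows the usual matrix minimum heuristic of repeatedly taking the smallest remaining distance. Both assume a donors x recipients matrix with row and column names and enough donor capacity for all recipients.

```r
# Hypothetical sketch: recipients pick their nearest available donor in data order.
match_in_order <- function(D, limit = Inf) {
  stopifnot(nrow(D) * limit >= ncol(D))
  uses  <- setNames(rep(0, nrow(D)), rownames(D))
  donor <- setNames(character(ncol(D)), colnames(D))
  for (r in colnames(D)) {
    d <- D[, r]
    d[uses >= limit] <- Inf            # exhausted donors are unavailable
    best <- names(which.min(d))
    donor[r] <- best
    uses[best] <- uses[best] + 1
  }
  donor
}

# Hypothetical sketch: always match the globally smallest remaining distance.
match_matrix_min <- function(D, limit = Inf) {
  stopifnot(nrow(D) * limit >= ncol(D))
  uses  <- setNames(rep(0, nrow(D)), rownames(D))
  donor <- setNames(character(ncol(D)), colnames(D))
  W <- D
  for (step in seq_len(ncol(D))) {
    pos <- arrayInd(which.min(W), dim(W))
    i <- pos[1]; j <- pos[2]
    donor[j] <- rownames(D)[i]
    uses[i] <- uses[i] + 1
    W[, j] <- Inf                      # recipient j is now served
    if (uses[i] >= limit) W[i, ] <- Inf
  }
  donor
}

# Sum of the distances of the chosen donor-recipient pairs
total_distance <- function(D, donor) sum(D[cbind(donor, names(donor))])

# Example distances: rows are donors, columns are recipients
D <- matrix(c(1, 3, 0.5, 4), nrow = 2,
            dimnames = list(c("d1", "d2"), c("r1", "r2")))

m_order <- match_in_order(D, limit = 1)
m_mmin  <- match_matrix_min(D, limit = 1)
```

In this example the in-order matching gives r1 its nearest donor d1 and leaves r2 with d2, for a total distance of 5, whereas the matrix minimum matching pairs d1 with r2 first and reaches a total of 3.5.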
argument: diagnose
should be:
NULL, no diagnostics will be returned.
character string, desired variable name under which the diagnostics will be saved to .GlobalEnv. The following character strings will however default to NULL with a warning:
"if", "else", "repeat", "while", "function", "for", "in", "next", "break", "TRUE", "FALSE", "NULL", "Inf", "NaN", "NA", "NA_integer_", "NA_real_", "NA_complex_", "NA_character_", "c", "q", "s", "t", "C", "D", "F", "I", "T"
anything else, defaults to NULL with a warning.
Should be a character string of the desired variable name which will be created in .GlobalEnv
An imputed data matrix the same size as the input DATA.
If the diagnose option is used correctly, a list containing the following components will be created in the workspace:
distances |
the donor-recipient distance matrix used for matching |
list_donors_recipients |
the resultant recipient-donor matches |
Dieter William Joenssen Dieter.Joenssen@googlemail.com
Andridge, R.R. and Little, R.J.A. (2010) A Review of Hot Deck Imputation for Survey Non-response. International Statistical Review. 78, 40–64.
Bankhofer, U. and Joenssen, D.W. (2014) On Limiting Donor Usage for Imputation of Missing Data via Hot Deck Methods. In: M. Spiliopoulou, L. Schmidt-Thieme, and R. Jannings (Eds.): Data Analysis, Machine Learning and Knowledge Discovery. Studies in Classification, Data Analysis and Knowledge Organization, 3–11. Berlin/Heidelberg: Springer.
Domschke, W. (1995) Logistik: Transport. Munich: Oldenbourg. [in German]
Ford, B. (1983) An Overview of Hot Deck Procedures. In: W. Madow, H. Nisselson and I. Olkin (Eds.): Incomplete Data in Sample Surveys. New York: Academic Press, 185–207.
Joenssen, D.W. (2015) Donor Limited Hot Deck Imputation: A Constrained Optimization Problem. In: B. Lausen, S. Krolak-Schwerdt, and M. Böhmer (Eds.): Data Science, Learning by Latent Structures, and Knowledge Discovery. Studies in Classification, Data Analysis and Knowledge Organization, 319–328. Berlin/Heidelberg: Springer.
Joenssen, D.W. (2015) Hot-Deck-Verfahren zur Imputation fehlender Daten – Auswirkungen des Donor-Limits. Ilmenau: Ilmedia. [in German, Dissertation]
Joenssen, D.W. and Bankhofer, U. (2012) Donor Limited Hot Deck Imputation: Effects on Parameter Estimation. Journal of Theoretical and Applied Computer Science. 6, 58–70.
Kalton, G. and Kasprzyk, D. (1986) The Treatment of Missing Survey Data. Survey Methodology. 12, 1–16.
Sande, I. (1983) Hot-Deck Imputation Procedures. In: W. Madow, H. Nisselson and I. Olkin (Eds.): Incomplete Data in Sample Surveys. New York: Academic Press, 339–349.
impute.mean, match.d_r_vam, reweight.data
#Load the package that provides impute.NN_HD
library(HotDeckImputation)
#Set the random seed to an arbitrary number
set.seed(421)
#Generate random integer matrix size 10x4
Y<-matrix(sample(x=1:100,size=10*4),nrow=10)
#remove 5 values, ensuring one complete covariate and 5 donors
Y[-c(1:5),-1][sample(1:15,size=5)]<-NA
#Impute using various different (arbitrarily chosen) settings
impute.NN_HD(DATA=Y,distance="man",weights="var")
impute.NN_HD(DATA=Y,distance=2,weights=rep(.5,4),donor_limit=2,optimal_donor="mmin")
impute.NN_HD(DATA=Y,distance="eukl",weights=.25,comp="mean",donor_limit=1,
             optimal_donor="odd")
#Recover some diagnostics
impute.NN_HD(DATA=Y,distance="eukl",weights=.25,comp="mean",donor_limit=1,
             optimal_donor="odd",diagnose = "diagnostics")
#Look at the diagnostics
diagnostics
[,1] [,2] [,3] [,4]
[1,] 79 29 73 87
[2,] 15 32 96 54
[3,] 71 26 25 1
[4,] 31 65 78 57
[5,] 82 70 94 24
[6,] 67 26 8 1
[7,] 85 27 95 87
[8,] 64 48 83 87
[9,] 49 12 36 81
[10,] 20 43 96 97
[,1] [,2] [,3] [,4]
[1,] 79 29 73 87
[2,] 15 32 96 54
[3,] 71 26 25 1
[4,] 31 65 78 57
[5,] 82 70 94 24
[6,] 67 26 8 1
[7,] 85 27 95 87
[8,] 64 48 83 87
[9,] 49 12 36 81
[10,] 20 43 96 97
[,1] [,2] [,3] [,4]
[1,] 79 29 73 87
[2,] 15 32 96 54
[3,] 71 26 25 1
[4,] 31 65 78 57
[5,] 82 70 94 24
[6,] 67 12 8 81
[7,] 85 27 95 87
[8,] 64 48 83 57
[9,] 49 12 36 81
[10,] 20 43 96 97
[,1] [,2] [,3] [,4]
[1,] 79 29 73 87
[2,] 15 32 96 54
[3,] 71 26 25 1
[4,] 31 65 78 57
[5,] 82 70 94 24
[6,] 67 12 8 81
[7,] 85 27 95 87
[8,] 64 48 83 57
[9,] 49 12 36 81
[10,] 20 43 96 97
$arguements
$arguements$distance
[1] "eukl"
$arguements$weights
[1] 0.25
$arguements$attributes
[1] "sim"
$arguements$comp
[1] "mean"
$arguements$donor_limit
[1] 1
$arguements$optimal_donor
[1] "odd"
$arguements$list_donors_recipients
NULL
$arguements$diagnose
[1] "diagnostics"
$distances
6 7 8 10
1 1338.5433 351.7347 392.2347 958.9444
2 2627.3410 1234.1990 709.1990 733.8611
3 911.2457 2066.2704 1766.2704 3433.1944
4 1716.5791 1162.2704 350.7704 591.3611
5 2420.7656 741.7347 509.2347 2680.9444
9 601.3449 1391.0918 1073.0918 729.6111
$list_donors_recipients
recipient donor
[1,] 6 9
[2,] 7 1
[3,] 8 4
[4,] 10 2