Performs multivariate predictive mean matching (PMM) on a set of columns that
have previously been imputed with the mice package.
It can also match imputations against observed variables.
mice.post.matching(obj, blocks = NULL, donors = 5L, weights = NULL,
  distmetric = "residual", matchtype = 1L, match_vars = NULL,
  ridge = 1e-05, minvar = 1e-04, maxcor = 0.99)
obj |
|
blocks |
Vector or list of vectors specifying the column tuples that are
to be imputed on blockwise. For a detailed explanation about the valid
formats of this parameter, see section input formats below.
Each element of each specified block has to be included in the
|
donors |
Integer or integer vector indicating the size of the donor pool
from which a draw is made in the matching step. If only a single number of
donors is specified, it is applied to all blocks; otherwise this parameter
needs to be a vector with as many elements as there are blocks, specifying
for each of these blocks how many donors are to be drawn from. |
weights |
Numeric vector or list of numeric vectors that assigns
weights to the columns of each block. The weights make it possible to
penalize differences in certain columns more or less heavily when computing
the distances in the matching step that determine the donor pool. Further
details on the valid formats of this parameter can be found below in section
input formats. |
distmetric |
Character string or character vector that determines which
mathematical metric we want to use to compute the distances between the
multivariate |
matchtype |
Integer or integer vector indicating the type of matching
distance to be used in PMM. If only a single matchtype is specified, it is
applied to all blocks; otherwise this parameter needs to be a vector with as
many elements as there are blocks, specifying for each of these blocks which
matchtype has to be used. |
match_vars |
Vector specifying for each tuple which additional variable
has to be matched against. Can be an integer or character vector, either
specifying column indices or column names. |
ridge |
The ridge penalty used in an internal call of
|
minvar |
The minimum variance that we require predictors to have when
building a linear model to compute the |
maxcor |
The maximum correlation that we allow predictors to have when
building a linear model to compute the |
List containing the following two elements:
midsobj
mice::mids object that differs from the input object only in the imputations
that have been post-processed, and in the call and loggedEvents attributes,
which have been updated. In particular, the post-processed imputations do not
affect the chainMean or chainVar attributes, and hence plot() will not
consider them either.
blocks
Set of column blocks in list format on which multivariate imputation has been
performed. It is equivalent to the input parameter blocks if that has been
specified by the user; otherwise the column tuples have been determined
internally.
Within mice.post.matching() and mice.binarize(), there are two formats that
can be used to specify the input parameters blocks and weights, namely the
list format and the vector format.
The basic idea behind the list format is that we exclusively specify
parameters for those column blocks that we want to impute on and summarize
them in a list where each element is a vector that represents one such
block, while in the vector format we use a single vector in which each
element represents one column in the data, and therefore also specify
information for columns that we do not want to impute on. The exact use of
these two formats for both the blocks and the weights parameter is
illustrated in the following.
blocks
1. List Format
To specify the imputation blocks using the list format, a list of atomic
vectors has to be passed to the blocks parameter, and each vector in this
list represents a column block that has to be imputed on.
These vectors can either contain the names of the columns in this block as
character strings or the corresponding column indices in numerical format.
Note that this input list must not contain any duplicate columns among and
within its elements. If there is only a single block to impute on, a
single atomic vector representing this tuple may also be passed to blocks.
Example:
Within this and the following examples of this section, we consider the
mammal_data data set, which contains 11 columns, out of which the column
tuples (sws, ps) and (mls, gt) have identical missing value patterns (check
?mammal_data for further details on the data).
If we now wanted to specify these blocks in list format, we would have to
write
blocks = list(c("sws","ps"), c("mls", "gt"))
or analogously, when using column indices instead,
blocks = list(c(4,5), c(7,8)).
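The requirement that a blocks list contains no duplicate columns can be checked up front. The following is a minimal sketch in base R; has_duplicate_columns() is a hypothetical helper introduced for illustration and is not part of miceExt.

```r
# Hypothetical helper (not a miceExt function): TRUE if any column occurs
# more than once among or within the elements of a blocks specification.
has_duplicate_columns <- function(blocks) {
  # a single atomic vector stands for a single block
  if (!is.list(blocks)) blocks <- list(blocks)
  anyDuplicated(unlist(blocks)) > 0
}

ok_blocks  <- has_duplicate_columns(list(c("sws", "ps"), c("mls", "gt")))
bad_blocks <- has_duplicate_columns(list(c("sws", "ps"), c("ps", "gt")))
```

Here ok_blocks is FALSE because all four columns are distinct, while bad_blocks is TRUE because "ps" appears in both blocks.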
2. Vector Format
If we want to specify the imputation blocks via the vector format, a single
vector with as many elements as there are columns in the data has to be
used. Each element of this vector assigns a block number to its
corresponding column, and columns that have the same number are imputed
together, while columns that are not to be imputed have to carry the value
0. All block numbers have to be integers, starting from 1, and the maximum
block number has to equal the total number of imputation blocks.
Example:
Again we want to blockwise impute the column tuples (sws, ps) and (mls, gt)
from mammal_data. To specify these blocks via the vector notation, we assign
block number 1 to the columns of the first tuple and block number 2 to the
columns of the second tuple, while all other columns have block number 0.
Hence, we would pass
blocks = c(0,0,0,1,1,0,2,2,0,0,0)
to mice.post.matching().
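To make the relation between the two formats concrete, here is a minimal sketch of how a block vector could be translated into the list format. blocks_vector_to_list() is a hypothetical helper written for this illustration; it is not the conversion routine used inside miceExt.

```r
# Hypothetical helper (not a miceExt function): convert a block vector
# into a list with one vector of column indices (or names) per block.
blocks_vector_to_list <- function(block_vec, col_names = NULL) {
  stopifnot(is.numeric(block_vec),
            all(block_vec == round(block_vec)),
            all(block_vec >= 0))
  n_blocks <- max(block_vec)
  if (n_blocks == 0) return(list())
  # block numbers have to run from 1 up to the number of blocks
  stopifnot(all(seq_len(n_blocks) %in% block_vec))
  lapply(seq_len(n_blocks), function(b) {
    idx <- which(block_vec == b)
    if (is.null(col_names)) idx else col_names[idx]
  })
}

blocks_list <- blocks_vector_to_list(c(0, 0, 0, 1, 1, 0, 2, 2, 0, 0, 0))
```

For the vector from the example, blocks_list holds the index pairs (4, 5) and (7, 8), which is the list format shown earlier.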
weights
1. List Format
To specify the imputation weights using the list format, the corresponding
imputation blocks must have been specified in list format as well. In this
case, the weights list has to have the same length as blocks, and each of
its elements has to be a numeric vector of the same length as its
corresponding column block, thereby assigning each of its columns a
(strictly positive) weight. If we do not want to apply weights to a
particular tuple, we can write a single 0, 1 or NULL in the corresponding
spot of the list.
Example:
In our example, we want to assign the sws column a weight 1.5 times higher
than that of ps, while not assigning any weights to the second tuple at all.
To achieve that, we specify
weights = list(c(3,2), NULL).
2. Vector Format
When specifying the imputation weights via the vector format, once again a
single vector with as many elements as there are columns in the data has to
be used, in which each element assigns a weight to its corresponding column.
Weights of columns that are not imputed on have no effect, while in every
block to which weights should not be applied, each column should carry the
same value. In general, the value 1 should be used as the standard weight
for all columns that either are not imputed on or are not weighted.
Example:
We want to assign the same weights to the first tuple as in the previous
example, while again not assigning weights to the second block. In this
case, we would use the vector
weights = c(1,1,1,3,2,1,1,1,1,1,1)
in which all the values of the unimputed and unweighted columns are 1.
Internally, mice.post.matching() converts both parameters into the list
format as this is more natural to iterate on. Hence, the output blocks
parameter is also in list format.
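As an illustration of that conversion for weights, the sketch below picks the entries of a weight vector that belong to each block. weights_vector_to_list() is a hypothetical helper, and it assumes that the blocks are already given as a list of column indices; the actual internal routine of miceExt may differ.

```r
# Hypothetical helper (not a miceExt function): split a weight vector into
# one weight vector per block, assuming blocks are lists of column indices.
weights_vector_to_list <- function(weight_vec, blocks_list) {
  stopifnot(is.numeric(weight_vec), all(weight_vec > 0))
  lapply(blocks_list, function(cols) weight_vec[cols])
}

weights_list <- weights_vector_to_list(
  c(1, 1, 1, 3, 2, 1, 1, 1, 1, 1, 1),
  list(c(4, 5), c(7, 8))
)
```

For the example vector, the first block receives the weights (3, 2) and the second block the uniform weights (1, 1), which leaves that block effectively unweighted.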
The algorithm basically iterates over the m imputations of the input
mice::mids object and each column tuple in blocks, each time following these
two main steps:
First, we iterate over the columns of the current block, collecting the
complete column y on which we want to impute, along with the matrix x of its
predictors. Among the values of y, we identify the observed values y_{obs}
and the missing values y_{mis} along with their corresponding predictors
x_{obs} and x_{mis}, restricting ourselves to only those values whose
predictors are complete, i.e. don't contain any missing values themselves.
Note that if the sets of predictor variables vary strongly among the columns
in the current tuple, it may occur that there are no common predictors, in
which case the function fails.
From the observed values and their predictors, we build a Bayesian linear
regression model and, depending on the specified matching type, use the
model's coefficients and drawn regression weights β to compute
the predicted means \hat{y}_{obs} and \hat{y}_{mis}, which we
then save in a matrix \hat{Y}.
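The prediction step for one column can be sketched as follows. This is not the code used by miceExt; it is a simplified base R illustration of a ridge-stabilized Bayesian linear regression draw as it is commonly described for univariate PMM. The function name draw_predicted_means() is hypothetical, and the choice of using the estimated coefficients for \hat{y}_{obs} and the drawn coefficients for \hat{y}_{mis} is an assumption that corresponds to what the mice documentation calls type-1 matching.

```r
# Hypothetical sketch (not miceExt code): predicted means for one column,
# assuming type-1 matching and a ridge penalty on the diagonal of X'X.
draw_predicted_means <- function(y_obs, x_obs, x_mis, ridge = 1e-05) {
  x_obs <- cbind(1, x_obs)                      # add intercept column
  x_mis <- cbind(1, x_mis)
  xtx <- crossprod(x_obs)
  pen <- ridge * diag(diag(xtx), nrow = ncol(xtx))
  v <- solve(xtx + pen)                         # (X'X + penalty)^{-1}
  coef <- v %*% crossprod(x_obs, y_obs)         # estimated coefficients
  resid <- y_obs - x_obs %*% coef
  df <- max(length(y_obs) - ncol(x_obs), 1)
  sigma_star <- sqrt(sum(resid^2) / rchisq(1, df))      # drawn residual sd
  beta_star <- coef +
    (t(chol((v + t(v)) / 2)) %*% rnorm(ncol(x_obs))) * sigma_star
  list(yhat_obs = drop(x_obs %*% coef),         # from estimated coefficients
       yhat_mis = drop(x_mis %*% beta_star))    # from drawn coefficients
}

# Toy data: 30 observed and 5 missing rows, two predictors
set.seed(42)
x <- matrix(rnorm(70), ncol = 2)
y <- drop(1 + x %*% c(2, -1) + rnorm(35, sd = 0.1))
obs <- 1:30
res <- draw_predicted_means(y[obs], x[obs, , drop = FALSE],
                            x[-obs, , drop = FALSE])
```

Repeating this for every column of a block and binding the resulting vectors column-wise gives one possible construction of the matrix \hat{Y} described above.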
After iterating over the columns in the current tuple and building the
matrix \hat{Y}, we match the multivariate predictions of the missing values
in \hat{Y} against the predictions of the observed values. More precisely,
for each \hat{y}_{mis}, we perform a k-nearest-neighbor search among all
values \hat{y}_{obs}, where k is the number of donors specified by the user,
and then randomly sample one element of those k nearest neighbors, which is
then used to impute the missing value tuple in y. The distance metric that
is used to determine the nearest neighbors is specified by the user as well,
and in the case of the Euclidean or Manhattan distance its computation is
straightforward. If the Mahalanobis distance or the residual distance (as
proposed by Little) has been selected, we first compute the corresponding
covariance matrix and its (pseudo) inverse via its eigendecomposition, and
then use it to transform the values \hat{y}_{mis} and \hat{y}_{obs} that are
fed into the nearest neighbor search. If weights have been specified as
well, they are also applied before the kNN search.
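The matching step can be sketched in base R as follows. Both functions are hypothetical illustrations and not part of miceExt: whiten_rows() shows how a covariance matrix can be turned into a transformation via its eigendecomposition so that plain Euclidean distance in the transformed space equals the distance induced by the (pseudo) inverse, and match_donors() performs the k-nearest-neighbor search with a random draw from the donor pool. Applying weights by multiplying the columns of the predictions is an assumption made for this sketch.

```r
# Hypothetical helpers for illustration; none of these names exist in miceExt.

# Transform the rows of yhat so that Euclidean distance afterwards equals
# the distance defined by the (pseudo) inverse of cov_mat.
whiten_rows <- function(yhat, cov_mat, tol = 1e-10) {
  e <- eigen(cov_mat, symmetric = TRUE)
  keep <- e$values > tol * max(e$values)        # drop near-zero eigenvalues
  w <- e$vectors[, keep, drop = FALSE] %*%
    diag(1 / sqrt(e$values[keep]), nrow = sum(keep))
  yhat %*% w
}

# For each row of yhat_mis, find the `donors` nearest rows of yhat_obs
# (Euclidean distance), draw one of them at random, and return the
# corresponding observed tuple from y_obs.
match_donors <- function(yhat_obs, yhat_mis, y_obs, donors = 5L,
                         weights = NULL) {
  yhat_obs <- as.matrix(yhat_obs)
  yhat_mis <- as.matrix(yhat_mis)
  y_obs <- as.matrix(y_obs)
  if (!is.null(weights)) {                      # assumed weighting scheme
    yhat_obs <- sweep(yhat_obs, 2, weights, `*`)
    yhat_mis <- sweep(yhat_mis, 2, weights, `*`)
  }
  k <- min(donors, nrow(yhat_obs))
  picked <- vapply(seq_len(nrow(yhat_mis)), function(i) {
    d <- sqrt(rowSums(sweep(yhat_obs, 2, yhat_mis[i, ], `-`)^2))
    pool <- order(d)[seq_len(k)]                # indices of the k nearest
    pool[sample.int(k, 1)]                      # random draw from the pool
  }, integer(1))
  y_obs[picked, , drop = FALSE]
}

# Toy example with a single donor, so the result is deterministic
yhat_obs <- rbind(c(0, 0), c(10, 10), c(20, 20))
y_obs    <- rbind(c(1, 2), c(3, 4), c(5, 6))
yhat_mis <- rbind(c(9, 9), c(1, 1))
imputed  <- match_donors(yhat_obs, yhat_mis, y_obs, donors = 1L)
```

With donors = 1L the first missing tuple receives the observed tuple (3, 4) and the second receives (1, 2). With larger donor pools the result depends on the random draw. For the Mahalanobis or residual distance, both prediction matrices would first be passed through whiten_rows() with the appropriate covariance matrix.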
If an additional observed variable to match against has been specified via
match_vars, the predictions of the observed and of the missing values are
first partitioned by the values of that external variable, and the matching
is then performed within each pair of corresponding subsets of that
partition.
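The effect of such a partition can be illustrated with a univariate sketch in which donors are only taken from observed rows that share the value of the external variable. match_by_group() is a hypothetical function, and leaving a value as NA when a group has no donors is a choice made for this sketch only, not documented behavior of miceExt.

```r
# Hypothetical sketch (not miceExt code): univariate matching in which the
# donor pool is restricted to observed rows with the same group value.
match_by_group <- function(yhat_obs, yhat_mis, y_obs,
                           group_obs, group_mis, donors = 5L) {
  imputed <- rep(NA_real_, length(yhat_mis))
  for (g in unique(group_mis)) {
    obs_idx <- which(group_obs == g)
    mis_idx <- which(group_mis == g)
    if (length(obs_idx) == 0) next              # no donors: stays NA here
    k <- min(donors, length(obs_idx))
    for (i in mis_idx) {
      d <- abs(yhat_obs[obs_idx] - yhat_mis[i])
      pool <- obs_idx[order(d)[seq_len(k)]]     # k nearest within the group
      imputed[i] <- y_obs[pool[sample.int(k, 1)]]
    }
  }
  imputed
}

# Toy example with a single donor, so the result is deterministic
grouped <- match_by_group(
  yhat_obs  = c(1, 2, 1, 2),
  yhat_mis  = c(1.1, 1.9),
  y_obs     = c(1, 2, 100, 200),
  group_obs = c("a", "a", "b", "b"),
  group_mis = c("b", "a"),
  donors    = 1L
)
```

The first missing value belongs to group "b" and therefore receives 100, although the observed value 1 from group "a" has an equally close prediction. The second missing value belongs to group "a" and receives 2.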
Note that the imputed values are only stored as a result and, unlike in the
mice algorithm, are not used to compute predictive means for any other
missing value. We exclusively use the imputed values that are provided
within the input mids object.
Tobias Schumacher, Philipp Gaffert
Little, R.J.A. (1988). Missing data adjustments in large surveys (with discussion). Journal of Business & Economic Statistics, 6, 287–301.
Van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC/Chapman & Hall, Boca Raton, FL.
Van Buuren, S., Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67. http://www.jstatsoft.org/v45/i03/
mice, mids-class, mice.binarize, mice.factorize
## Not run:
#------------------------------------------------------------------------------
# Example on modified 'mammalsleep' data set from mice, that has identical
# missing data patterns on the column tuples ('ps','sws') and ('mls','gt')
#------------------------------------------------------------------------------
# run mice on data set 'mammal_data' and obtain a mids object to post-process
mids_mammal <- mice(mammal_data)
# run function, as blocks have not been specified, it will automatically detect
# the column tuples with identical missing data patterns and then impute on
# these
post_mammal <- mice.post.matching(mids_mammal)
# read out which column tuples have been imputed on
post_mammal$blocks
# look into imputations within resulting mice::mids object
post_mammal$midsobj$imp
#------------------------------------------------------------------------------
# Example on original 'mammalsleep' data set from mice, in which we
# want to post-process the imputations in column 'sws' by only imputing values
# from rows whose value in 'pi' matches the value of 'pi' in the row we impute
# on.
#------------------------------------------------------------------------------
# run mice on data set 'mammalsleep' and obtain a mids object to post-process
mids_mammal <- mice(mammalsleep)
# run function, specify 'sws' as the column to impute on, and specify 'pi' as
# the observed variable to consider in the matching
post_mammal <- mice.post.matching(mids_mammal, blocks = "sws", match_vars = "pi")
# look into imputations within resulting mice::mids object
post_mammal$midsobj$imp
#------------------------------------------------------------------------------
# Examples illustrating the combined usage of blocks and weights, relating to
# the examples in the input format section above. As before, we want to impute
# on the column tuples ('ps','sws') and ('mls','gt') from mammal_data, while
# this time assigning weights to the first block, in which 'sws' gets a 1.5
# times higher weight than 'ps'. The second tuple is not weighted.
#------------------------------------------------------------------------------
# run mice() first
mids_mammal <- mice(mammal_data)
## Now there are five options to specify the blocks and weights:
# First option: specify blocks and weights in list format
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  blocks = list(c("sws","ps"), c("mls","gt")),
                                  weights = list(c(3,2), NULL))
# or equivalently, using column indices:
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  blocks = list(c(4,5), c(7,8)),
                                  weights = list(c(3,2), NULL))
# Second option: specify blocks and weights in vector format
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  blocks = c(0,0,0,1,1,0,2,2,0,0,0),
                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
# Third option: specify blocks in list format and weights in vector format
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  blocks = list(c("sws","ps"), c("mls","gt")),
                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
# Fourth option: specify blocks in vector format and weights in list format.
# Note that the block number determines which tuple in the weights list it
# corresponds to, and within each tuple in the list the weight correspondence is
# determined by the left-to-right order of the data columns
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  blocks = c(0,0,0,1,1,0,2,2,0,0,0),
                                  weights = list(c(3,2), NULL))
# Fifth option: only specify weights in vector format. If the user knows
# beforehand that at least the column tuples they want to impute and weight
# have the same missing value patterns, they can assign weights to these using
# the vector format, while letting mice.post.matching() find all other blocks
# with identical missing value patterns - possibly even more than just
# ('ps','sws') and ('mls','gt')
post_mammal <- mice.post.matching(obj = mids_mammal,
                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
#------------------------------------------------------------------------------
# Example that illustrates the combined functionalities of mice.binarize(),
# mice.factorize() and mice.post.matching() on the data set 'boys_data', which
# contains the column blocks ('hgt','bmi') and ('hc','gen','phb') that have
# identical missing value patterns, and out of which the columns 'gen' and
# 'phb' are factors. We are going to impute both tuples blockwise, while
# binarizing the factor columns first. Note that we never need to specify any
# blocks or columns to binarize, as these are all determined automatically
#------------------------------------------------------------------------------
# By default, mice.binarize() expands all factor columns that contain NAs,
# so the columns 'gen' and 'phb' are automatically binarized
boys_bin <- mice.binarize(boys_data)
# Run mice on binarized data, note that we need to use boys_bin$data to grab
# the actual binarized data and that we use the output predictor matrix
# boys_bin$pred_matrix which is recommended for obtaining better imputation
# models
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
# It is very likely that mice imputed multiple ones among one set of dummy
# variables, so we need to post-process. As recommended, we also use the output
# weights from mice.binarize(), which yield a more balanced weighting on the
# column tuple ('hc','gen','phb') within the matching. As in previous examples,
# both tuples are automatically discovered and imputed on
post_boys <- mice.post.matching(mids_boys, weights = boys_bin$weights)
# Now we can safely retransform to the original data, with non-binarized
# imputations
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
# Analyze the distribution of imputed variables, e.g. of the column 'gen',
# using the mice version of with()
with(res_boys, table(gen))
#------------------------------------------------------------------------------
# Similar example to the previous one, which also works on 'boys_data' and
# illustrates some more advanced functionalities of all three functions in
# miceExt: This time we only want to post-process the column block
# ('gen','phb'), while weighting the first of these columns twice as much as
# the second. Within the matching, we want to avoid matrix computations by
# using the Euclidean distance to determine the donor pool, and we want to draw
# from three donors only.
#------------------------------------------------------------------------------
# Binarize first, we specify blocks in list format with a single block, so we
# can omit an enclosing list. Similarly, we also specify weights in list format.
# Both blocks and weights will be expanded and can be accessed from the output
# to use them in mice.post.matching() later
boys_bin <- mice.binarize(boys_data,
                          blocks = c("gen", "phb"),
                          weights = c(2,1))
# Run mice on binarized data, again use the output predictor matrix from
# mice.binarize()
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
# Post-process the binarized columns. We use blocks and weights from the previous
# output, and set 'distmetric' and 'donors' as announced in the example
# description
post_boys <- mice.post.matching(mids_boys,
                                blocks = boys_bin$blocks,
                                weights = boys_bin$weights,
                                distmetric = "euclidian",
                                donors = 3L)
# Finally, we can retransform to the original format
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
## End(Not run)