View source: R/mice.binarize.R
This function replaces factor columns in a data frame in place with a set of
binary columns that represent the one-hot encoding of each factor. More
precisely, a factor column with n levels is transformed into a set of n binary
columns, each representing exactly one category of the original factor. Hence,
the value 1 occurs in a column if and only if the original factor had the value
corresponding to that column.
In addition, this function returns a weights vector and a predictor matrix
that match the binarized data frame. The weights vector is recommended as
input for mice.post.matching(), as it avoids overweighting a large set of
binary columns relating to one factor against other columns within one
imputation block. The predictor matrix, when used as input parameter in
mice(), ensures that binary columns relating to the same factor do not predict
each other.
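As a rough illustration of the encoding described above, the following base-R sketch builds the binary columns for a small, made-up factor. It does not call mice.binarize() and is not the package's implementation; the vector gen, its levels, and the way NA is carried through (all binary columns stay NA) are assumptions of this sketch.

```r
# Hypothetical toy factor with 3 levels and one missing value
gen <- factor(c("G1", "G2", NA, "G1", "G3"), levels = c("G1", "G2", "G3"))

# One binary column per level: 1 if the observation has that level, else 0.
# Comparing NA with a level yields NA, so missing rows stay missing.
one_hot <- sapply(levels(gen), function(lv) as.integer(gen == lv))

one_hot
```

Each fully observed row contains exactly one 1, which is the property that the later post-processing step is meant to restore after imputation.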
data |
Matrix or data frame that contains factor columns which we want to convert into an equivalent set of binary columns. |
include_ordered |
Logical variable indicating whether we also want to
transform ordered factors. Default is |
include_observed |
Logical variable indicating whether we also want to
transform factor columns in which all values are observed. Default is
|
cols |
Numerical vector corresponding to the indices or character vector
corresponding to the names of factor columns which we want to transform. By
default, its value is |
blocks |
Optional vector or list of vectors specifying the column tuples
that are to be imputed on blockwise later when running
|
weights |
Optional numeric vector or list of numeric vectors that
allocates weights to the columns in the data, in particular on those that
are going to be imputed on blockwise. If specified, the input will be
transformed into vector format (if necessary), and columns that correspond
to factor columns that are binarized will also be expanded. Within this
expansion, all weights that belong to factor columns are thereby also
divided by the number of levels of their corresponding factor column. This
is done to ensure a balanced weighting in the later matching process within
|
pred_matrix |
A custom predictor matrix relating to input |
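The weight expansion described for the weights argument can be sketched in base R as follows. This is only an illustration of the arithmetic (dividing a factor column's weight by its number of levels and repeating it once per binary column), not the package's code; the column names x and f, their weights, and the level count are made up.

```r
# Hypothetical column weights: 'x' is numeric, 'f' is a factor with 3 levels
weights  <- c(x = 2, f = 3)
n_levels <- c(x = 1, f = 3)   # non-factor columns count as a single column

# Divide each weight by the number of resulting columns, then repeat it
# once per resulting column
expanded <- rep(weights / n_levels, times = n_levels)

expanded
```

The weights of the three binary columns derived from f add up to the original weight of f, so the factor as a whole carries the same weight as before relative to x.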
List containing the following five elements:
data
The binarized data frame.
par_list
A list containing the original data frame as well as some parameters with
further information on the transformation. This list is needed to retransform
the (possibly imputed) data at a later stage via the mice.factorize()
function, and should not be edited by the user under any circumstances. Next
to the original data, the most notable element of this list is "dummy_cols",
which itself is a list of the column tuples that correspond to the transformed
factor columns from the original data set, and can therefore be passed on as
input for mice.post.matching().
blocks
If input parameter blocks has been specified, an expanded version of that
input is returned in vector format via this element. It should be used as
input parameter blocks in mice.post.matching() later, after imputing the
binarized data via mice() first.
weights
Transformed version of input parameter weights in vector format that should
be used as input parameter weights in mice.post.matching() later, after
imputing the binarized data via mice() first. If input parameter weights has
not been specified, a default vector is still returned.
pred_matrix
Transformed version of input pred_matrix that should be used as the input
argument predictorMatrix of mice().
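To make the role of the returned predictor matrix more concrete, here is a small base-R sketch of a matrix in which binary columns of the same factor do not predict each other. It follows the usual mice convention (rows are targets, columns are predictors, 1 means "use as predictor"), but the column names and the dummy_cols grouping are invented for illustration and the matrix that mice.binarize() actually returns may differ.

```r
# Hypothetical binarized columns: one numeric column and three dummies of 'gen'
cols       <- c("hgt", "gen.1", "gen.2", "gen.3")
dummy_cols <- list(c("gen.1", "gen.2", "gen.3"))

# Start from a full predictor matrix with a zero diagonal
pred <- matrix(1, nrow = length(cols), ncol = length(cols),
               dimnames = list(cols, cols))
diag(pred) <- 0

# Switch off predictions between dummies that stem from the same factor
for (grp in dummy_cols) {
  pred[grp, grp] <- 0
}

pred
```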
Within mice.binarize() and mice.post.matching(), there are two formats that
can be used to specify the input parameters blocks and weights, namely the
list format and the vector format.
In the list format, we specify parameters only for those column blocks that we
want to impute on and collect them in a list in which each element is a vector
representing one such block. In the vector format, we use a single vector in
which each element represents one column of the data, and therefore also
specify information for columns that we do not want to impute on. The use of
these two formats for both the blocks and the weights parameter is illustrated
in the following.
blocks
1. List Format
To specify the imputation blocks using the list format, a list of atomic
vectors has to be passed to the blocks parameter, where each vector in this
list represents a column block that is to be imputed on. These vectors can
contain either the names of the columns in this block as character strings or
the corresponding column indices in numerical format. Note that this input
list must not contain any duplicate columns among or within its elements. If
there is only a single block to impute on, a single atomic vector representing
this tuple may also be passed to blocks.
Example:
Within this and the following examples of this section, we consider the
boys_data data set, which contains 9 columns, of which the column tuples
(hgt, bmi) and (hc, gen, phb) have identical missing value patterns, while the
columns gen and phb are also categorical (see ?boys_data for further details
on the data).
If we now wanted to specify these blocks in list format, we would have to
write
blocks = list(c("hgt","bmi"), c("hc", "gen", "phb"))
or analogously, when using column indices instead,
blocks = list(c(2,4), c(5,6,7))
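The requirement that a blocks list must not contain duplicate columns can be checked up front with a few lines of base R. The helper has_duplicates() below is a hypothetical name introduced for this sketch and is not part of miceExt.

```r
# Hypothetical helper: TRUE if any column appears more than once,
# either within one block or across blocks
has_duplicates <- function(blocks) {
  anyDuplicated(unlist(blocks)) > 0
}

valid_blocks   <- list(c("hgt", "bmi"), c("hc", "gen", "phb"))
invalid_blocks <- list(c("hgt", "bmi"), c("bmi", "hc"))

has_duplicates(valid_blocks)
has_duplicates(invalid_blocks)
```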
2. Vector Format
If we want to specify the imputation blocks via the vector format, a single
vector with as many elements as there are columns in the data has to be used.
Each element of this vector assigns a block number to its corresponding
column, and columns that have the same number are imputed together, while
columns that are not to be imputed have to carry the value 0. All block
numbers have to be integers, starting from 1, and the total number of
imputation blocks has to equal the maximum block number.
Example:
Again we want to blockwise impute the column tuples (hgt, bmi) and
(hc, gen, phb) from boys_data. To specify these blocks via the vector
notation, we assign block number 1 to the columns of the first tuple and block
number 2 to the columns of the second tuple, while all other columns have
block number 0. Hence, we would pass
blocks = c(0,1,0,1,2,2,2,0,0)
to mice.post.matching().
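The relation between the two blocks formats can be sketched in base R. The column positions of hgt, bmi, hc, gen and phb follow the indices given above; the names used for the remaining columns are an assumption of this sketch, and the loop only illustrates the idea rather than reproducing the package's internal conversion.

```r
# Column names in data order; names at positions 1, 3, 8 and 9 are assumed
col_names   <- c("age", "hgt", "wgt", "bmi", "hc", "gen", "phb", "tv", "reg")
blocks_list <- list(c("hgt", "bmi"), c("hc", "gen", "phb"))

# Start with 0 (not imputed) everywhere, then write the block number
# into the positions of the columns that belong to each block
blocks_vec <- integer(length(col_names))
for (i in seq_along(blocks_list)) {
  blocks_vec[col_names %in% blocks_list[[i]]] <- i
}

blocks_vec
```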
weights
1. List Format
To specify the imputation weights using the list format, the corresponding
imputation blocks must have been specified in list format as well. In this
case, the weights list has to be the same length as blocks, and each of its
elements has to be a numeric vector of the same length as its corresponding
column block, thereby assigning each of its columns a (strictly positive)
weight. If we do not want to apply weights to a particular tuple, we can write
a single 0, 1 or NULL in the corresponding spot of the list.
Example:
In our example, we want to assign the hgt column a 1.5 times higher weight
than bmi, while not assigning any weights to the second tuple at all. To
achieve that, we specify
weights = list(c(3,2), NULL)
2. Vector Format
When specifying the imputation weights via the vector format, once again a
single vector with as many elements as there are columns in the data has to be
used, in which each element assigns a weight to its corresponding column.
Weights of columns that are not imputed on have no effect, while in all blocks
to which weights should not be applied, each column should carry the same
value. In general, the value 1 should be used as the standard weight for all
columns that either are not imputed on or have no weights applied.
Example:
We want to assign the same weights to the first tuple as in the previous
example, while again not assigning weights to the second block. In this case,
we would use the vector
weights = c(1,3,1,2,1,1,1,1,1)
in which all the values of the unimputed and unweighted columns are 1.
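How a weights list lines up with a blocks vector can be sketched in base R as follows. This is an illustrative reconstruction of the correspondence described above (block number selects the list element, weights are assigned in left-to-right column order), not the package's own conversion routine.

```r
blocks_vec   <- c(0, 1, 0, 1, 2, 2, 2, 0, 0)
weights_list <- list(c(3, 2), NULL)   # second block is left unweighted

# Default weight 1 for every column
weights_vec <- rep(1, length(blocks_vec))

for (i in seq_along(weights_list)) {
  w <- weights_list[[i]]
  # A single 0, 1 or NULL means "no weighting" for this block
  if (!is.null(w) && length(w) > 1) {
    weights_vec[blocks_vec == i] <- w
  }
}

weights_vec
```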
Internally, mice.post.matching() converts both parameters into the vector
format, as this works best within the main iteration over all columns of the
data. Hence, the output parameters blocks and weights are also in vector
format.
Tobias Schumacher, Philipp Gaffert
mice.factorize, mice.post.matching, mice
#------------------------------------------------------------------------------
# first set of examples illustrating basic functionality
#------------------------------------------------------------------------------
# binarize all factor columns in boys_data that contain NAs
boys_bin <- mice.binarize(boys_data)
# binarize only column 'gen' in boys_data
boys_bin <- mice.binarize(boys_data, cols = c("gen"))
# binarize all factor columns with the blocks ('hgt','bmi') and ('gen','phb')
# to impute on these (binarized) blocks later
boys_bin <- mice.binarize(boys_data, blocks = list(c("hgt","bmi"), c("gen", "phb")))
# read out binarized data
boys_bin$data
## Not run:
#------------------------------------------------------------------------------
# Examples illustrating the combined usage of blocks and weights, relating to
# the examples in the input format section above. As before, we want to impute
# on the column tuples ('hgt','bmi') and ('hc','gen','phb') from boys_data, while
# assigning weights to the first block, in which 'hgt' gets a 1.5 times
# higher weight than 'bmi'. The second tuple is not weighted.
#------------------------------------------------------------------------------
## Now there are four options to specify the blocks and weights:
# First option: specify blocks and weights in list format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c("hgt","bmi"), c("hc","gen","phb")),
                          weights = list(c(3,2), NULL))
# or equivalently, using column indices:
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c(2,4), c(5,6,7)),
                          weights = list(c(3,2), NULL))
# Second option: specify blocks and weights in vector format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = c(0,1,0,1,2,2,2,0,0),
                          weights = c(1,3,1,2,1,1,1,1,1))
# Third option: specify blocks in list format and weights in vector format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c("hgt","bmi"), c("hc","gen", "phb")),
                          weights = c(1,3,1,2,1,1,1,1,1))
# Fourth option: specify blocks in vector format and weights in list format.
# Note that the block number determines which tuple in the weights list it
# corresponds to, and within each tuple in the list the weight correspondence is
# determined by the left-to-right order of the data columns
boys_bin <- mice.binarize(data = boys_data,
                          blocks = c(0,1,0,1,2,2,2,0,0),
                          weights = list(c(3,2), NULL))
# check expanded blocks vector
boys_bin$blocks
# check expanded weights vector
boys_bin$weights
#------------------------------------------------------------------------------
# Example that illustrates the combined functionalities of mice.binarize(),
# mice.factorize() and mice.post.matching() on the data set 'boys_data', which
# contains the column blocks ('hgt','bmi') and ('hc','gen','phb') that have
# identical missing value patterns, and out of which the columns 'gen' and
# 'phb' are factors. We are going to impute both tuples blockwise, while
# binarizing the factor columns first. Note that we never need to specify any
# blocks or columns to binarize, as these are all determined automatically
#------------------------------------------------------------------------------
# By default, mice.binarize() expands all factor columns that contain NAs,
# so the columns 'gen' and 'phb' are automatically binarized
boys_bin <- mice.binarize(boys_data)
# Run mice on binarized data, note that we need to use boys_bin$data to grab
# the actual binarized data and that we use the output predictor matrix
# boys_bin$pred_matrix which is recommended for obtaining better imputation
# models
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
# It is very likely that mice imputed multiple ones among one set of dummy
# variables, so we need to post-process. As recommended, we also use the output
# weights from mice.binarize(), which yield a more balanced weighting on the
# column tuple ('hc','gen','phb') within the matching. As in previous examples,
# both tuples are automatically discovered and imputed on
post_boys <- mice.post.matching(mids_boys, weights = boys_bin$weights)
# Now we can safely retransform to the original data, with non-binarized
# imputations
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
# Analyze the distribution of imputed variables, e.g. of the column 'gen',
# using the mice version of with()
with(res_boys, table(gen))
#------------------------------------------------------------------------------
# Similar example to the previous one that also works on 'boys_data' and
# illustrates some more advanced functionalities of all three functions in miceExt:
# This time we only want to post-process the column block ('gen','phb'), while
# weighting the first of these columns twice as much as the second. Within the
# matching, we want to avoid matrix computations by using the Euclidean distance
# to determine the donor pool, and we want to draw from three donors only.
#------------------------------------------------------------------------------
# Binarize first, we specify blocks in list format with a single block, so we
# can omit an enclosing list. Similarly, we also specify weights in list format.
# Both blocks and weights will be expanded and can be accessed from the output
# to use them in mice.post.matching() later
boys_bin <- mice.binarize(boys_data,
                          blocks = c("gen", "phb"),
                          weights = c(2,1))
# Run mice on binarized data, again use the output predictor matrix from
# mice.binarize()
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
# Post-process the binarized columns. We use blocks and weights from the previous
# output, and set 'distmetric' and 'donors' as announced in the example
# description
post_boys <- mice.post.matching(mids_boys,
                                blocks = boys_bin$blocks,
                                weights = boys_bin$weights,
                                distmetric = "euclidian",
                                donors = 3L)
# Finally, we can retransform to the original format
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
## End(Not run)