mice.binarize: Binarize Factor Columns in Data Frames



Description

This function replaces factor columns in data frames in-place by a set of binary columns which represent the so-called one-hot encoding of this factor. More precisely, a column of a factor with n levels will be transformed into a set of n binary columns, each representing exactly one category of the original factor. Hence, the value 1 occurs in a column if and only if the original factor had the value corresponding to that column.
In addition, the function returns a weights vector and a predictor matrix that fit the binarized data frame. The weights vector should be passed to mice.post.matching(): it prevents a large set of binary columns belonging to one factor from outweighing the other columns within an imputation block. The predictor matrix, when passed to mice(), ensures that binary columns belonging to the same factor do not predict each other.
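As an illustration of the one-hot encoding described above, the following base R sketch expands a toy factor into one binary column per level. It is not the code used inside mice.binarize(); the column naming scheme and the choice to keep missing values as NA in every dummy column are assumptions made for this example.

```r
# Toy factor with three levels and one missing value
gen <- factor(c("G1", "G3", NA, "G2", "G1"), levels = c("G1", "G2", "G3"))

# One binary column per level: 1 where the factor equals that level, else 0.
# Comparing NA with a level yields NA, so a missing value stays missing in
# every dummy column (assumed behaviour for this sketch).
one_hot <- sapply(levels(gen), function(lv) as.integer(gen == lv))

# Hypothetical naming scheme for the dummy columns
colnames(one_hot) <- paste0("gen.", levels(gen))

one_hot
```

Every fully observed row contains exactly one 1, which is the property the post-processing step has to restore after imputation.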

Usage

mice.binarize(data, include_ordered = TRUE, include_observed = FALSE,
  cols = NULL, blocks = NULL, weights = rep(1, ncol(data)),
  pred_matrix = (1 - diag(1, ncol(data))))

Arguments

data

Matrix or data frame that contains factor columns which we want to convert into an equivalent set of binary columns.

include_ordered

Logical variable indicating whether we also want to transform ordered factors. Default is TRUE.

include_observed

Logical variable indicating whether we also want to transform factor columns in which all values are observed. Default is FALSE.

cols

Numerical vector corresponding to the indices or character vector corresponding to the names of factor columns which we want to transform. By default, its value is NULL, indicating that the algorithm automatically identifies all factor columns that are to be binarized. If however the user specifies its value, the function exclusively transforms the specified columns, ignoring the values of the previous optional parameters.

blocks

Optional vector or list of vectors specifying the column tuples that are to be imputed blockwise later when running mice.post.matching(). If this parameter is specified, the function will detect all factor columns within this argument and binarize them, while also producing a corresponding expanded block vector that is returned in the output of this function. In particular, whenever this parameter is specified, the values of the previous three parameters are ignored. For a detailed explanation of the valid formats of this parameter, see the section Input Formats below.
The default of this parameter is blocks = NULL, in which case the output element blocks will also be NULL.

weights

Optional numeric vector or list of numeric vectors that assigns weights to the columns in the data, in particular to those that are going to be imputed blockwise. If specified, the input is transformed into vector format (if necessary), and entries that correspond to binarized factor columns are expanded as well. Within this expansion, every weight that belongs to a factor column is divided by the number of levels of that factor. This ensures a balanced weighting in the later matching process within mice.post.matching(): if a factor column with many levels shares a block with factor columns that have far fewer levels, or with single numeric columns, the predictive means of its dummy columns would otherwise have much more impact on the matching than the predictive means of the other column(s), simply because the latter would be outnumbered.
The default of this parameter is weights = rep(1, ncol(data)), which initially assigns the weight 1 to all columns before expanding them and reducing the weight of binarized columns as explained above. In any case, it is strongly recommended to use the transformed output weights as the weights parameter in mice.post.matching().
Note that specifying this parameter in list format is only allowed if blocks has been specified.

pred_matrix

A custom predictor matrix relating to input data, which will get transformed into the format that fits to the binarized output data frame. The result of this transformation will be stored in the pred_matrix element of the output and should then be used as the predictorMatrix parameter in mice() to ensure that binary columns relating to the same factor column in the original data do not predict each other, yielding cleaner imputation models. If not specified, the default is the massive imputation predictor matrix.
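The expansion of weights and pred_matrix described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the toy column layout, the dummy column names and the helper objects n_levels, origin and new_names are hypothetical.

```r
# Toy layout: 'hc' is numeric, 'gen' is a factor with 3 levels and 'phb' a
# factor with 2 levels (level counts are made up for this sketch).
# A count of 1 means "column is not binarized".
n_levels <- c(hc = 1, gen = 3, phb = 2)

# For each output column, remember which input column it came from
origin <- rep(seq_along(n_levels), times = n_levels)

# Hypothetical names for the output columns
new_names <- unlist(lapply(names(n_levels), function(nm) {
  if (n_levels[[nm]] == 1) nm else paste0(nm, ".", seq_len(n_levels[[nm]]))
}))

# Weights: start with 1 per column, divide by the number of levels, then
# repeat the result once per dummy column
weights_in  <- rep(1, length(n_levels))
weights_out <- (weights_in / n_levels)[origin]
names(weights_out) <- new_names

# Predictor matrix: start from the default 1 - diag(1, ncol(data)) and index
# rows and columns by origin. Dummy columns of the same factor share one
# origin, so they inherit the zero on the diagonal and never predict each other.
pred_in  <- 1 - diag(1, length(n_levels))
pred_out <- pred_in[origin, origin]
dimnames(pred_out) <- list(new_names, new_names)

weights_out
pred_out
```

In this sketch the dummy weights of each factor sum to the weight of the original column, so a factor with many levels carries the same total weight as a single numeric column in the same block.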

Value

List containing the following five elements:

data

The binarized data frame.

par_list

A list containing the original data frame as well as some parameters with further information on the transformation. This list is needed to retransform the (possibly imputed) data at a later stage via the mice.factorize() function, and should not be edited by the user under any circumstances. Besides the original data, the most notable element of this list is "dummy_cols", which is itself a list of the column tuples that correspond to the transformed factor columns of the original data set, and can therefore be used directly as input for mice.post.matching().

blocks

If input parameter blocks has been specified, an expanded version of that input is returned in vector format via this element. It should be used as input parameter blocks in mice.post.matching() later, after imputing the binarized data via mice() first.

weights

Transformed version of input parameter weights in vector format that should be used as input parameter weights in mice.post.matching() later, after imputing the binarized data via mice() first. If input parameter weights has not been specified, a default vector is still going to be output.

pred_matrix

Transformed version of input pred_matrix that should be used as the input argument predictorMatrix of mice().

Input Formats

Within mice.binarize() and mice.post.matching(), there are two formats that can be used to specify the input parameters blocks and weights, namely the list format and the vector format. The basic idea behind the list format is that we exclusively specify parameters for those column blocks that we want to impute on and summarize them in a list where each element is a vector that represents one such block, while in the vector format we use a single vector in which each element represents one column in the data, and therefore also specify information for columns that we do not want to impute on. The exact use of these two formats on both the blocks and the weights parameter will be illustrated in the following.

blocks

1. List Format
To specify the imputation blocks using the list format, a list of atomic vectors has to be passed to the blocks parameter, and each vector in this list represents a column block that has to be imputed on. These vectors can either contain the names of the columns in this block as character strings or the corresponding column indices in numerical format. Note that this input list must not contain any duplicate columns among and within its elements. If there is only a single block to impute on, a single atomic vector representing this tuple may also be passed to blocks.

Example:
Within this and the following examples of this section, we consider the boys_data data set which contains 9 columns, out of which the column tuples (hgt, bmi) and (hc, gen, phb) have identical missing value patterns, while the columns gen and phb are also categorical. (check ?boys_data for further details on the data).
If we now wanted to specify these blocks in list format, we would have to write
blocks = list(c("hgt","bmi"), c("hc", "gen", "phb"))
or analogously, when using column indices instead,
blocks = list(c(2,4), c(5,6,7)).

2. Vector Format
If we want to specify the imputation blocks via the vector format, a single vector with as many elements as there are columns in the data has to be used. Each element of this vector assigns a block number to its corresponding column, and columns that have the same number are imputed together, while columns that are not to be imputed have to carry the value 0. All block numbers have to be integral, starting from 1, and the total number of imputation blocks has to be the maximum block number.

Example:
Again we want to blockwise impute the column tuples (hgt, bmi) and (hc, gen, phb) from boys_data. To specify these blocks via the vector notation, we assign block number 1 to the columns of the first tuple and block number 2 to the columns of the second tuple, while all other columns have block number 0. Hence, we would pass
blocks = c(0,1,0,1,2,2,2,0,0)
to mice.post.matching().
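The correspondence between the two formats can be made explicit with a short base R sketch that turns the list from the first example into the vector from the second. The positions of hgt, bmi, hc, gen and phb follow the examples above; the names given to the remaining four columns are assumptions made for illustration.

```r
# Column names of the data; names at positions 1, 3, 8 and 9 are assumed
col_names <- c("age", "hgt", "wgt", "bmi", "hc", "gen", "phb", "tv", "reg")

# Blocks in list format, as in the first example
blocks_list <- list(c("hgt", "bmi"), c("hc", "gen", "phb"))

# Start with block number 0 (not imputed) for every column, then write the
# position of each list element into the entries of its columns
blocks_vec <- integer(length(col_names))
for (b in seq_along(blocks_list)) {
  blocks_vec[match(blocks_list[[b]], col_names)] <- b
}

blocks_vec
# 0 1 0 1 2 2 2 0 0
```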

weights

1. List Format
To specify the imputation weights using the list format, the corresponding imputation blocks must have been specified as well. In this case, the weights list has to have the same length as blocks, and each of its elements is a numeric vector of the same length as its corresponding column block, thereby assigning each of its columns a (strictly positive) weight. If we do not want to apply weights to a particular tuple, we can write a single 0, 1 or NULL in the corresponding spot of the list.

Example:
In our example, we want to assign the hgt column a 1.5 times higher weight than bmi, while not assigning any weights to the second tuple at all. To achieve that, we specify
weights = list(c(3,2), NULL).

2. Vector Format
When specifying the imputation weights via the vector format, once again a single vector with as many elements as there are columns in the data has to be used, in which each element assigns a weight to its corresponding column. Weights of columns that are not imputed have no effect, while in every block to which no weights should be applied, all columns should carry the same value. In general, the value 1 should be used as the standard weight for all columns that are either not imputed or not weighted.

Example:
We want to assign the same weights to the first tuple as in the previous example, while again not assigning weights to the second block. In this case, we would use the vector
weights = c(1,3,1,2,1,1,1,1,1)
in which all the values of the unimputed and unweighted columns are 1.

Internally, mice.post.matching() converts both parameters into the vector format, as this works best within the main iteration over all columns of the data. Hence, the output elements blocks and weights are also in vector format.
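A conversion of the weights from list format to vector format could look like the following base R sketch. It only illustrates how the two weights examples above relate to each other and is not taken from the package source; in particular, treating every NULL or length-one entry as "unweighted" is an assumption based on the description of the list format.

```r
# Blocks in vector format and weights in list format, as in the examples
blocks_vec   <- c(0, 1, 0, 1, 2, 2, 2, 0, 0)
weights_list <- list(c(3, 2), NULL)

# Standard weight 1 for every column
weights_vec <- rep(1, length(blocks_vec))

for (b in seq_along(weights_list)) {
  w <- weights_list[[b]]
  # NULL or a single 0/1 marks an unweighted block (assumed rule): keep the 1s
  if (!is.null(w) && length(w) > 1) {
    # Weights are matched to the block's columns in left-to-right order
    weights_vec[blocks_vec == b] <- w
  }
}

weights_vec
# 1 3 1 2 1 1 1 1 1
```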

Author(s)

Tobias Schumacher, Philipp Gaffert

See Also

mice.factorize, mice.post.matching, mice

Examples

#------------------------------------------------------------------------------
# first set of examples illustrating basic functionality
#------------------------------------------------------------------------------

# binarize all factor columns in boys_data that contain NAs
boys_bin <- mice.binarize(boys_data)

# binarize only column 'gen' in boys_data
boys_bin <- mice.binarize(boys_data, cols = c("gen"))

# binarize all factor columns within the blocks ('hgt','bmi') and ('gen','phb')
# to impute on these (binarized) blocks later
boys_bin <- mice.binarize(boys_data, blocks = list(c("hgt","bmi"), c("gen", "phb")))

# read out binarized data
boys_bin$data



## Not run: 
#------------------------------------------------------------------------------
# Examples illustrating the combined usage of blocks and weights, relating to
# the examples in the input format section above. As before, we want to impute  
# on the column tuples ('hgt','bmi') and ('hc','gen','phb') from boys_data, while
# assigning weights to the first block, in which 'hgt' gets a 1.5 times
# higher weight than 'bmi'. The second tuple is not weighted. 
#------------------------------------------------------------------------------

## Now there are four options to specify the blocks and weights:

# First option: specify blocks and weights in list format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c("hgt","bmi"), c("hc","gen","phb")),
                          weights = list(c(3,2), NULL))

# or equivalently, using column indices:
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c(2,4), c(5,6,7)),
                          weights = list(c(3,2), NULL))

# Second option: specify blocks and weights in vector format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = c(0,1,0,1,2,2,2,0,0),
                          weights = c(1,3,1,2,1,1,1,1,1))

# Third option: specify blocks in list format and weights in vector format
boys_bin <- mice.binarize(data = boys_data,
                          blocks = list(c("hgt","bmi"), c("hc","gen","phb")),
                          weights = c(1,3,1,2,1,1,1,1,1))

# Fourth option: specify blocks in vector format and weights in list format.
# Note that the block number determines which tuple in the weights list it
# corresponds to, and within each tuple in the list the weight correspondence is
# determined by the left-to-right order of the data columns
boys_bin <- mice.binarize(data = boys_data,
                          blocks = c(0,1,0,1,2,2,2,0,0),
                          weights = list(c(3,2), NULL))

# check expanded blocks vector
boys_bin$blocks

# check expanded weights vector
boys_bin$weights



#------------------------------------------------------------------------------
# Example that illustrates the combined functionalities of mice.binarize(),
# mice.factorize() and mice.post.matching() on the data set 'boys_data', which
# contains the column blocks ('hgt','bmi') and ('hc','gen','phb') that have
# identical missing value patterns, and out of which the columns 'gen' and
# 'phb' are factors. We are going to impute both tuples blockwise, while
# binarizing the factor columns first. Note that we never need to specify any
# blocks or columns to binarize, as these are all determined automatically 
#------------------------------------------------------------------------------

# By default, mice.binarize() expands all factor columns that contain NAs,
# so the columns 'gen' and 'phb' are automatically binarized
boys_bin <- mice.binarize(boys_data)

# Run mice on binarized data, note that we need to use boys_bin$data to grab
# the actual binarized data and that we use the output predictor matrix
# boys_bin$pred_matrix which is recommended for obtaining better imputation
# models
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)

# It is very likely that mice imputed multiple ones among one set of dummy
# variables, so we need to post-process. As recommended, we also use the output
# weights from mice.binarize(), which yield a more balanced weighting on the
# column tuple ('hc','gen','phb') within the matching. As in previous examples,
# both tuples are automatically discovered and imputed on
post_boys <- mice.post.matching(mids_boys, weights = boys_bin$weights)

# Now we can safely retransform to the original data, with non-binarized
# imputations
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)

# Analyze the distribution of imputed variables, e.g. of the column 'gen',
# using the mice version of with()
with(res_boys, table(gen))



#------------------------------------------------------------------------------
# Similar example to the previous, that also works on 'boys_data' and
# illustrates some more advanced functionalities of all three functions in miceExt:
# This time we only want to post-process the column block ('gen','phb'), while
# weighting the first of these columns twice as much as the second. Within the
# matching, we want to avoid matrix computations by using the Euclidean distance
# to determine the donor pool, and we want to draw from three donors only.
#------------------------------------------------------------------------------

# Binarize first, we specify blocks in list format with a single block, so we 
# can omit an enclosing list. Similarly, we also specify weights in list format.
# Both blocks and weights will be expanded and can be accessed from the output
# to use them in mice.post.matching() later
boys_bin <- mice.binarize(boys_data, 
                         blocks = c("gen", "phb"), 
                         weights = c(2,1))

# Run mice on binarized data, again use the output predictor matrix from
# mice.binarize()
mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)

# Post-process the binarized columns. We use blocks and weights from the previous
# output, and set 'distmetric' and 'donors' as announced in the example
# description
post_boys <- mice.post.matching(mids_boys,
                               blocks = boys_bin$blocks,
                               weights = boys_bin$weights,
                               distmetric = "euclidian",
                               donors = 3L)

# Finally, we can retransform to the original format
res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)

## End(Not run)

miceExt documentation built on March 18, 2018, 1:18 p.m.