R/mice.post.matching.r

Defines functions mice.post.matching

Documented in mice.post.matching

#'Post-processing of Imputed Data by Multivariate Predictive Mean Matching
#'
#'Performs multivariate predictive mean matching (PMM) on a set of columns that
#'have been imputed on by the functionalities of the \code{mice} package.
#'Also offers a functionality to match imputations against observed variables.
#'
#'@param obj \code{mice::mids} object that has been returned by a previous run
#'  of \code{mice()} or \code{mice.mids()} and whose imputations we want to
#'  post-process.
#'@param blocks Vector or list of vectors specifying the column tuples that are
#'  to be imputed on blockwise. For a detailed explanation about the valid
#'  formats of this parameter, see section \strong{\emph{input formats}} below.
#'  Each element of each specified block has to be included in the
#'  \code{visitSequence} of the given \code{mids} object, and the imputation
#'  method that was applied on each of these columns has to be either \code{pmm}
#'  or \code{norm}. \cr Univariate blocks are also allowed if they have been
#'  imputed on via \code{mice.impute.norm()}, or if we want to match them
#'  against an observed variable that is specified in the parameter
#'  \code{match_vars}. \cr The default is \code{blocks = NULL}, which tells this
#'  function to automatically determine and impute on column blocks that have
#'  identical missing value patterns.
#'@param donors Integer or integer vector indicating the size of the donor pool
#'  among which a draw in the matching step is made. If only a single number of
#'  donors is specified, it is applied on all blocks, otherwise this parameter
#'  needs to be a vector with as many elements as there blocks, specifying for
#'  each of these blocks how many donors are to be drawn from. \cr
#'  The default is \code{donors = 5L}. Setting \code{donors = 1L} always selects
#'  the closest match (nearest neighbor), but is not recommended.
#'@param weights Numeric vector or list of numeric vectors that allocates
#'  weights to the columns of each block, giving us the possibility to punish or
#'  mitigate differences in certain columns of the data when computing the
#'  distances in the matching step to determine the donor pool. Further details
#'  on the valid formats of this parameter can be found below in section
#'  \strong{\emph{input formats}}. \cr
#'  The default is \code{weights = NULL}, meaning that no weights should be
#'  applied to any column at all. \cr
#'  Note that in order to avoid any ambiguities, specifying this parameter in
#'  list format is only allowed if blocks have been explicitly specified and NOT
#'  automatically determined.
#'@param distmetric Character string or character vector that determines which
#'  mathematical metric we want to use to compute the distances between the
#'  multivariate \code{y_obs} and \code{y_mis}. If only a single
#'  metric is specified, it is applied on all blocks, otherwise this parameter
#'  needs to be a vector with as many elements as there blocks, specifying for
#'  each of these blocks which metric there is to use. \cr
#'  The options are \code{"euclidian"}, \code{"manhattan"}, \code{"mahalanobis"}
#'  and \code{"residual"}. The first three options refer to the distance metrics
#'  of the same name, the latter is a variant of the mahalanobis distance in
#'  which we consider the residual covariance \code{cov(y_hat_obs - y_obs)} of
#'  the predicted model. This distance has been proposed and recommended by
#'  Little (1988) and is also the default. Note that the first two options are
#'  faster though, as they do not involve any matrix computations.
#'@param matchtype Integer of integer vector indicating the type of matching
#'  distance to be used in PMM. If only a single matchtype is specified, it is
#'  applied on all blocks, otherwise this parameter needs to be a vector with as
#'  many elements as there blocks, specifying for each of these blocks which
#'  matchtype has to be used. \cr
#'  The default choice (\code{matchtype = 1L}) calculates the distance between
#'  the \emph{predicted} values of \code{y_obs} and the \emph{drawn} values of
#'  \code{y_mis} (called type-1 matching). Other choices are \code{matchtype =
#'  0L} distance between predicted values) and \code{matchtype = 2L} (distance
#'  between drawn values).
#'@param match_vars Vector specifying for each tuple which additional variable
#'  has to be matched against. Can be an integer or character vector, either
#'  specifying column indices or column names. \code{match_vars} must be the
#'  same length as \code{blocks} with \code{0}-elements or \code{""}-elements to
#'  disable this functionality for certain groups. Default is \code{match_vars =
#'  NULL}, which completely disables this functionality.
#'@param ridge The ridge penalty used in an internal call of
#'  \code{mice::.norm.draw()} to prevent problems with multicollinearity. Can be
#'  a single number or a numeric vector. If only a single ridge value is
#'  specified, it is applied on all blocks, otherwise this parameter needs to be
#'  a vector with as many elements as there blocks, specifying for each of these
#'  blocks which ridge value has to be used. \cr
#'  The default is \code{ridge = 1e-05}, which means that 0.01 percent of the
#'  diagonal is added to the cross-product. Larger ridges may result in more
#'  biased estimates. For highly noisy data (e.g. many junk variables), set
#'  \code{ridge = 1e-06} or even lower to reduce bias. For highly collinear
#'  data, set \code{ridge = 1e-04} or higher.
#'@param minvar The minimum variance that we require predictors to have when
#'  building a linear model to compute the \code{y_hat}-values. Can be a single
#'  number or a numeric vector. If only a single value is specified, it is
#'  applied on all blocks, otherwise this parameter needs to be a vector with as
#'  many elements as there blocks, specifying for each of these blocks which
#'  minimum variance will b allowed. The default is \code{minvar = 1e-04}.
#'@param maxcor The maximum correlation that we allow predictors to have when
#'  building a linear model to compute the \code{y_hat}-values.  Can be a single
#'  number or a numeric vector. If only a single value is specified, it is
#'  applied on all blocks, otherwise this parameter needs to be a vector with as
#'  many elements as there blocks, specifying for each of these blocks which
#'  maximum variance will be allowed. The default is \code{maxcor = 0.99}.
#'  
#'  
#'@return List containing the following two elements:
#'\describe{
#'
#'  \item{\strong{\code{midsobj}}}{\code{mice::mids} object that differs from the input object
#'  only in the imputations that have been post-processed, and the \code{call}
#'  and \code{loggedEvents} attributes that have been updated. In particular,
#'  those post-processed imputations are not affecting the \code{chainMean} or
#'  \code{chainVar}-attributes, and hence, \code{plot()} will not consider them
#'  either.}
#'  
#'  \item{\strong{\code{blocks}}}{Set of column blocks in list format that multivariate
#'  imputation has been performed on. It is equivalent to the input parameter
#'  \code{blocks} if it has been specified by the user, otherwise those column
#'  tuples have been determined internally.}
#'}
#'
#'
#'@section Input Formats: 
#'  Within \code{mice.post.matching()} and \code{mice.binarize()}, there are two
#'  formats that can be used to specify the input parameters \code{blocks} and
#'  \code{weights}, namely the \emph{list format} and the \emph{vector format}.
#'  The basic idea behind the list format is that we exclusively specify
#'  parameters for those column blocks that we want to impute on and summarize
#'  them in a list where each element is a vector that represents one such
#'  block, while in the vector format we use a single vector in which each
#'  element represents one column in the data, and therefore also specify
#'  information for columns that we do not want to impute on. The exact use of
#'  these two formats on both the \code{blocks} and the \code{weights} parameter
#'  will be illustrated in the following.
#'  
#'  \describe{
#'  
#'    \item{\strong{\code{blocks}}}{
#'    \strong{1. List Format} \cr
#'    To specify the imputation \code{blocks} using the list format, a list of
#'    atomic vectors has to be passed to the \code{blocks} parameter, and each
#'    vector in this list represents a column block that has to be imputed on.
#'    These vectors can either contain the names of the columns in this block as
#'    character strings or the corresponding column indices in numerical format.
#'    Note that this input list must not contain any duplicate columns among and
#'    within its elements. If there is only a single block to impute on, a
#'    single atomic vector representing this tuple may also be passed to
#'    \code{blocks}.
#'    
#'    \bold{Example:}\cr 
#'    Within this and the following examples of this section, we consider the
#'    \code{mammal_data} data set which contains 11 columns, out of which the
#'    column tuples \code{(sws, ps)} and \code{(mls, gt)} have identical missing
#'    value patterns (check \code{?mammal_data} for further details on the
#'    data).\cr
#'    If we now wanted to specify these blocks in list format, we would have to
#'    write \cr
#'    \code{blocks = list(c("sws","ps"), c("mls", "gt"))} \cr
#'    or analogously, when using column indices instead, \cr
#'     \code{blocks = list(c(4,5), c(7,8))}.
#'
#'    \strong{2. Vector Format} \cr
#'    If we want to specify the imputation \code{blocks} via the vector format,
#'    a single vector with as many elements as there are columns in the data has
#'    to be used. Each element of this vector assigns a block number to its
#'    corresponding column, and columns that have the same number are imputed
#'    together, while columns that are not to be imputed have to carry the value
#'    \code{0}. All block numbers have to be integral, starting from \code{1},
#'    and the total number of imputation blocks has to be the maximum block
#'    number.
#'    
#'    \bold{Example:}\cr 
#'    Again we want to blockwise impute the column tuples \code{(sws, ps)} and
#'    \code{(mls, gt)} from \code{mammal_data}. To specify these blocks via the
#'    vector notation, we assign block number \code{1} to the columns of the
#'    first tuple and group number \code{2} to the columns of the second tuple,
#'    while all other columns have block number \code{0}. Hence, we would pass
#'    \cr
#'    \code{blocks = c(0,0,0,1,1,0,2,2,0,0,0)} \cr
#'    to \code{mice.post.matching()}.
#'    }
#'
#'    \item{\strong{\code{weights}}}{
#'    \strong{1. List Format} \cr 
#'    To specify the imputation \code{weights} using the list format, the
#'    corresponding imputations blocks must have been specified in list format
#'    as well. In this case, the \code{weights} list has to be the same length
#'    as blocks, and each of its elements has a numeric vector of the same
#'    length as its corresponding column block, thereby assigning each of its
#'    columns a (strictly postitive) weight. If we do not want to apply weights
#'    to a single tuple, we can write a single \code{0}, \code{1} or \code{NULL}
#'    in the corresponding spot of the list.
#'    
#'    \bold{Example:}\cr 
#'    In our example, we want to assign the \code{sws} column a \code{1.5} times
#'    higher weight than \code{ps}, while not assigning any values to the second
#'    tuple at all. To achieve that, we specify \cr
#'    \code{weights = list(c(3,2), NULL)}.
#'    
#'    \strong{2. Vector Format} \cr
#'    When specifying the imputation \code{weights} via the vector format, once
#'    again a single vector with as many elements as there are columns in the
#'    data has to be used, in which each element assigns a weight to its
#'    corresponding column. Weights of columns that are not imputed on will have
#'    no effect, while in all blocks that weights should not be applied on, each
#'    column should carry the same value. In general, the value \code{1} should
#'    be used as the standard weight value for all columns that either are not
#'    imputed on or that weights are not applied on.
#'    
#'    \bold{Example:}\cr 
#'    We want to assign the same weights to the first tuple as in the previous
#'    example, while not assigning weights the second block again. In this case,
#'    we would use the vector \cr
#'    \code{weights = c(1,1,1,3,2,1,1,1,1,1,1)} \cr
#'    in which all the values of the unimputed and unweighted columns are
#'    \code{1}.
#'    }
#'  
#'  }
#'  
#'  Internally, \code{mice.post.matching()} converts both parameters into the
#'  list format as this is more natural to iterate on. Hence, the output
#'  \code{blocks} parameter also is in list format.
#'
#'@section Algorithmic Details:
#'  The algorithm basically iterates over the \code{m} imputations of the
#'  input \code{mice::mids} object and each column tuple in \code{blocks}, each
#'  time following these two main steps:
#'  \describe{
#'    \item{\strong{Prediction}}{First, we iterate over the columns of the
#'    current block, collecting the complete column \eqn{y} on which we want to
#'    impute, along with the matrix \eqn{x} of its predictors. Among the values
#'    of eqn{y}, we identify the observed values \eqn{y_{obs}} and the missing
#'    values \code{y_{mis}} along with their corresponding predictors
#'    \eqn{x_{obs}} and \code{x_{mis}}, restricting ourselves to only those
#'    values whose predictors are complete, i.e. don't contain any missing
#'    values themselves. Note that if the sets of predictor variables strongly
#'    vary among all the columns in the current tuple, it may occur that there
#'    are no common predictors in which case the function breaks. \cr
#'    From the observed values and their predictors, we build a Bayesian linear
#'    regression model and, depending on the specified matching type, use the
#'    model's coefficients and drawn regression weights \eqn{\beta} to compute
#'    the predicted means \eqn{\hat{y}_{obs}} and \eqn{\hat{y}_{mis}} which we
#'    then save in a matrix \eqn{\hat{Y}}.}
#'    
#'    \item{\strong{Matching}}{After iterating over the columns in the current
#'    tuple and building the matrix \eqn{\hat{Y}}, we match the multivariate
#'    predictions of the missing values in \eqn{\hat{Y}} against the predictions
#'    of the observed values. More precisely, for each \eqn{\hat{y}_{mis}}, we
#'    perform a k-nearest-neighbor search among all values \eqn{\hat{y}_{obs}},
#'    where \eqn{k} is the number of donors specified by the user, and then
#'    randomly sample one element of those \eqn{k} nearest neighbors which is
#'    then used to impute the missing value tuple in \eqn{y}. The distance
#'    metric that is used to determine the nearest neighbors is specified by the
#'    user as well, and in case of Euclidian or Manhattan distance its
#'    computation is very straightforward. If the Mahalanobis distance or the
#'    residual distance (as proposed by Little) has been selected, we first
#'    compute the corresponding covariance matrix and its (pseudo) inverse via
#'    its eigen decomposition, and then use it to transform the values
#'    \eqn{\hat{y}_{mis}} and \eqn{\hat{y}_{obs}} that are fed into the nearest
#'    neighbor search. If weights have been specified as well, they are also
#'    applied before the kNN-search. \cr
#'    If an additional observed variable to match against has been specified via
#'    \code{match_vars}, the set of predictors and missing values are
#'    partitioned by the values of the external variable first, and then the
#'    matching is performed within each pair of corresponding subsets of that
#'    partition.}
#'
#'    Note that the imputed values are only stored as a result and, other than
#'    in the \code{mice} algorithm, are not used to compute predictive means for
#'    any other missing value. We exclusively use the imputed values that are
#'    provided within the input \code{mids} object.
#'  }
#'
#'@references 
#'  Little, R.J.A. (1988), Missing data adjustments in large surveys (with
#'  discussion), Journal of Business Economics and Statistics, 6, 287--301.
#'
#'  Van Buuren, S. (2012). Flexible Imputation of Missing Data. CRC/Chapman \&
#'  Hall, Boca Raton, FL.
#'
#'  Van Buuren, S., Groothuis-Oudshoorn, K. (2011). \code{mice}: Multivariate
#'  Imputation by Chained Equations in \code{R}. \emph{Journal of Statistical
#'  Software}, \bold{45}(3), 1-67. \url{http://www.jstatsoft.org/v45/i03/}
#'
#'@author Tobias Schumacher, Philipp Gaffert
#'
#'@examples
#'
#'
#'\dontrun{
#'#------------------------------------------------------------------------------
#'# Example on modified 'mammalsleep' data set from mice, that has identical
#'# missing data patterns on the column tuples ('ps','sws') and ('mls','gt')
#'#------------------------------------------------------------------------------
#'
#'# run mice on data set 'mammal_data' and obtain a mids object to post-process
#'mids_mammal <- mice(mammal_data)
#'
#'
#'# run function, as blocks have not been specified, it will automatically detect
#'# the column tuples with identical missing data patterns and then impute on
#'# these
#'post_mammal <- mice.post.matching(mids_mammal)
#'
#'# read out which column tuples have been imputed on
#'post_mammal$blocks
#'
#'# look into imputations within resulting mice::mids object
#'post_mammal$midsobj$imp
#'
#'
#'
#'#------------------------------------------------------------------------------
#'# Example on original 'mammalsleep' data set from mice, in which we
#'# want to post-process the imputations in column 'sws' by only imputing values
#'# from rows whose value in 'pi' matches the value of 'pi' in the row we impute
#'# on.
#'#------------------------------------------------------------------------------
#'
#'# run mice on data set 'mammal_data' and obtain a mids object to post-process
#'mids_mammal <- mice(mammalsleep)
#'
#'
#'# run function, specify 'sws' as the column to impute on, and specify 'pi' as
#'# the observed variable to consider in the matching
#'post_mammal <- mice.post.matching(mids_mammal, blocks = "sws", match_vars = "pi")
#'
#'
#'# look into imputations within resulting mice::mids object
#'post_mammal$midsobj$imp
#'
#'
#'
#'#------------------------------------------------------------------------------
#'# Examples illustrating the combined usage of blocks and weights, relating to
#'# the examples in the input format section above. As before, we want to impute  
#'# on the column tuples ('ps','sws') and ('mls','gt') from mammal_data, while  
#'# this time assigning weights to the first block, in which 'ps' gets a 1.5 times
#'# higher weight than 'sws'. The second tuple is not weighted. 
#'#------------------------------------------------------------------------------
#'
#'# run mice() first
#'mids_mammal <- mice(mammal_data)
#'
#'## Now there are five options to specify the blocks and weights:
#'
#'# First option: specify blocks and weights in list format
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  blocks = list(c("sws","ps"), c("mls","gt")),
#'                                  weights = list(c(3,2), NULL))
#'
#'# or equivalently, using colums indices:                                  
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  blocks = list(c(4,5), c(7,8)),
#'                                  weights = list(c(3,2), NULL))
#'
#'# Second option: specify blocks and weights in vector format
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  blocks = c(0,0,0,1,1,0,2,2,0,0,0),
#'                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
#'
#'# Third option: specify blocks in list format and weights in vector format
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  blocks = list(c("sws","ps"), c("mls","gt")),
#'                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
#'
#'# Fourth option: specify blocks in vector format and weights in list format.
#'# Note that the block number determines which tuple in the weights list it
#'# corresponds to, and within each tuple in the list the weight correspondence is
#'# determinded by left to right order of the data columns
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  blocks = c(0,0,0,1,1,0,2,2,0,0,0),
#'                                  weights = list(c(3,2), NULL))
#'
#'# Fifth option: only specify weights in vector format. If the user knows
#'# beforehand that at least the column tuple he wants to impute and use weights
#'# on have the same missing value patterns, he can assign weights to these using
#'# the vector format, while letting mice.post.matching() find all other blocks
#'# with identical missing value patterns - possibly even more than just
#'# ('ps','sws') and ('mls','gt')
#'post_mammal <- mice.post.matching(obj = mids_mammal,
#'                                  weights = c(1,1,1,3,2,1,1,1,1,1,1))
#'
#'
#'
#'#------------------------------------------------------------------------------
#'# Example that illustrates the combined functionalities of mice.binarize(),
#'# mice.factorize() and mice.post.matching() on the data set 'boys_data', which
#'# contains the column blocks ('hgt','bmi') and ('hc','gen','phb') that have
#'# identical missing value patterns, and out of which the columns 'gen' and
#'# 'phb' are factors. We are going to impute both tuples blockwise, while
#'# binarizing the factor columns first. Note that we never need to specify any
#'# blocks or columns to binarize, as these are all determined automatically 
#'#------------------------------------------------------------------------------
#'
#'# By default, mice.binarize() expands all factor columns that contain NAs,
#'# so the columns 'gen' and 'phb' are automatically binarized
#'boys_bin <- mice.binarize(boys_data)
#'
#'# Run mice on binarized data, note that we need to use boys_bin$data to grab
#'# the actual binarized data and that we use the output predictor matrix
#'# boys_bin$pred_matrix which is recommended for obtaining better imputation
#'# models
#'mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
#'
#'# It is very likely that mice imputed multiple ones among one set of dummy
#'# variables, so we need to post-process. As recommended, we also use the output
#'# weights from mice.binarize(), which yield a more balanced weighting on the
#'# column tuple ('hc','gen','phb') within the matching. As in previous examples,
#'# both tuples are automatically discovered and imputed on
#'post_boys <- mice.post.matching(mids_boys, weights = boys_bin$weights)
#'
#'# Now we can safely retransform to the original data, with non-binarized
#'# imputations
#'res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
#'
#'# Analyze the distribution of imputed variables, e.g. of the column 'gen',
#'# using the mice version of with()
#'with(res_boys, table(gen))
#'
#'
#'
#'#------------------------------------------------------------------------------
#'# Similar example to the previous, that also works on 'boys_data' and
#'# illustrates some more advanced funtionalities of all three functions in miceExt: 
#'# This time we only want to post-process the column block ('gen','phb'), while
#'# weighting the first of these tuples twice as much as the second. Within the
#'# matching, we want to avoid matrix computations by using the euclidian distance
#'# to determine the donor pool, and we want to draw from three donors only.
#'#------------------------------------------------------------------------------
#'
#'# Binarize first, we specify blocks in list format with a single block, so we 
#'# can omit an enclosing list. Similarly, we also specify weights in list format.
#'# Both blocks and weights will be expanded and can be accessed from the output
#'# to use them in mice.post.matching() later
#'boys_bin <- mice.binarize(boys_data, 
#'                          blocks = c("gen", "phb"), 
#'                          weights = c(2,1))
#'
#'# Run mice on binarized data, again use the output predictor matrix from
#'# mice.binarize()
#'mids_boys <- mice(boys_bin$data, predictorMatrix = boys_bin$pred_matrix)
#'
#'# Post-process the binarized columns. We use blocks and weights from the previous
#'# output, and set 'distmetric' and 'donors' as announced in the example
#'# description
#'post_boys <- mice.post.matching(mids_boys,
#'                                blocks = boys_bin$blocks,
#'                                weights = boys_bin$weights,
#'                                distmetric = "euclidian",
#'                                donors = 3L)
#'
#'# Finally, we can retransform to the original format
#'res_boys <- mice.factorize(post_boys$midsobj, boys_bin$par_list)
#'}
#'
#'
#'@seealso \code{\link[mice]{mice}}, \code{\link[mice]{mids-class}},
#'  \code{\link[miceExt]{mice.binarize}}, \code{\link[miceExt]{mice.factorize}}
#'@export
mice.post.matching <- function(obj, blocks = NULL, donors = 5L, weights = NULL, distmetric = "residual", matchtype = 1L, match_vars = NULL, ridge = 1e-05, minvar = 1e-04, maxcor = 0.99)
{

  ##  Check whether all input arguments are valid

  # check whether input mids object is valid
  if (!is.mids(obj)) stop("Argument 'obj' should be of type mids.")

	# check whether input set of column tuples is valid
  # if blocks haven't been specified, look for columns with identical NA distributions, otherwise check whether input blocks are valid
  if(is.null(blocks))
  {
    # use internal helper function to find blocks
    blocks <- find_blocks(obj)
    if(length(blocks) == 0)
      stop("There are no column tuples with identical missing data patterns and valid imputation methods.\n")
    blocks_specified <- FALSE
  }
  else
  {
    # delegate to helper check function
    blocks <- check_blocks(obj, blocks)
    blocks_specified <- TRUE
  }
  
  # check whether match_vars are valid
  match_vars <- check_match_vars(obj, blocks, match_vars)

  # check whether input list/vector of weights is valid and convert it to list format if necessary
  weights <- check_weights(weights, blocks, ncol(obj$data), blocks_specified)
  
  # check validity of other optional parameters and convert/expand them if necessary
  optionals <- list(donors = donors, distmetric = distmetric, matchtype = matchtype, ridge = ridge , minvar = minvar, maxcor = maxcor)
  optionals <- check_optionals(optionals, length(blocks))


  ## Initialize base parameters

  # get size of data frame
  data <- obj$data
  ncols <- ncol(data)
  nrows <- nrow(data)

  # grab function call
  call <- match.call()

  # initialize more base parameters, e.g. make explicit local copies
  where <- obj$where
  r <- !is.na(data)

  # now perform a deep check on whether given blocks and match_vars work on the given data, and retrieve return setup values
  setup <- check_deep(obj, blocks, match_vars)

  partitions_list <- setup$partitions_list
  complete_R <- setup$complete_R
  complete_W <- setup$complete_W

	# initialize list of resulting imputations
	res_imp <- obj$imp

  # grab event log which is needed in some subfunctions of mice
  loggedEvents <- obj$loggedEvents
  state <- list(it = 0, im = 0, co = 0, dep = "", meth = "",
                log = !is.null(loggedEvents))
  if (is.null(loggedEvents))
    loggedEvents <- data.frame(it = 0, im = 0, co = 0, dep = "",
                               meth = "", out = "")



  ######################################################################################################
	#  CENTRAL ITERATION
  ######################################################################################################

  # iterate over each set of imputations
  for(m in 1:obj$m)
  {
    # build current data set consisting of observed and imputed data from current imputation
    for (j in obj$visitSequence)
    {
      wy <- where[, j]
      ry <- r[, j]

      # copy from obj$imp to avoid using filled in values from multivariate match
      data[(!ry) & wy, j] <- obj$imp[[j]][(!ry)[wy], m]
    }

    # iterate over all column tuples
	  for(i in seq_along(blocks))
	  {
	    # grab current column tuple and correspoding weights
	    tuple <- blocks[[i]]
	    dimweights <- weights[[i]]
	    
	    # grab current values of optional parameters
	    donors <- optionals$donors[[i]]
	    distmetric <- optionals$distmetric[[i]]
	    matchtype <- optionals$matchtype[[i]]
	    ridge <- optionals$ridge[[i]]
	    minvar <- optionals$minvar[[i]]
	    maxcor <- optionals$maxcor[[i]]

	    # if tuple is univariate, just perform standard PMM and skip to next tuple
	    if(length(tuple) == 1)
	    {
	      # get current y column
	      y <- data[, tuple]

	      # get current predictors
	      predictors <- obj$predictorMatrix[tuple, ] == 1
	      x <- data[, predictors, drop = FALSE]
	      x <- expand_factors(x)

	      # filter down to rows that actually have a complete non-NA set of predictors
	      ry <- complete.cases(x) & r[, tuple]
	      wy <- complete.cases(x) & where[, tuple]

	      # remove linear dependent predictors
	      keep <- call_remove_lin_dep(x, y, ry, minvar = minvar, maxcor = maxcor)
	      x <- x[, keep, drop = FALSE]

	      # grab current partition and split matching if necessary
	      match_partitions <- partitions_list[[i]]

	      # perform standard pmm procedure with custom univariate predictions and matching
	      y_hat <- get_pmm_prediction(y, ry, x, wy = wy, matchtype = matchtype, ridge = ridge)
	      res_imp[[tuple]][, m] <- match_univariate(y = as.matrix(y), y_hat = as.matrix(y_hat), ry = ry, wy = wy, donors = donors, match_partitions = match_partitions)

	      next
	    }

	    # build matrix of predictive means
	    y_hat_matrix <- do.call(cbind, lapply(tuple,
	      function(j)
	      {
	        # get current y column
	        y <- data[, j]

	        # get predictor matrix
	        predictors <- obj$predictorMatrix[j, ] == 1
	        x <- data[, predictors, drop = FALSE]
	        x <- expand_factors(x)

	        # filter down to rows that actually have a complete non-NA set of predictors
	        ry <- complete.cases(x) & r[, j]
	        wy <- complete.cases(x) & where[, j]

	        # remove linear dependent predictors
	        keep <- call_remove_lin_dep(x, y, ry, minvar = minvar, maxcor = maxcor)
	        x <- x[, keep, drop = FALSE]

	        # apply current imputation/regression method to obain y_hat
	        return(get_pmm_prediction(y, ry, x, wy = wy, matchtype = matchtype, ridge = ridge))
	      }))

      # prepare multivariate matching, get whole data on current tuple
	    y <- data[, tuple]

      # grab rows of common complete observed and missing data per tuple
	    complete_ry <- complete_R[[i]]
	    complete_wy <- complete_W[[i]]

	    #grab current partition
	    match_partitions <- partitions_list[[i]]

	    # now perform multivariate PMM in current tuple
	    imputes <- match_multivariate(y = y, y_hat = y_hat_matrix, ry = complete_ry, wy = complete_wy, donors = donors, distmetric = distmetric, dimweights = dimweights, match_partitions = match_partitions )

	    # fill in imputed values
	    for (j in seq_along(tuple))
	    {
	      res_imp[[tuple[j]]][, m] <- imputes[,j]
	    }
	  }

  }

  # logging stuff
  if (!state$log)
    loggedEvents <- NULL
  if (state$log)
    row.names(loggedEvents) <- seq_len(nrow(loggedEvents))

	## save, and return
  midsobj <- list(call = call,
                  data = obj$data,
                  where = obj$where,
                  m = obj$m,
                  nmis = obj$nmis,
                  imp = res_imp,
                  method = obj$method,
                  predictorMatrix = obj$predictorMatrix,
                  visitSequence = obj$visitSequence,
                  pad = obj$pad,
                  post = obj$post,
                  seed = obj$seed,
                  iteration = obj$iteration,
                  lastSeedValue = obj$lastSeedValue,
                  chainMean = obj$chainMean,
                  chainVar = obj$chainVar,
                  loggedEvents = loggedEvents)

  oldClass(midsobj) <- "mids"

  return(list(midsobj = midsobj, blocks = blocks))
}

Try the miceExt package in your browser

Any scripts or data that you put into this service are public.

miceExt documentation built on March 18, 2018, 1:18 p.m.