#' Generate missing data for simulation purposes
#'
#' This function generates multivariate missing data under a MCAR, MAR or MNAR
#' missing data mechanism. Imputation of data sets containing missing values can
#' be performed with \code{\link{mice}}.
#'
#' This function generates missing values in complete data sets. Amputation of complete
#' data sets is useful for the evaluation of imputation techniques, such as multiple
#' imputation (performed with function \code{\link{mice}} in this package).
#'
#' The basic strategy underlying multivariate imputation was suggested by
#' Don Rubin during discussions in the 1990s. Brand (1999) created one particular
#' implementation, and his method found its way into the FCS paper
#' (Van Buuren et al., 2006).
#'
#' Until recently, univariate amputation procedures were used to generate missing
#' data in complete, simulated data sets. With this approach, variables are made
#' incomplete one variable at a time. When more than one variable needs to be amputed,
#' the procedure is repeated multiple times.
#'
#' With the univariate approach, it is difficult to relate the missingness on one
#' variable to the missingness on another variable. A multivariate amputation procedure
#' solves this issue and moreover, it does justice to the multivariate nature of
#' data sets. Hence, \code{ampute} is developed to perform multivariate amputation.
#'
#' The idea behind the function is the specification of several missingness
#' patterns. Each pattern is a combination of variables with and without missing
#' values (denoted by \code{0} and \code{1} respectively). For example, one might
#' want to create two missingness patterns on a data set with four variables. The
#' patterns could be something like: \code{0,0,1,1} and \code{1,0,1,0}.
#' Each combination of zeros and ones may occur.
#'
#' Furthermore, the researcher specifies the proportion of missingness, either the
#' proportion of missing cases or the proportion of missing cells, and the relative
#' frequency with which each pattern occurs. Consequently, the data is split into
#' multiple subsets, one subset per pattern. Each case is now a candidate for a
#' certain missingness pattern, but whether the case will eventually have missing
#' values depends on other specifications.
#'
#' The first of these specifications is the missingness mechanism. There are three possible
#' mechanisms: the missingness depends completely on chance (MCAR), the missingness
#' depends on the values of the observed variables, i.e. the variables that remain
#' complete (MAR), or on the values of the variables that will be made incomplete (MNAR).
#'
#' When the user specifies the missingness mechanism to be \code{"MCAR"}, the candidates
#' have an equal probability of becoming incomplete. For a \code{"MAR"} or \code{"MNAR"} mechanism,
#' weighted sum scores are calculated. These scores are a linear combination of the
#' variables.
#'
#' To calculate the weighted sum scores, the data is first standardized (unless
#' \code{std = FALSE}). For this reason, the data has to be numeric. Second, for each
#' case, the values in the data set are multiplied by the weights specified by argument
#' \code{weights}. These weighted scores are then summed, resulting in a weighted sum
#' score for each case.
#'
#' The weights may differ between patterns and they may be negative or zero as well.
#' Naturally, in case of a MAR mechanism, the weights corresponding to the
#' variables that will be made incomplete are 0. Note that this may be
#' different for each pattern. In case of MNAR missingness, the weights of the
#' variables that will be made incomplete are especially important. However,
#' the other variables may be weighted as well.
#'
#' It is the relative difference between the weights that will result in an effect
#' in the sum scores. For example, for the first missing data
#' pattern mentioned above, the weights for the third and fourth variables could
#' be set to 2 and 4. However, weight values of 0.2 and 0.4 will have the exact
#' same effect on the weighted sum score: the fourth variable is weighted twice as
#' much as the third variable.
#'
#' Based on the weighted sum scores, either a discrete or continuous distribution
#' of probabilities is used to calculate whether a candidate will have missing values.
#'
#' For a discrete distribution of probabilities, the weighted sum scores are
#' divided into subgroups of equal size (quantiles). Thereafter, the user
#' specifies for each subgroup the odds of being missing. Both the number of
#' subgroups and the odds values are important for the generation of missing data.
#' For example, for a RIGHT-like mechanism, the higher quantiles should have
#' high missingness odds, whereas for a MID-like
#' mechanism, the central groups should have higher odds. Again, it is not the
#' absolute size of the odds values that matters, but their size relative to each other.
#'
#' The continuous distributions of probabilities are based on the logistic distribution function.
#' The user can specify the type of missingness, which, again, may differ between patterns.
#'
#' For an example and more explanation about how the arguments interact with
#' each other, we refer to the vignette:
#' \href{https://rianneschouten.github.io/mice_ampute/vignette/ampute.html}{Generate missing values with ampute}.
#'
#' @param data A complete data matrix or data frame. Values should be numeric.
#' Categorical variables should have been transformed to dummies.
#' @param prop A scalar specifying the proportion of missingness. Should be a value
#' between 0 and 1. Values larger than 1 (up to 100) are interpreted as percentages.
#' Default is a missingness proportion of 0.5.
#' @param patterns A matrix or data frame of size #patterns by #variables where
#' \code{0} indicates that a variable should have missing values and \code{1} indicates
#' that a variable should remain complete. The user may specify as many patterns as
#' desired. One pattern (a vector) is possible as well. Default
#' is a square matrix of size #variables where each pattern has missingness on one
#' variable only (created with \code{\link{ampute.default.patterns}}). After the
#' amputation procedure, \code{\link{md.pattern}} can be used to investigate the
#' missing data patterns in the data.
#' @param freq A vector of length #patterns containing the relative frequency with
#' which the patterns should occur. For example, for three missing data patterns,
#' the vector could be \code{c(0.4, 0.4, 0.2)}, meaning that of all cases with
#' missing values, 40 percent should have pattern 1, 40 percent pattern 2 and 20
#' percent pattern 3. The vector should sum to 1. Default is an equal probability
#' for each pattern, created with \code{\link{ampute.default.freq}}.
#' @param mech A string specifying the missingness mechanism, either "MCAR"
#' (Missing Completely At Random), "MAR" (Missing At Random) or "MNAR" (Missing Not At
#' Random). Default is a MAR missingness mechanism.
#' @param weights A matrix or data frame of size #patterns by #variables. The matrix
#' contains the weights that will be used to calculate the weighted sum scores. For
#' a MAR mechanism, the weights of the variables that will be made incomplete should be
#' zero. For a MNAR mechanism, these weights could have any possible value. Furthermore,
#' the weights may differ between patterns and between variables. They may be negative
#' as well. Within each pattern, the relative size of the values is of importance.
#' The default weights matrix is made with \code{\link{ampute.default.weights}} and
#' returns a matrix with equal weights for all variables. In case of MAR, variables
#' that will be amputed will be weighted with \code{0}. For MNAR, variables
#' that will be observed will be weighted with \code{0}. If the mechanism is MCAR, the
#' weights matrix will not be used.
#' @param std Logical. Whether the weighted sum scores should be calculated with
#' standardized data or with non-standardized data. The latter is especially advised when
#' making use of train and test sets in order to prevent leakage. Default is TRUE.
#' @param cont Logical. Whether the probabilities should be based on a continuous
#' or a discrete distribution. If TRUE, the probabilities of being missing are based
#' on a continuous logistic distribution function. \code{\link{ampute.continuous}}
#' will be used to calculate and assign the probabilities. These probabilities will then
#' be based on the argument \code{type}. If FALSE, the probabilities of being missing are
#' based on a discrete distribution (\code{\link{ampute.discrete}}) based on the \code{odds}
#' argument. Default is TRUE.
#' @param type A string or vector of strings containing the type of missingness for each
#' pattern. Either \code{"LEFT"}, \code{"MID"}, \code{"TAIL"} or \code{"RIGHT"}.
#' If a single missingness type is given, all patterns will be created with the same
#' type. If the missingness types should differ between patterns, a vector of missingness
#' types should be given. Default is RIGHT for all patterns and is the result of
#' \code{\link{ampute.default.type}}.
#' @param odds A matrix where #patterns defines the #rows. Each row should contain
#' the odds of being missing for the corresponding pattern. The number of odds values
#' defines in how many quantiles the sum scores will be divided. The odds values are
#' relative probabilities: a quantile with odds value 4 will have a probability of
#' being missing that is four times higher than a quantile with odds 1. The
#' number of quantiles may differ between the patterns, specify NA for cells remaining empty.
#' Default is 4 quantiles with odds values 1, 2, 3 and 4 and is created by
#' \code{\link{ampute.default.odds}}.
#' @param bycases Logical. If TRUE, the proportion of missingness is defined in
#' terms of cases. If FALSE, the proportion of missingness is defined in terms of
#' cells. Default is TRUE.
#' @param run Logical. If TRUE, the amputations are implemented. If FALSE, the
#' return object will contain everything except for the amputed data set.
#' Default is TRUE.
#'
#' @return Returns an S3 object of class \code{\link{mads}} (multivariate
#' amputed data set).
#' @author Rianne Schouten, Gerko Vink, Peter Lugtig, 2016
#' @seealso \code{\link{mads}}, \code{\link{bwplot.mads}},
#' \code{\link{xyplot.mads}}
#'
#' @references
#' Brand, J.P.L. (1999) \emph{Development, implementation and
#' evaluation of multiple imputation strategies for the statistical analysis of
#' incomplete data sets.} pp. 110-113. Dissertation. Rotterdam: Erasmus University.
#'
#' Schouten, R.M., Lugtig, P and Vink, G. (2018)
#' Generating missing values for simulation purposes: A multivariate
#' amputation procedure.
#' \emph{Journal of Statistical Computation and Simulation}, 88(15): 1909-1930.
#' \doi{10.1080/00949655.2018.1491577}
#'
#' Schouten, R.M. and Vink, G. (2018) The Dance of the Mechanisms: How Observed
#' Information Influences the Validity of Missingness Assumptions.
#' \emph{Sociological Methods and Research}, 50(3): 1243-1258.
#' \doi{10.1177/0049124118799376}
#'
#' Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B. (2006)
#' Fully conditional specification in multivariate imputation.
#' \emph{Journal of Statistical Computation and Simulation}, 76(12): 1049-1064.
#' \doi{10.1080/10629360600810434}
#'
#' Van Buuren, S. (2018).
#' \emph{Flexible Imputation of Missing Data. Second Edition.}
#' Chapman & Hall/CRC. Boca Raton, FL.
#'
#' Vink, G. (2016) Towards a standardized evaluation of multiple imputation routines.
#' @examples
#' # start with a complete data set
#' compl_boys <- cc(boys)[1:3]
#'
#' # Perform amputation with default settings
#' mads_boys <- ampute(data = compl_boys)
#' mads_boys$amp
#'
#' # Change default matrices as desired
#' my_patterns <- mads_boys$patterns
#' my_patterns[1:3, 2] <- 0
#'
#' my_weights <- mads_boys$weights
#' my_weights[2, 1] <- 2
#' my_weights[3, 1] <- 0.5
#'
#' # Rerun amputation
#' my_mads_boys <- ampute(
#'   data = compl_boys,
#'   patterns = my_patterns,
#'   freq = c(0.3, 0.3, 0.4),
#'   weights = my_weights,
#'   type = c("RIGHT", "TAIL", "LEFT")
#' )
#' my_mads_boys$amp
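#'
#' # A further minimal sketch, not part of the original examples: use discrete
#' # probabilities (cont = FALSE) with a user-specified odds matrix.
#' # The odds values below are illustrative assumptions; each row belongs to
#' # one pattern and each column to one quantile of the weighted sum scores.
#' my_odds <- matrix(c(1, 2, 3, 4), nrow = 3, ncol = 4, byrow = TRUE)
#' odds_mads_boys <- ampute(data = compl_boys, cont = FALSE, odds = my_odds)
#' odds_mads_boys$amp
#'
#' # Define the proportion of missingness in terms of cells rather than cases.
#' # With the default patterns (one incomplete variable per pattern) and three
#' # variables, the cell proportion cannot exceed 1/3, so 0.2 is used here.
#' mcar_mads_boys <- ampute(
#'   data = compl_boys, prop = 0.2, mech = "MCAR", bycases = FALSE
#' )
#' mcar_mads_boys$prop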
#' @export
ampute <- function(data, prop = 0.5, patterns = NULL, freq = NULL,
mech = "MAR", weights = NULL, std = TRUE, cont = TRUE,
type = NULL, odds = NULL,
bycases = TRUE, run = TRUE) {
if (is.null(data)) {
stop("Argument data is missing, with no default", call. = FALSE)
}
data.in <- data # preserve an original set to inject the NA's in later
data <- check.dataform(data)
if (anyNA(data)) {
stop("Data cannot contain NAs", call. = FALSE)
}
if (ncol(data) < 2) {
stop("Data should contain at least two columns", call. = FALSE)
}
data <- data.frame(data)
if (any(vapply(data, Negate(is.numeric), logical(1))) && mech != "MCAR") {
data <- as.data.frame(sapply(data, as.numeric))
warning("Data is made numeric internally, because the calculation of weights requires numeric data",
call. = FALSE
)
}
if (prop < 0 || prop > 100) {
stop("Proportion of missingness should be a value between 0 and 1 (for a proportion) or between 1 and 100 (for a percentage)",
call. = FALSE
)
} else if (prop > 1) {
prop <- prop / 100
}
if (is.null(patterns)) {
patterns <- ampute.default.patterns(n = ncol(data))
} else if (is.vector(patterns) && (length(patterns) / ncol(data)) %% 1 == 0) {
patterns <- matrix(patterns, nrow = length(patterns) / ncol(data), byrow = TRUE)
if (nrow(patterns) == 1 && all(patterns[1, ] %in% 1)) {
stop("One pattern with merely ones results in no amputation at all, the procedure is therefore stopped", call. = FALSE)
}
} else if (is.vector(patterns)) {
stop("Length of pattern vector does not match #variables", call. = FALSE)
}
patterns <- data.frame(patterns)
if (is.null(freq)) {
freq <- ampute.default.freq(patterns = patterns)
}
if (!is.vector(freq)) {
freq <- as.vector(freq)
warning("Frequency should be a vector", call. = FALSE)
}
if (length(freq) != nrow(patterns)) {
if (length(freq) > nrow(patterns)) {
# Keep only the first #patterns frequencies
freq <- freq[seq_len(nrow(patterns))]
} else {
freq <- c(freq, rep.int(0.2, nrow(patterns) - length(freq)))
}
warning(paste("Length of vector with relative frequencies does not match #patterns and is therefore changed to", toString(freq)), call. = FALSE)
}
if (sum(freq) != 1) {
freq <- recalculate.freq(freq = freq)
}
check.pat <- check.patterns(
patterns = patterns,
freq = freq,
prop = prop
)
patterns.new <- check.pat[["patterns"]]
freq <- check.pat[["freq"]]
prop <- check.pat[["prop"]]
if (!bycases) {
prop <- recalculate.prop(
prop = prop,
freq = freq,
patterns = patterns.new,
k = ncol(data),
n = nrow(data)
)
}
if (any(!mech %in% c("MCAR", "MAR", "MNAR"))) {
stop("Mechanism should be either MCAR, MAR or MNAR", call. = FALSE)
}
if (!is.vector(mech)) {
mech <- as.vector(mech)
warning("Mechanism should contain merely MCAR, MAR or MNAR", call. = FALSE)
} else if (length(mech) > 1) {
mech <- mech[1]
warning("Mechanism should contain merely MCAR, MAR or MNAR. First element is used",
call. = FALSE
)
}
# Check if there is a pattern with merely zeros
if (!is.null(check.pat[["row.zero"]]) && mech == "MAR") {
stop(paste("Patterns object contains merely zeros and this kind of pattern is not possible when mechanism is MAR"),
call. = FALSE
)
}
if (mech == "MCAR" && !is.null(weights)) {
weights <- NULL
warning("Weights matrix is not used when mechanism is MCAR", call. = FALSE)
}
if (mech == "MCAR" && !is.null(odds)) {
odds <- NULL
warning("Odds matrix is not used when mechanism is MCAR", call. = FALSE)
}
if (mech != "MCAR" && !is.null(weights)) {
if (is.vector(weights) && (length(weights) / ncol(data)) %% 1 == 0) {
weights <- matrix(weights, nrow = length(weights) / ncol(data), byrow = TRUE)
} else if (is.vector(weights)) {
stop("Length of weight vector does not match #variables", call. = FALSE)
} else if (!is.matrix(weights) && !is.data.frame(weights)) {
stop("Weights matrix should be a matrix", call. = FALSE)
}
}
if (is.null(weights)) {
weights <- ampute.default.weights(
patterns = patterns.new,
mech = mech
)
}
weights <- as.data.frame(weights)
if (!nrow(weights) == nrow(patterns.new)) {
if (!is.null(check.pat[["row.one"]])) {
weights <- weights[-check.pat[["row.one"]], ]
}
}
if (!nrow(weights) == nrow(patterns.new)) {
stop("The objects patterns and weights are not matching", call. = FALSE)
}
if (!is.vector(cont)) {
cont <- as.vector(cont)
warning("Continuous should contain merely TRUE or FALSE", call. = FALSE)
} else if (length(cont) > 1) {
cont <- cont[1]
warning("Continuous should contain merely TRUE or FALSE. First element is used",
call. = FALSE
)
}
if (!is.logical(cont)) {
stop("Continuous should contain TRUE or FALSE", call. = FALSE)
}
if (cont && !is.null(odds)) {
odds <- NULL
warning("Odds matrix is not used when continuous probabilities (cont == TRUE) are specified",
call. = FALSE
)
}
if (!cont && !is.null(type)) {
type <- NULL
warning("Type is not used when discrete probabilities (cont == FALSE) are specified",
call. = FALSE
)
}
if (is.null(type)) {
type <- ampute.default.type(patterns = patterns.new)
}
if (any(!type %in% c("LEFT", "MID", "TAIL", "RIGHT"))) {
stop("Type should contain LEFT, MID, TAIL or RIGHT",
call. = FALSE
)
}
if (!is.vector(type)) {
type <- as.vector(type)
warning("Type should be a vector of strings", call. = FALSE)
} else if (!length(type) %in% c(1, nrow(patterns), nrow(patterns.new))) {
type <- type[1]
warning("Type should either have length 1 or length equal to #patterns, first element is used for all patterns", call. = FALSE)
}
if (mech != "MCAR" && !is.null(odds) && !is.matrix(odds)) {
if (nrow(patterns.new) == 1 && is.vector(odds)) {
odds <- matrix(odds, nrow = 1)
} else {
stop("Odds matrix should be a matrix", call. = FALSE)
}
}
if (is.null(odds)) {
odds <- ampute.default.odds(patterns = patterns.new)
}
if (!cont) {
for (h in seq_len(nrow(odds))) {
if (any(!is.na(odds[h, ]) & odds[h, ] < 0)) {
stop("Odds matrix can only have positive values", call. = FALSE)
}
}
}
if (!nrow(odds) %in% c(nrow(patterns), nrow(patterns.new))) {
stop("The objects patterns and odds are not matching", call. = FALSE)
}
#
# Start using arguments
# Create empty objects
P <- NULL
scores <- NULL
missing.data <- NULL
# Apply function (run = TRUE) or merely return objects (run = FALSE)
if (run) {
# Assign cases to the patterns according to the relative frequencies
# Because 0 and 1 will be used for missingness,
# the numbering of the patterns will start from 2
P <- sample.int(
n = nrow(patterns.new), size = nrow(data),
replace = TRUE, prob = freq
) + 1
# Check whether cases are assigned to all patterns
non.used.patterns <- c(2:(nrow(patterns.new) + 1))[!c(2:(nrow(patterns.new) + 1)) %in% unique(P)]
if (length(non.used.patterns) > 0) {
warning(paste0("No records are assigned to patterns ", toString(non.used.patterns - 1), ". These patterns will not be generated. Consider reducing the number of patterns or increasing the dataset size."), call. = FALSE)
}
# Calculate missingness according to MCAR or calculate weighted sum scores
# Standardized data is used to calculate weighted sum scores
if (mech == "MCAR") {
R <- ampute.mcar(
P = P,
patterns = patterns.new,
prop = prop
)
} else {
scores <- sumscores(
P = P,
data = data,
std = std,
weights = weights,
patterns = patterns
)
if (!cont) {
R <- ampute.discrete(
P = P,
scores = scores,
odds = odds,
prop = prop
)
} else {
R <- ampute.continuous(
P = P,
scores = scores,
prop = round(prop, 3),
type = type
)
}
}
missing.data <- data
for (i in seq_len(nrow(patterns.new))) {
if (any(P == (i + 1))) {
missing.data[R[[i]] == 0, patterns.new[i, ] == 0] <- NA
}
}
}
# Create return object
names(patterns.new) <- names(data)
names(weights) <- names(data)
call <- match.call()
data.in[is.na(data.frame(missing.data))] <- NA
result <- mads(
call = call,
prop = prop,
patterns = patterns.new,
freq = freq,
mech = mech,
weights = weights,
cont = cont,
type = type,
odds = odds,
amp = data.in,
cand = P - 1,
scores = scores,
data = as.data.frame(data))
return(result)
}
# This is an underlying function of the multivariate amputation function ampute().
# This function is used to calculate the weighted sum scores of the candidates.
# Based on the data, the weights matrix and the kind of mechanism, each case
# will obtain a certain score that will define its probability of being made missing.
# The calculation of the probabilities occurs in the function ampute.mcar(),
# ampute.continuous() or ampute.discrete(), based on the kind of missingness.
sumscores <- function(P, data, std, weights, patterns) {
weights <- as.matrix(weights)
f <- function(i) {
if (length(P[P == (i + 1)]) == 0) {
return(0) # this will ensure length 1 which is used in ampute.continuous
} else {
candidates <- as.matrix(data[P == (i + 1), ])
# For each candidate in the pattern, a weighted sum score is calculated
if (std) {
length_unique <- function(x) {
return(length(unique(x)) == 1)
}
# shangzhi-hong, Feb 2020, #216
if (nrow(candidates) > 1 && !(any(apply(candidates, 2, length_unique)))) {
candidates <- scale(candidates)
}
}
scores <- apply(candidates, 1, function(x) weights[i, ] %*% x)
if (length(scores) > 1 && length(unique(scores)) != 1) {
scores <- scale(scores)
}
return(scores)
}
}
lapply(seq_len(nrow(patterns)), f)
}
# This is an underlying function of multivariate amputation function ampute().
# The function recalculates the proportion of missing cases for the desired
# #missing cells.
recalculate.prop <- function(prop, n, k, patterns, freq) {
miss <- prop * n * k # Desired #missing cells
# Calculate #cases according to prop and #zeros in patterns
cases <- vapply(
seq_len(nrow(patterns)),
function(i) (miss * freq[i]) / length(patterns[i, ][patterns[i, ] == 0]),
numeric(1)
)
if (sum(cases) > n) {
stop("Proportion of missing cells is too large in combination with the desired number of missing variables",
call. = FALSE
)
} else {
prop <- sum(cases) / n
}
prop
}
# This is an underlying function of multivariate amputation function ampute().
# The function recalculates the frequency vector to make the sum equal to 1.
recalculate.freq <- function(freq) {
freq / sum(freq)
}
# This is an underlying function of multivariate amputation function ampute().
# The function checks whether there are patterns with merely ones or zeros.
# In case of the former, these patterns will be removed, and arguments prop
# and freq will be changed. In case there is a pattern with merely zeros,
# this is ascertained and saved in the object row.zero.
check.patterns <- function(patterns, freq, prop) {
prop.one <- 0
row.one <- c()
for (h in seq_len(nrow(patterns))) {
if (any(!patterns[h, ] %in% c(0, 1))) {
stop(paste("Argument patterns can only contain 0 and 1, pattern", h, "contains another element"), call. = FALSE)
}
if (all(patterns[h, ] %in% 1)) {
prop.one <- prop.one + freq[h]
row.one <- c(row.one, h)
}
}
if (prop.one != 0) {
warning(paste("Proportion of missingness has changed from", prop, "to", (1 - prop.one) * prop, "because of pattern(s) with merely ones"), call. = FALSE)
prop <- (1 - prop.one) * prop
freq <- freq[-row.one]
freq <- recalculate.freq(freq)
patterns <- patterns[-row.one, ]
warning("Frequency vector and patterns matrix have changed because of pattern(s) with merely ones", call. = FALSE)
}
prop.zero <- 0
row.zero <- c()
for (h in seq_len(nrow(patterns))) {
if (all(patterns[h, ] %in% 0)) {
prop.zero <- prop.zero + freq[h]
row.zero <- c(row.zero, h)
}
}
objects <- list(
patterns = patterns,
prop = prop,
freq = freq,
row.one = row.one,
row.zero = row.zero
)
objects
}