This function generates multivariate missing data in a MCAR, MAR or MNAR manner.
Imputation of data sets containing missing values can be performed with mice.
data 
A complete data matrix or dataframe. Values should be numeric. Categorical variables should have been transformed into dummies. 
prop 
A scalar specifying the proportion of missingness. Should be a value between 0 and 1. Default is a missingness proportion of 0.5. 
patterns 
A matrix or data frame of size #patterns by #variables where

freq 
A vector of length #patterns containing the relative frequency with
which the patterns should occur. For example, for three missing data patterns,
the vector could be 
mech 
A string specifying the missingness mechanism, either MCAR (Missing Completely At Random), MAR (Missing At Random) or MNAR (Missing Not At Random). Default is a MAR missingness mechanism. 
weights 
A matrix or data frame of size #patterns by #variables. The matrix
contains the weights that will be used to calculate the weighted sum scores. For
a MAR mechanism, the weights of the variables that will be made incomplete should be
zero. For a MNAR mechanism, these weights may have any value. Furthermore,
the weights may differ between patterns and between variables. They may be negative
as well. Within each pattern, the relative size of the values is what matters.
The default weights matrix is made with 
std 
Logical. Whether the weighted sum scores should be calculated with standardized data or with nonstandardized data. The latter is advised when making use of training and test sets in order to prevent leakage. 
cont 
Logical. Whether the probabilities should be based on a continuous
or discrete distribution. If TRUE, the probabilities of being missing are based
on a continuous logistic distribution function. 
type 
A vector of strings containing the type of missingness for each
pattern. Either 
odds 
A matrix where #patterns defines the #rows. Each row should contain
the odds of being missing for the corresponding pattern. The number of odds values
defines into how many quantiles the sum scores will be divided. The values are
relative probabilities: a quantile with odds value 4 will have a probability of
being missing that is four times higher than a quantile with odds 1. The
#quantiles may differ between the patterns; specify NA for cells remaining empty.
Default is 4 quantiles with odds values 1, 2, 3 and 4, the result of

bycases 
Logical. If TRUE, the proportion of missingness is defined in terms of cases. If FALSE, the proportion of missingness is defined in terms of cells. Default is TRUE. 
run 
Logical. If TRUE, the amputations are implemented. If FALSE, the return object will contain everything but the amputed data set. 
When new multiple imputation techniques are tested, missing values need to be
generated in simulated data sets. The generation of missing values is what
we call amputation. The function ampute is developed to perform any kind
of amputation desired by the researcher. An extensive example and more explanation
of the function can be found in the vignette Generate missing values with
ampute, available in mice as well. For imputation, the function mice is advised.
Until recently, univariate amputation procedures were used to generate missing data in complete, simulated data sets. With this approach, variables are made incomplete one variable at a time. When several variables need to be amputed, the procedure is repeated multiple times.
With this univariate approach, it is difficult to relate the missingness on one
variable to the missingness on another variable. A multivariate amputation procedure
solves this issue and, moreover, does justice to the multivariate nature of
data sets. Hence, ampute is developed to perform the amputation according
to the researcher's desires.
The idea behind the function is the specification of several missingness
patterns. Each pattern is a combination of variables with and without missing
values (denoted by 0 and 1 respectively). For example, one might
want to create two missingness patterns on a data set with four variables. The
patterns could be something like: 0, 0, 1, 1 and 1, 0, 1, 0.
Each combination of zeros and ones may occur.
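As a small illustration, the two example patterns can be written as a matrix with one row per pattern and one column per variable. This is only a sketch in base R: the variable names V1 to V4 and the helper object incomplete are hypothetical and not part of ampute.

```r
# Two missingness patterns for a data set with four variables.
# 0 = the variable is made incomplete, 1 = the variable remains observed.
patterns <- matrix(
  c(0, 0, 1, 1,
    1, 0, 1, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("V1", "V2", "V3", "V4"))  # hypothetical names
)

# For each pattern, list the variables that would receive missing values.
incomplete <- lapply(
  seq_len(nrow(patterns)),
  function(i) colnames(patterns)[patterns[i, ] == 0]
)
```

A matrix of this shape is what the patterns argument expects: #patterns rows by #variables columns.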
Furthermore, the researcher specifies the proportion of missingness, either the proportion of missing cases or the proportion of missing cells, and the relative frequency with which each pattern occurs. Consequently, the data is divided over the patterns with these probabilities. Now, each case is a candidate for a certain missingness pattern, but whether the case will eventually have missing values depends on other specifications.
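The division of cases over patterns can be pictured with a short base R sketch. It assumes two patterns with relative frequencies 0.3 and 0.7 and 1000 cases; these numbers and the object names are made up for illustration, and the code does not claim to reproduce the internals of ampute.

```r
set.seed(1)

n    <- 1000          # number of cases (illustrative)
freq <- c(0.3, 0.7)   # relative frequency of each pattern (illustrative)

# Each case becomes a candidate for exactly one pattern,
# drawn with probabilities given by freq.
candidate <- sample(seq_along(freq), size = n, replace = TRUE, prob = freq)

# Observed share of candidates per pattern.
share <- table(candidate) / n
```

At this stage no values are missing yet: candidate only records which pattern a case could receive.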
The first of these specifications is the missingness mechanism. There are three possible mechanisms: the missingness depends completely on chance (MCAR), the missingness depends on the values of the observed variables, i.e. the variables that remain complete (MAR), or the missingness depends on the values of the variables that will be made incomplete (MNAR). For a more thorough explanation of these definitions, I refer to Van Buuren (2012).
When the user sets the missingness mechanism to "MCAR", the candidates
have an equal probability of having missing values. No other specifications
have to be made. For a "MAR" or "MNAR" mechanism, weighted sum
scores are calculated. These scores are a linear combination of the
variables.
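For the MCAR case described above, a minimal base R sketch could look as follows. It assumes that all ten simulated cases are candidates for the pattern 0, 0, 1, 1 and that each is amputed with the same probability prop. The data and object names are illustrative only and this is not the implementation used by ampute.

```r
set.seed(42)

data    <- matrix(rnorm(40), nrow = 10, ncol = 4)  # simulated complete data
pattern <- c(0, 0, 1, 1)                           # 0 = make incomplete
prop    <- 0.5                                     # missingness probability

# MCAR: every candidate has the same probability of being amputed,
# regardless of its data values.
amputed_rows <- rbinom(nrow(data), size = 1, prob = prop) == 1

# Set the variables flagged with 0 to NA for the selected cases.
amp <- data
amp[amputed_rows, pattern == 0] <- NA
```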
To calculate the weighted sum scores, the data is first standardized (unless std
is set to FALSE), which is why the data has to be numeric. Second, for each case, the values in
the data set are multiplied by the weights specified by the argument weights.
These weighted scores are then summed, resulting in a weighted sum score for each case.
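The calculation of a weighted sum score can be sketched in a few lines of base R. The data are simulated and the weights 0, 0, 2, 4 are an assumed MAR weighting for the pattern 0, 0, 1, 1; neither comes from the package defaults.

```r
set.seed(123)

data    <- matrix(rnorm(5 * 4, mean = 5), nrow = 5, ncol = 4)
weights <- c(0, 0, 2, 4)   # illustrative weights for one pattern

# Step 1: standardize each variable (mean 0, standard deviation 1).
z <- scale(data)

# Step 2: multiply by the weights and sum per case.
wss <- as.vector(z %*% weights)
```

Each element of wss is the weighted sum score of one case; here it equals 2 times the standardized third variable plus 4 times the standardized fourth variable.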
The weights may differ between patterns and they may be negative or zero as well.
Naturally, in case of a MAR mechanism, the weights corresponding to the
variables that will be made incomplete are 0. Note that this might be
different for each pattern. In case of MNAR missingness, the weights of
the variables that will be made incomplete are especially important. However,
the other variables might be weighted as well.
It is the relative difference between the weights that will result in an effect in the sum scores. For example, for the first missing data pattern mentioned above, the weights for the third and fourth variables might be set to 2 and 4. However, weight values of 0.2 and 0.4 will have the exact same effect on the weighted sum score: the fourth variable is weighted twice as much as variable 3.
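The point about relative weights can be checked numerically. In the following sketch, on simulated standardized data, the weights 2 and 4 produce scores that are exactly ten times the scores from 0.2 and 0.4, so the cases keep the same relative positions. The data and object names are illustrative.

```r
set.seed(2016)

z <- scale(matrix(rnorm(200 * 4), ncol = 4))   # simulated standardized data

wss_a <- as.vector(z %*% c(0, 0, 2, 4))        # weights 2 and 4
wss_b <- as.vector(z %*% c(0, 0, 0.2, 0.4))    # weights 0.2 and 0.4

# The two score vectors differ only by a constant factor of 10.
same_up_to_scale <- isTRUE(all.equal(wss_a, 10 * wss_b))

# Hence they are perfectly correlated.
r <- cor(wss_a, wss_b)
```

Because quantile membership depends only on the ordering of the scores, both weight vectors lead to the same subgroups.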
Based on the weighted sum scores, either a discrete or continuous distribution of probabilities is used to calculate whether a candidate will have missing values.
For a discrete distribution of probabilities, the weighted sum scores are divided into subgroups of equal size (quantiles). Thereafter, the user specifies for each subgroup the odds of being missing. Both the number of subgroups and the odds values are important for the generation of missing data. For example, for a RIGHT-like mechanism, scoring in one of the higher quantiles should have high missingness odds, whereas for a MID-like mechanism, the central groups should have higher odds. Again, it is not the size of the odds values that matters, but the relative distance between the values.
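A rough base R sketch of the discrete approach follows. It assumes four quantile groups with odds 1, 2, 3 and 4 and a target missingness proportion of 0.5. The rescaling of odds into probabilities used here (so that the average probability equals prop) is one possible normalisation chosen for illustration; it is not necessarily the one ampute applies.

```r
set.seed(7)

wss  <- rnorm(1000)      # simulated weighted sum scores
odds <- c(1, 2, 3, 4)    # relative odds per quantile group (RIGHT-like)
prop <- 0.5              # target proportion of missingness

# Divide the scores into equally sized groups.
breaks <- quantile(wss, probs = seq(0, 1, length.out = length(odds) + 1))
group  <- cut(wss, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Illustrative normalisation: keep the odds ratios between groups and
# make the mean probability over the equally sized groups equal to prop.
prob_by_group <- odds * prop / mean(odds)

# Draw the missingness indicator for every candidate.
amputed <- rbinom(length(wss), size = 1, prob = prob_by_group[group]) == 1
```

With these inputs the highest quantile group is four times as likely to be amputed as the lowest, which matches the interpretation of the odds as relative probabilities.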
The continuous distributions of probabilities are based on the logit function, as described by Van Buuren (2012). The user can specify the type of missingness, which, again, may differ between patterns.
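To give an impression of what the continuous types mean, the sketch below maps weighted sum scores to probabilities with the standard logistic function plogis from base R. The RIGHT and LEFT curves follow directly from that function. The MID and TAIL curves are simple illustrative shapes chosen here, not the formulas used by ampute.

```r
wss <- seq(-3, 3, by = 0.5)   # a grid of weighted sum scores

# RIGHT: higher scores give a higher probability of being missing.
p_right <- plogis(wss)

# LEFT: lower scores give a higher probability of being missing.
p_left <- plogis(-wss)

# MID (illustrative shape): scores near the centre are most likely missing.
p_mid <- 2 * plogis(-abs(wss))

# TAIL (illustrative shape): scores in both tails are most likely missing.
p_tail <- 1 - p_mid
```

Plotting these vectors against wss, for example with plot(wss, p_right, type = "l"), shows the four shapes.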
For an extensive example of the working of the function, I gladly refer to the vignette Generate missing values with ampute.
Returns an S3 object of class mads-class (multivariate amputed data set)
Rianne Schouten [aut, cre], Gerko Vink [aut], Peter Lugtig [ctb], 2016
Brand, J.P.L. (1999). Development, implementation and evaluation of multiple imputation strategies for the statistical analysis of incomplete data sets (pp. 110-113). Dissertation. Rotterdam: Erasmus University.
Van Buuren, S., Brand, J.P.L., Groothuis-Oudshoorn, C.G.M., Rubin, D.B. (2006). Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation, 76(12), Appendix B.
Van Buuren, S. (2018). Flexible Imputation of Missing Data. Second Edition. Boca Raton, FL.: Chapman & Hall/CRC Press.
Vink, G. (2016). Towards a standardized evaluation of multiple imputation routines.
mads-class, bwplot, xyplot, mice
# Simulate data set with mvrnorm from package MASS.
require(MASS)
require(mice)
sigma <- matrix(data = c(1, 0.2, 0.2, 0.2, 1, 0.2, 0.2, 0.2, 1), nrow = 3)
complete.data <- mvrnorm(n = 100, mu = c(5, 5, 5), Sigma = sigma)
# Perform quick amputation
result1 <- ampute(data = complete.data)
# Change default matrices as desired
patterns <- result1$patterns
patterns[1:3, 2] <- 0
odds <- result1$odds
odds[2, 3:4] <- c(2, 4)
odds[3, ] <- c(3, 1, NA, NA)
# Rerun amputation
result2 <- ampute(data = complete.data, patterns = patterns,
                  freq = c(0.3, 0.3, 0.4), cont = FALSE, odds = odds)
# Run an amputation procedure with continuous probabilities
result3 <- ampute(data = complete.data, type = c("RIGHT", "TAIL", "LEFT"))
