View source: R/delete_one_group.R
delete_MAR_one_group | R Documentation |
Create missing at random (MAR) values by deleting values in one of two groups in a data frame or a matrix
delete_MAR_one_group(
  ds,
  p,
  cols_mis,
  cols_ctrl,
  cutoff_fun = median,
  prop = 0.5,
  use_lpSolve = TRUE,
  ordered_as_unordered = FALSE,
  n_mis_stochastic = FALSE,
  ...,
  miss_cols,
  ctrl_cols,
  stochastic
)
ds |
A data frame or matrix in which missing values will be created. |
p |
A numeric vector with length one or equal to length |
cols_mis |
A vector of column names or indices of columns in which missing values will be created. |
cols_ctrl |
A vector of column names or indices of columns, which
controls the creation of missing values in |
cutoff_fun |
Function that calculates the cutoff values in the
|
prop |
Numeric of length one; (minimum) proportion of rows in group 1 (only used for unordered factors). |
use_lpSolve |
Logical; should lpSolve be used for the determination of
groups, if |
ordered_as_unordered |
Logical; should ordered factors be treated as unordered factors. |
n_mis_stochastic |
Logical, should the number of missing values be
stochastic? If |
... |
Further arguments passed to |
miss_cols |
Deprecated, use |
ctrl_cols |
Deprecated, use |
stochastic |
Deprecated, use |
This function creates missing at random (MAR) values in the columns
specified by the argument cols_mis.
The probability for missing values is controlled by p.
If p is a single number, then the overall probability for a value to
be missing will be p in all columns of cols_mis.
(Internally, p will be replicated to a vector of the same length as
cols_mis, so all p[i] in the following sections will be equal to the
given single number p.)
Otherwise, p must be of the same length as cols_mis.
In this case, the overall probability for a value to be missing will be
p[i] in the column cols_mis[i].
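The replication of a single p can be pictured with the following base R sketch. It only illustrates the rule described above and is not the package's internal code; the column names and the variable p_by_col are hypothetical.

```r
# Hypothetical inputs for illustration.
cols_mis <- c("X", "Y", "Z")
p <- 0.2

# A single p is recycled to one probability per column in cols_mis.
if (length(p) == 1L) {
  p <- rep(p, length(cols_mis))
}

# Otherwise p must already have one entry per column.
stopifnot(length(p) == length(cols_mis))

# p[i] now applies to cols_mis[i].
p_by_col <- setNames(p, cols_mis)
print(p_by_col)
```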
The position of the missing values in cols_mis[i] is controlled by
cols_ctrl[i].
The following procedure is applied for each pair of cols_ctrl[i] and
cols_mis[i] to determine the positions of missing values:
First, the rows of ds are divided into two groups.
To do so, cutoff_fun calculates a cutoff value for cols_ctrl[i]
(via cutoff_fun(ds[, cols_ctrl[i]], ...)).
Group 1 consists of the rows whose values in cols_ctrl[i] are below the
calculated cutoff value.
If group 1 defined in this way is empty, the rows that are equal to the
cutoff value will be added to this group (otherwise, these rows will
belong to group 2).
Group 2 consists of the remaining rows, which are not part of group 1.
Next, one of these two groups is chosen randomly.
In the chosen group, values are deleted in cols_mis[i].
In the other group, no missing values will be created in cols_mis[i].
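The procedure for a single pair of columns can be sketched in base R as follows. This is an illustration of the steps described above, not the package's implementation. It assumes that both groups are equally likely to be chosen and that a fixed number of round(nrow(ds) * p) values is deleted; all variable names are hypothetical.

```r
set.seed(123)  # only to make the sketch reproducible

ds <- data.frame(X = 1:20, Y = 101:120)
col_mis <- "X"   # column that receives missing values
col_ctrl <- "Y"  # column that controls where they appear
p <- 0.2

# Step 1: split the rows at the cutoff of the control column.
cutoff <- median(ds[, col_ctrl])
in_group1 <- ds[, col_ctrl] < cutoff
if (!any(in_group1)) {
  # An empty group 1 takes the rows equal to the cutoff.
  in_group1 <- ds[, col_ctrl] <= cutoff
}
group1 <- which(in_group1)
group2 <- which(!in_group1)

# Step 2: choose one of the two groups at random
# (assumption: equal probability for both groups).
chosen <- if (runif(1) < 0.5) group1 else group2

# Step 3: delete values in the chosen group only
# (assumption: a fixed number of missing values).
n_mis <- round(nrow(ds) * p)
na_rows <- chosen[sample.int(length(chosen), n_mis)]
ds_mis <- ds
ds_mis[na_rows, col_mis] <- NA

print(ds_mis)
```

All missing values in X fall either in the rows with Y below the median or in the remaining rows, never in both, which is what makes the missingness depend on Y.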
If p is too high, the chosen group may not contain enough rows to reach
nrow(ds) * p missing values. In this case, p is reduced to the maximum
possible value (given the randomly chosen group) and a warning is issued.
This will occur regularly if p > 0.5, so the function should normally
not be called with p > 0.5. However, it can also occur for smaller values
of p, depending on the grouping. The warning can be silenced by setting
the option missMethods.warn.too.high.p to FALSE.
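The arithmetic behind this limit can be sketched as follows: a group with g rows can hold at most g missing values, so the largest reachable overall probability is g / nrow(ds). The helper max_feasible_p and the group sizes are hypothetical and only illustrate the reasoning; the option name is the one given above.

```r
# Hypothetical helper: largest p that a group of `group_size` rows can support.
max_feasible_p <- function(n_rows, group_size) group_size / n_rows

n_rows <- 20

# Balanced split (10 rows per group): p = 0.7 asks for 14 NAs, only 10 fit.
p_balanced <- min(0.7, max_feasible_p(n_rows, group_size = 10))

# Unbalanced split (chosen group has 3 rows): even p = 0.2 is too high.
p_unbalanced <- min(0.2, max_feasible_p(n_rows, group_size = 3))

print(c(balanced = p_balanced, unbalanced = p_unbalanced))

# Silence the warning for a block of code, then restore the previous setting.
old_opts <- options(missMethods.warn.too.high.p = FALSE)
# ... calls to delete_MAR_one_group() would go here ...
options(old_opts)
```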
An object of the same class as ds with missing values.
If ds[, cols_ctrl[i]] is an unordered factor, then the concept of a
cutoff value is not meaningful and cannot be applied.
Instead, a combination of the levels of the unordered factor is searched that
guarantees that at least a proportion of prop rows are in group 1 and
minimizes the difference between prop and the proportion of
rows in group 1.
This can be seen as a binary search problem, which is solved by the solver
from the package lpSolve if use_lpSolve = TRUE.
If use_lpSolve = FALSE, a very simple heuristic is applied.
The heuristic only guarantees that at least a proportion of prop rows
are in group 1.
The choice use_lpSolve = FALSE is not recommended and should only be
considered if the solver of lpSolve fails.
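To make the search target concrete, the following base R sketch enumerates every non-empty combination of levels and keeps the one whose row count is at least prop * n and as close to it as possible. This brute-force enumeration is a stand-in chosen for illustration; it is neither the lpSolve formulation nor the package's heuristic, and the factor ctrl is hypothetical.

```r
# Hypothetical unordered control column with 10 rows.
ctrl <- factor(c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d"))
prop <- 0.5

lev_count <- as.vector(table(ctrl))  # rows per level: a = 3, b = 2, c = 4, d = 1
n_lev <- nlevels(ctrl)
n_needed <- prop * length(ctrl)      # minimum number of rows in group 1

best_levels <- NULL
best_count <- Inf

# Each integer from 1 to 2^n_lev - 1 encodes one non-empty subset of levels.
for (mask in seq_len(2^n_lev - 1)) {
  pick <- (mask %/% 2^(seq_len(n_lev) - 1)) %% 2 == 1
  count <- sum(lev_count[pick])
  # Keep the subset that meets the minimum and is closest to it.
  if (count >= n_needed && count < best_count) {
    best_levels <- levels(ctrl)[pick]
    best_count <- count
  }
}

in_group1 <- ctrl %in% best_levels
print(best_levels)
print(mean(in_group1))
```

Enumeration grows as 2^k in the number of levels k, which is one reason a dedicated solver is preferable for factors with many levels.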
If ordered_as_unordered = TRUE, then ordered factors will be treated
like unordered factors and the same binary search problem will be solved for
both types of factors.
If ordered_as_unordered = FALSE (the default), then ordered factors
will be grouped via cutoff_fun as described in Details.
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667
delete_MNAR_one_group
Other functions to create MAR:
delete_MAR_1_to_x(), delete_MAR_censoring(), delete_MAR_rank()
ds <- data.frame(X = 1:20, Y = 101:120)
delete_MAR_one_group(ds, 0.2, "X", "Y")