delete_MCAR: Create MCAR values



Description

Create missing completely at random (MCAR) values in a data frame or a matrix.

Usage

delete_MCAR(
  ds,
  p,
  cols_mis = seq_len(ncol(ds)),
  stochastic = FALSE,
  p_overall = FALSE,
  miss_cols
)

Arguments

ds

A data frame or matrix in which missing values will be created.

p

A numeric vector of length one or of the same length as cols_mis; the probability that a value is missing.

cols_mis

A vector of column names or indices of columns in which missing values will be created.

stochastic

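Logical; if TRUE, each value is set to NA independently with probability p; if FALSE, a fixed number of values is set to NA. See details.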

p_overall

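Logical; if TRUE (and stochastic = FALSE), p is applied to all columns in cols_mis together rather than to each column separately. Ignored if stochastic = TRUE. See details.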

miss_cols

Deprecated, use cols_mis instead.

Details

This function creates missing completely at random (MCAR) values in the dataset ds. The proportion of missing values is specified with p. The columns in which missing values are created can be set via cols_mis. If cols_mis is not specified, then missing values are created in every column.

The probability for missing values is controlled by p. If p is a single number, then the overall probability for a value to be missing will be p in all columns of cols_mis. (Internally p will be replicated to a vector of the same length as cols_mis. So, all p[i] in the following sections will be equal to the given single number p.) Otherwise, p must be of the same length as cols_mis. In this case, the overall probability for a value to be missing will be p[i] in the column cols_mis[i].
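The replication rule for p can be illustrated with a few lines of base R. This is only a sketch of the behaviour described above; expand_p is a hypothetical helper introduced for illustration and is not part of missMethods.

```r
# Hypothetical helper (not a missMethods function): returns one
# probability per column in cols_mis, following the rule described above.
expand_p <- function(p, cols_mis) {
  if (length(p) == 1L) {
    # A single number is reused for every column
    p <- rep(p, length(cols_mis))
  } else if (length(p) != length(cols_mis)) {
    stop("p must have length one or the same length as cols_mis")
  }
  p
}

p_single <- expand_p(0.2, c("X", "Y"))          # same probability for X and Y
p_vector <- expand_p(c(0.1, 0.3), c("X", "Y"))  # p[i] belongs to cols_mis[i]
```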

If stochastic = FALSE and p_overall = FALSE (the default), then exactly round(nrow(ds) * p[i]) values will be set to NA in column cols_mis[i]. If stochastic = FALSE and p_overall = TRUE, then p must be of length one and exactly round(nrow(ds) * p * length(cols_mis)) values will be set to NA (over all columns in cols_mis). This can result in a proportion of missing values in an individual column of cols_mis that is not equal to p, but the proportion of missing values in all columns of cols_mis together will be close to p.
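The two non-stochastic cases differ only in where the fixed number of NA values is drawn from. The following base R sketch mimics the counts given above; it assumes simple random sampling without replacement and is not taken from the package source.

```r
set.seed(1)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- 0.2

# p_overall = FALSE: a fixed number of rows is drawn separately per column
ds_col <- ds
for (col in cols_mis) {
  n_mis <- round(nrow(ds) * p)
  ds_col[sample.int(nrow(ds), n_mis), col] <- NA
}
colSums(is.na(ds_col))  # exactly round(20 * 0.2) = 4 NAs in each column

# p_overall = TRUE: a fixed number of cells is drawn from all cells of
# cols_mis together, so the per-column counts may differ
ds_all <- ds
n_mis_total <- round(nrow(ds) * p * length(cols_mis))
cells <- sample.int(nrow(ds) * length(cols_mis), n_mis_total)
rows <- (cells - 1L) %% nrow(ds) + 1L                # row of each drawn cell
cols <- cols_mis[(cells - 1L) %/% nrow(ds) + 1L]     # column of each drawn cell
for (k in seq_along(cells)) ds_all[rows[k], cols[k]] <- NA
sum(is.na(ds_all))      # exactly 8 NAs in total
```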

If stochastic = TRUE, then each value in column cols_mis[i] has the probability p[i] to be missing. In this case, the number of missing values in cols_mis[i] is a random variable with a binomial distribution B(nrow(ds), p[i]). Most of the time, this leads to more or fewer missing values than round(nrow(ds) * p[i]) in each column. If stochastic = TRUE, then the argument p_overall is ignored because it is superfluous.
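The stochastic case can be sketched in base R as one independent Bernoulli draw per value. Using runif for the draws is an assumption made for this illustration; the package may generate them differently.

```r
set.seed(42)
ds <- data.frame(X = 1:20, Y = 101:120)
cols_mis <- c("X", "Y")
p <- c(0.2, 0.2)

ds_sto <- ds
for (i in seq_along(cols_mis)) {
  # Each row is marked missing independently with probability p[i]
  is_mis <- runif(nrow(ds)) < p[i]
  ds_sto[is_mis, cols_mis[i]] <- NA
}

# The number of NAs per column follows B(nrow(ds), p[i]) and therefore
# changes from run to run when no seed is fixed
n_na <- colSums(is.na(ds_sto))
```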

Value

An object of the same class as ds with missing values.

References

Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., & Abreu, P. H. (2019). Generating Synthetic Missing Data: A Review by Missing Mechanism. IEEE Access, 7, 11651-11667.

Examples

ds <- data.frame(X = 1:20, Y = 101:120)
delete_MCAR(ds, 0.2)

Example output

    X   Y
1   1  NA
2  NA 102
3   3 103
4   4 104
5   5 105
6   6 106
7  NA 107
8  NA  NA
9   9 109
10 10 110
11 11 111
12 12 112
13 13  NA
14 14 114
15 15 115
16 NA 116
17 17 117
18 18  NA
19 19 119
20 20 120

missMethods documentation built on July 30, 2020, 5:13 p.m.