In adimajo/MissImp: Missing Data Imputation

library(MissImp)

In the first section, some data are generated. first, a compete mixed-type dataset is drawn with 5 continuous variables, 1 integer variable and 2 categorical variables. Then given the desired missing data proportion and the missing data mechanism, based on the complete dataset, an incomplete dataset with missing values is created.

In the second section of this document "Statistical tests for MCAR", three tests are performed to show if an incomplete dataset follows a Missing Completely At Random (MCAR) mechanism. These three tests are independently used on several generated incomplete datasets with different mechanisms, in order to show their capacity.

Generate data {#generate}

Complete data {#complete}

We generate a complete data set $(Y_j)_{1 \leq j \leq 8}$ with $(Y_1, Y_2, Y_3) \sim \mathcal{N}(\mu_1, \Sigma_1)$, $(Y_4, Y_5) \sim \mathcal{N}(\mu_2, \Sigma_2)$, $Y_6 \sim P(\lambda)$ and $Y_7, Y_8 \sim \text{Bin}(\gamma)$.

n <- 10000
complete_df_generator <- function(n) {
  mu.X <- c(1, 2, 3)
  Sigma.X <- matrix(c(
    9, 3, 2,
    3, 4, 0,
    2, 0, 1
  ), nrow = 3)
  # multivariate normal distribution
  X.complete.cont <- MASS::mvrnorm(n, mu.X, Sigma.X)

  mu1.X <- c(9, 8)
  Sigma1.X <- matrix(c(
    16, 14,
    14, 25
  ), nrow = 2)
  # multivariate normal distribution
  X.complete.cont1 <- MASS::mvrnorm(n, mu1.X, Sigma1.X)

  lambda <- 4.3
  # poisson distribution
  X.complete.discr <- stats::rpois(n, lambda)
  # binomial
  X.complete.cat <- stats::rbinom(n, size = 5, prob = 0.4)
  # binomial
  X.complete.cat2 <- stats::rbinom(n, size = 7, prob = 0.6)

  X.complete <- data.frame(cbind(X.complete.cont, X.complete.cont1,
    X.complete.discr,
    ... = X.complete.cat, X.complete.cat2
  ))
  X.complete[, 7] <- as.factor(X.complete[, 7])
  levels(X.complete[, 7]) <- c("F", "E", "D", "C", "B", "A")
  X.complete[, 8] <- as.factor(X.complete[, 8])
  colnames(X.complete) <- c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8")
  return(X.complete)
}
X.complete <- complete_df_generator(n)

Add missingness {#missingness}

We could generate incomplete dataframe from the complete dataframe, with a chosen percentage of missing data and a chosen missing mechanism. The difference between mechanisms is explained in the documentation of generate_miss.

miss_perc <- 0.4
rs <- generate_miss(X.complete, miss_perc, mechanism = "MNAR1")
head(rs$X.incomp)
head(rs$R.mask)
print(rs$real_miss_perc)

Next, a list of incomplete dataframes is generated with every mechanism that is implemented in generate_miss. First a list of incomplete dataframes is generated on the continuous variables $(Y_1, \dots, Y_5)$, with the result in rs.con; then on the continous + discrete variables $(Y_1, \dots, Y_6)$, with the result in rs.con_dis; and at last on continuous + discrete + categorical variables, with the result in rs.mix.

miss_perc <- 0.3
rs.con <- generate_miss_ls(X.complete[, 1:5], miss_perc)
rs.con_dis <- generate_miss_ls(X.complete[, 1:6], miss_perc)
rs.mix <- generate_miss_ls(X.complete, miss_perc)

list_df.con <- list(rs.con$mcar$X.incomp, rs.con$mar1$X.incomp, rs.con$mar2$X.incomp, rs.con$mar3$X.incomp, rs.con$mnar1$X.incomp, rs.con$mnar2$X.incomp)

list_df.con_dis <- list(rs.con_dis$mcar$X.incomp, rs.con_dis$mar1$X.incomp, rs.con_dis$mar2$X.incomp, rs.con_dis$mar3$X.incomp, rs.con_dis$mnar1$X.incomp, rs.con_dis$mnar2$X.incomp)

list_df.mix <- list(rs.mix$mcar$X.incomp, rs.mix$mar1$X.incomp, rs.mix$mar2$X.incomp, rs.mix$mar3$X.incomp, rs.mix$mnar1$X.incomp, rs.mix$mnar2$X.incomp)

Statistical tests for MCAR {#tests}

p_val <- 0.05

After the generation of incomplete datasets, we test if the missing data mechanism is MCAR by performing three classical statistic tests: the dummy variable test \code{dummy_test}, Little's MCAR test \code{mcar_test} from package \code{naniar} and the normality tests (Hawkin's test and non-parametric test included) \code{TestMCARNormality} from package \code{MissMech}. Our null hypothesis H0 is that the missing data mechanism is MCAR. As the normality test is only written for numerical data, here the test is performed on only the numerical part of the input dataframe.

t <- mcar_test_combined(list_df.mix[[5]], col_cat = c(7, 8), p_val = p_val)
print(t$test_results)

# A dataframe that records the p-values for each test
test_p_vals <- data.frame(matrix(rep(0, 4), nrow = 1))
colnames(test_p_vals) <- c("Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")

ls_test <- list("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2")
i <- 1
t <- 1
col_cat <- c(7:8)
for (ls in list(list_df.con, list_df.con_dis, list_df.mix)) {
  for (df in ls) {
    if (i == 3) {
      out <- suppressWarnings(mcar_test_combined(df, col_cat, p_val))
    }
    else {
      out <- suppressWarnings(mcar_test_combined(df, c(), p_val))
    }
    test_p_vals[t, ] <- out$p_values[1, ]
    t <- t + 1
  }
  i <- i + 1
}
test_p_vals[["type"]] <- c(rep("continuous", 6), rep("continuous+discrete", 6), rep("mix", 6))
test_p_vals[["mechanism"]] <- rep(c("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2"), 3)
test_results <- data.frame(test_p_vals[, c(1, 2)] > p_val)
test_results[["Miss_Mech_test"]] <- c((test_p_vals[["Hawkin.s.test"]] < p_val) * (test_p_vals[["Non.parametric.test"]] < p_val) == 0)
test_p_vals <- test_p_vals[c("type", "mechanism", "Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")]

test_p_vals
test_results

The result of MissMech test is not very accurate. This may due to the problem of implementation in MissMech package, which is not available on CRAN anymore.

adimajo/MissImp documentation built on April 5, 2022, 6:32 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com