library(MissImp)
In the first section, some data are generated. first, a compete mixed-type dataset is drawn with 5 continuous variables, 1 integer variable and 2 categorical variables. Then given the desired missing data proportion and the missing data mechanism, based on the complete dataset, an incomplete dataset with missing values is created.
In the second section of this document "Statistical tests for MCAR", three tests are performed to show if an incomplete dataset follows a Missing Completely At Random (MCAR) mechanism. These three tests are independently used on several generated incomplete datasets with different mechanisms, in order to show their capacity.
We generate a complete data set $(Y_j)_{1 \leq j \leq 8}$ with $(Y_1, Y_2, Y_3) \sim \mathcal{N}(\mu_1, \Sigma_1)$, $(Y_4, Y_5) \sim \mathcal{N}(\mu_2, \Sigma_2)$, $Y_6 \sim P(\lambda)$ and $Y_7, Y_8 \sim \text{Bin}(\gamma)$.
n <- 10000 complete_df_generator <- function(n) { mu.X <- c(1, 2, 3) Sigma.X <- matrix(c( 9, 3, 2, 3, 4, 0, 2, 0, 1 ), nrow = 3) # multivariate normal distribution X.complete.cont <- MASS::mvrnorm(n, mu.X, Sigma.X) mu1.X <- c(9, 8) Sigma1.X <- matrix(c( 16, 14, 14, 25 ), nrow = 2) # multivariate normal distribution X.complete.cont1 <- MASS::mvrnorm(n, mu1.X, Sigma1.X) lambda <- 4.3 # poisson distribution X.complete.discr <- stats::rpois(n, lambda) # binomial X.complete.cat <- stats::rbinom(n, size = 5, prob = 0.4) # binomial X.complete.cat2 <- stats::rbinom(n, size = 7, prob = 0.6) X.complete <- data.frame(cbind(X.complete.cont, X.complete.cont1, X.complete.discr, ... = X.complete.cat, X.complete.cat2 )) X.complete[, 7] <- as.factor(X.complete[, 7]) levels(X.complete[, 7]) <- c("F", "E", "D", "C", "B", "A") X.complete[, 8] <- as.factor(X.complete[, 8]) colnames(X.complete) <- c("Y1", "Y2", "Y3", "Y4", "Y5", "Y6", "Y7", "Y8") return(X.complete) } X.complete <- complete_df_generator(n)
We could generate incomplete dataframe from the complete dataframe, with a chosen percentage of missing data and a chosen missing mechanism. The difference between mechanisms is explained in the documentation of generate_miss
.
miss_perc <- 0.4 rs <- generate_miss(X.complete, miss_perc, mechanism = "MNAR1") head(rs$X.incomp) head(rs$R.mask) print(rs$real_miss_perc)
Next, a list of incomplete dataframes is generated with every mechanism that is implemented in generate_miss
. First a list of incomplete dataframes is generated on the continuous variables $(Y_1, \dots, Y_5)$, with the result in rs.con
; then on the continous + discrete variables $(Y_1, \dots, Y_6)$, with the result in rs.con_dis
; and at last on continuous + discrete + categorical variables, with the result in rs.mix
.
miss_perc <- 0.3 rs.con <- generate_miss_ls(X.complete[, 1:5], miss_perc) rs.con_dis <- generate_miss_ls(X.complete[, 1:6], miss_perc) rs.mix <- generate_miss_ls(X.complete, miss_perc)
list_df.con <- list(rs.con$mcar$X.incomp, rs.con$mar1$X.incomp, rs.con$mar2$X.incomp, rs.con$mar3$X.incomp, rs.con$mnar1$X.incomp, rs.con$mnar2$X.incomp) list_df.con_dis <- list(rs.con_dis$mcar$X.incomp, rs.con_dis$mar1$X.incomp, rs.con_dis$mar2$X.incomp, rs.con_dis$mar3$X.incomp, rs.con_dis$mnar1$X.incomp, rs.con_dis$mnar2$X.incomp) list_df.mix <- list(rs.mix$mcar$X.incomp, rs.mix$mar1$X.incomp, rs.mix$mar2$X.incomp, rs.mix$mar3$X.incomp, rs.mix$mnar1$X.incomp, rs.mix$mnar2$X.incomp)
p_val <- 0.05
After the generation of incomplete datasets, we test if the missing data mechanism is MCAR by performing three classical statistic tests: the dummy variable test \code{dummy_test}, Little's MCAR test \code{mcar_test} from package \code{naniar} and the normality tests (Hawkin's test and non-parametric test included) \code{TestMCARNormality} from package \code{MissMech}. Our null hypothesis H0 is that the missing data mechanism is MCAR. As the normality test is only written for numerical data, here the test is performed on only the numerical part of the input dataframe.
t <- mcar_test_combined(list_df.mix[[5]], col_cat = c(7, 8), p_val = p_val) print(t$test_results)
# A dataframe that records the p-values for each test test_p_vals <- data.frame(matrix(rep(0, 4), nrow = 1)) colnames(test_p_vals) <- c("Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test") ls_test <- list("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2") i <- 1 t <- 1 col_cat <- c(7:8) for (ls in list(list_df.con, list_df.con_dis, list_df.mix)) { for (df in ls) { if (i == 3) { out <- suppressWarnings(mcar_test_combined(df, col_cat, p_val)) } else { out <- suppressWarnings(mcar_test_combined(df, c(), p_val)) } test_p_vals[t, ] <- out$p_values[1, ] t <- t + 1 } i <- i + 1 } test_p_vals[["type"]] <- c(rep("continuous", 6), rep("continuous+discrete", 6), rep("mix", 6)) test_p_vals[["mechanism"]] <- rep(c("MCAR", "MAR1", "MAR2", "MAR3", "MNAR1", "MNAR2"), 3) test_results <- data.frame(test_p_vals[, c(1, 2)] > p_val) test_results[["Miss_Mech_test"]] <- c((test_p_vals[["Hawkin.s.test"]] < p_val) * (test_p_vals[["Non.parametric.test"]] < p_val) == 0) test_p_vals <- test_p_vals[c("type", "mechanism", "Dummy.variable.test", "Little.s.MCAR.test", "Hawkin.s.test", "Non.parametric.test")] test_p_vals test_results
The result of MissMech test is not very accurate. This may due to the problem of implementation in MissMech package, which is not available on CRAN anymore.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.