knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  dpi = 150,
  fig.width = 6,
  fig.height = 4.5
)

library(smdi)
library(gt)
suppressPackageStartupMessages(library(dplyr))
smdi
In this article, we briefly highlight two aspects of multivariate missingness:
How does smdi handle multivariate missingness?
What is the link between missing data patterns and missing data mechanisms, and how does this affect the behavior and performance of the smdi functionality?
In general, there are two classic established missing data taxonomies:
Mechanisms: Missing completely at random (MCAR), at random (MAR) and not at random (MNAR)
Patterns: Monotone versus Non-monotone missingness
How does smdi handle multivariate missingness?

In all smdi functions, except smdi_little(), binary missing indicator variables are created for each partially observed variable, and the columns with the actual variable values are dropped. The partially observed variables are either specified by the analyst using the covar parameter or, if covar = NULL, identified automatically via smdi_check_covar(). Missing values are indicated with a 1 and complete observations with a 0. In the variable importance visualization of smdi_rf(), the indicator variables carry a "_NA" suffix. This functionality is controlled via the smdi_na_indicator() utility function.
smdi_data %>%
  smdi_na_indicator(
    drop_NA_col = FALSE # usually TRUE, but for demonstration purposes set to FALSE
  ) %>%
  select(
    ecog_cat, ecog_cat_NA,
    egfr_cat, egfr_cat_NA,
    pdl1_num, pdl1_num_NA
  ) %>%
  head() %>%
  gt()
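To make the indicator logic concrete without relying on the package, the following base R lines are a minimal sketch of the same idea. The toy data frame and its column names are made up for illustration, and the code is not the implementation used inside smdi_na_indicator().

```r
# Toy data with two partially observed covariates (hypothetical names)
toy <- data.frame(
  age = c(54, 61, 47, 70, 66),
  x   = c(1.2, NA, 0.7, NA, 2.1),
  y   = c("a", "b", NA, "a", "b")
)

# Columns with at least one missing value
partial <- names(toy)[colSums(is.na(toy)) > 0]

# One 0/1 indicator per partially observed column: 1 = missing, 0 = observed
indicators <- as.data.frame(
  lapply(toy[partial], function(col) as.integer(is.na(col)))
)
names(indicators) <- paste0(partial, "_NA")

# Drop the original columns and keep only their indicators
toy_ind <- cbind(toy[setdiff(names(toy), partial)], indicators)
toy_ind
```

The result keeps the fully observed column age and replaces x and y with x_NA and y_NA, which is the shape of the data that the diagnostics then work on.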
Now, let's assume we have three partially observed covariates X, Y and Z, which we would like to include in our missingness diagnostics. All smdi diagnostic functions, except smdi_little(), create X_NA, Y_NA and Z_NA and discard X, Y and Z from the dataset. The functions then iterate the diagnostics through X_NA, Y_NA and Z_NA one by one. For example, when X_NA is assessed, Y_NA and Z_NA serve as predictor variables along with all other covariates in the dataset; when Y_NA is assessed, X_NA and Z_NA are included as predictor variables, and so forth.
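The one-by-one iteration can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, and a plain logistic regression stands in for the actual smdi diagnostics, which are not reproduced here.

```r
set.seed(1)
n <- 200

# Simulated dataset in which X, Y and Z have already been replaced by indicators
dat <- data.frame(
  age  = rnorm(n, mean = 60, sd = 10),
  X_NA = rbinom(n, 1, 0.30),
  Y_NA = rbinom(n, 1, 0.20),
  Z_NA = rbinom(n, 1, 0.25)
)
indicators <- c("X_NA", "Y_NA", "Z_NA")

# Assess each indicator in turn; all remaining columns, including the
# other two indicators, enter as predictors
fits <- lapply(indicators, function(target) {
  predictors <- setdiff(names(dat), target)
  glm(reformulate(predictors, response = target),
      data = dat, family = binomial())
})
names(fits) <- indicators

# Predictors used when X_NA is the target
attr(terms(fits[["X_NA"]]), "term.labels")
```

Each fitted model has a different indicator as its response, while the other two indicators always appear on the right-hand side.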
smdi in case of monotone missing data patterns

While the smdi package mainly focuses on missing data mechanisms, missing data patterns always need to be considered, too. Please refer to the routine structural missing data diagnostics article, where we highlight the importance of describing missingness proportions and patterns before running any of the smdi diagnostics.
Because the missing indicators of all other partially observed covariates serve as predictors (see the section before), the smdi functionality may be misleading in case of monotone missing data patterns[^1]: indicators of variables that are missing together predict each other almost perfectly.
[^1]: For more information on missing data patterns see https://stefvanbuuren.name/fimd/missing-data-pattern.html
A good example of monotone missing data is clinical blood laboratory tests ("labs"), which are often tested together in a lab panel. If one lab is missing, the other labs of this panel are typically missing as well.
# we simulate a monotone missingness pattern
# following an MCAR mechanism
set.seed(42)

data_monotone <- smdi_data_complete %>%
  mutate(
    lab1 = rnorm(nrow(smdi_data_complete), mean = 5, sd = 0.5),
    lab2 = rnorm(nrow(smdi_data_complete), mean = 10, sd = 2.25)
  )

data_monotone[3:503, "lab1"] <- NA
data_monotone[1:500, "lab2"] <- NA
smdi::gg_miss_upset(data = data_monotone)
smdi::md.pattern(data_monotone[, c("lab1", "lab2")], plot = FALSE)
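How strongly the two labs are missing together can also be checked by hand. The sketch below rebuilds only the two missing indicators from the row ranges used in the simulation. The number of rows, n = 2500, is an assumption about the size of smdi_data_complete; in practice nrow(data_monotone) would be used instead.

```r
# Assumed number of rows (replace with nrow(data_monotone) in practice)
n <- 2500

# Missing indicators implied by the simulated row ranges
lab1_NA <- integer(n)
lab1_NA[3:503] <- 1L
lab2_NA <- integer(n)
lab2_NA[1:500] <- 1L

# Cross-tabulation of the two indicators: most missing rows are shared
co_missing <- table(lab1_NA, lab2_NA)
co_missing

# Correlation between the two indicators
cor(lab1_NA, lab2_NA)
```

Almost all rows with a missing lab1 also have a missing lab2, so each indicator is a near-perfect predictor of the other.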
In extreme cases, in which the missing indicators are perfectly linearly dependent, a monotone pattern can lead to multiple warnings and errors such as "system is exactly singular" or "-InfWarning: Variable has only NA's in at least one stratum".
When monotonicity is clearly present but less extreme (as in the example above), smdi prints a message to alert the analyst, because the smdi output can be highly misleading in such instances.
diagnostics_jointly <- smdi_diagnose(
  data = data_monotone,
  covar = NULL, # NULL includes all covariates with at least one NA
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
)
diagnostics_jointly %>% smdi_style_gt()
In such cases, it may be advisable to exclude lab2 from the missingness diagnostics of lab1 and vice versa, run the diagnostics for one partially observed variable at a time, and stack the results.
# lab 1
lab1_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab2),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
)

lab1_diagnostics %>%
  smdi_style_gt()
# lab 2
lab2_diagnostics <- smdi_diagnose(
  data = data_monotone %>% select(-lab1),
  model = "cox",
  form_lhs = "Surv(eventtime, status)"
)

lab2_diagnostics %>%
  smdi_style_gt()
smdi_style_gt()
We can also combine the output of the individually stacked smdi_diagnose() tables and enhance it with a global Little's test that takes into account the multivariate missingness of the entire dataset.
# compute a global p-value for Little's test including both lab1 and lab2
little_global <- smdi_little(data = data_monotone)

# combine the two individual lab smdi tables and the global Little's test
smdi_style_gt(
  smdi_object = rbind(lab1_diagnostics$smdi_tbl, lab2_diagnostics$smdi_tbl),
  include_little = little_global
)
Since the missingness follows an MCAR mechanism, smdi_diagnose() now shows the diagnostic patterns one would expect from an MCAR mechanism.