```{=html}
```r knitr::opts_chunk$set( collapse = TRUE, comment = "#>", dpi = 150, fig.width = 6, fig.height = 4.5 )
library(smdi) library(gt) library(dplyr) library(here) library(knitr)
smdi
main functionalitiesThe smdi
flagship function is smdi_diagnose()
which calls multiple sub-functions which are also accessible separately. This article aims to give an introduction to missing covariate data diagnostics using the individual smdi
diagnostics function that all funnel into the main function smdi_diagnose()
.
::: {.alert .alert-info} What is smdi about? Large-scale simulations revealed characteristic patterns of diagnostic parameters matched to common missing data structures based on three group diagnostics: :::
include_graphics(here("vignettes", "smdi_diagnose_table.png"))
::: {.alert .alert-info} How can this be applied to inform a real-world database study? The observed diagnostic patterns of a specific study will give insights into the likelihood of underlying missingness structures. This package enables researchers to create these three group diagnostics for their own healthcare database analytics with little effort. This is how an example could look like in a real-world database study: :::
include_graphics(here("vignettes", "smdi_examples.png"))
To illustrate the usage of the smdi
package main functions, we use the smdi_data
dataset which is an example dataset that comes bundled with the package and includes some ready to use simulated partially observed covariates. If you prefer to simulate missingness yourself, you can do so using the smdi_data_complete
dataset.
In brief, the smdi_data
dataset consists of a simulated lung cancer cohort with a fictional comparison of two antineoplastic systemic therapy regimens and a time-to-event outcome. More information on the underlying dataset is given in the previous Data generation article.
smdi_data %>% glimpse()
The dataset consists of r formatC(nrow(smdi_data), format = "fg", big.mark = ",")
patients and r ncol(smdi_data)
variables with exposure
representing the two treatment regimens under comparison and status
and eventtime
the vital status and censoring time, respectively. For more information, please checkout:
# dataset with simulated missingness ?smdi::smdi_data() # complete dataset ?smdi::smdi_data_complete()
As with basically any first step into exploring (new) datasets, it's a good idea to get an overview of partially observed covariates and the magnitude of missingness. For this, smdi
comes with two convenient functions to screen the data for missingness.
This can be either as a table ...
smdi_data %>% smdi_summarize()
... or visually
covars_missing <- smdi_summarize(data = smdi_data) %>% pull(covariate) smdi_data %>% smdi_vis(covar = covars_missing)
The plot also provides flexibility to stratify, e.g. by exposure
.
smdi_data %>% smdi_vis(covar = covars_missing, strata = "exposure")
Besides describing the proportion missingness by covariate (and potentially stratified by another variable), it is also of great importance to check the missingness patterns. This can give important hints if the missingness in two or more partially observed covariates may follow a monotone or non-monotone missingness pattern. If former is the case, one should be careful and potentially run the below-described smdi
diagnostics variable-by-variable instead of jointly.
We recommend checking both missingness proportions and patterns as a first step. In case of monotonocity, the
smdi
package may likely culminate in misleading results. Please check the article on multivariate missingness and monotonicity.
To check missingness patterns, the smdi
comes with re-exports of the naniar::gg_miss_upset
and mice::md.pattern
functions.
smdi::gg_miss_upset(data = smdi_data)
The upset plot and the pattern matrix show that only a small fraction of observations (N = 97) is missing in all the partially observed covariates at the same. A non-monotone missingness pattern is consequently more likely than a monotone one.
smdi::md.pattern(smdi_data[, c(covars_missing)], plot = FALSE)
Generally, the dataframe(s) used for smdi
(defined as the data
parameter in all functions) should consist of the partially observed covariates, other relevant observed covariates as well as the the exposure and outcome variables. Avoid having meta-variables such as e.g. ID variables, dates and zip codes in your dataframe as all present columns in your dataframe will be used to make inferences about potential missing data mechanisms.
As discussed in the documentation of the smdi_asmd
and smdi_hotelling
, the median/average standardized mean difference (asmd) may be one indicator as to how much patient characteristics differ between patients with and without an observed value for a partially observed covariate. If the median/average asmd is above a certain threshold this may indicate imbalance in patient covariate distributions which may be indicative of the partially observed covariate following a missing at random (MAR) mechanims, i.e., the missingness is explainable by other observed covariates. Similarly, no imbalance between observed covariates may be indicative that missingness cannot be explained with observed covariates and the underlying missingness mechanism may be completely at random (MCAR) or not at random (e.g., missingness is only associated with unobserved factors or through the partially observed covariate itself).
To get an idea about the asmd for our partially observed covariates we can run the smdi_asmd
function. We are ok with the default parameters (i.e., we let the function investigate all covariates with at least one NA [covar = NULL], compute the median asmd [median = TRUE] and don't explicitly model the missingness of other partially observed covariates [includeNA = FALSE]).
asmd <- smdi_asmd(data = smdi_data)
The output returns an asmd object that contains a lot of information in the following structure that can be accessed using the "\$" operator:
Here is an example of egfr_cat
:
asmd$egfr_cat$asmd_table1
asmd$egfr_cat$asmd_plot
asmd$egfr_cat$asmd_aggregate
To limit the output, we can use the generic print
or summary
output of the object which returns a summary table of the aggregate median/average asmd per covariate.
summary(asmd)
The smdi_hotelling
function follows the same logic, but is a formal hypothesis test for the difference in covariate distributions based on a multivariate student t-test.[@hotelling1992] The output (a hotteling object) follows the same structure as above. It's important to remember that the power of statistical hypothesis tests can be influenced by sample size, so the combined investigation along with smdi_asmd()
is highly recommended.
h0 <- smdi_hotelling(data = smdi_data) h0
More details can be accessed for each covariate.
h0$ecog_cat
Little proposes a single global test statistic for MCAR that uses all of the available data.[@little1988a] Hence, the smdi_little
does not return one test statistic per partially observed covariate but one globally for the entire dataset.
h0_global <- smdi_little(data = smdi_data) h0_global
CAVE: Hotelling's and Little's show high susceptibility with large sample sizes and it is recommended to always interpret the results along with the other diagnostics.
Since MAR mechanisms are defined as the missingness being a function of observed covariates, MAR may be predictable and evaluable by fitting a classification model, e.g., a random forest model, which would yield moderate to high area under the curve (AUC) values in the case of MAR.
The smdi_rf
function trains and fits a random forest model to assess the ability to predict missingness for the specified covariate(s). If the missing indicator for this covariate can be predicted as a function of observed covariates, a MAR missingness mechanism may be likely.
CAVE: Depending on the amount of data (sample size x covariates), the computation of the function can take some minutes.
The structure of the output again follows the general schema.
auc <- smdi_rf(data = smdi_data) auc$ecog_cat$rf_table auc$ecog_cat$rf_plot
CAVE: If the missingness indicator variables of other partially observed covariates (indicated by suffix _NA) have an extremely high variable importance (combined with an unusually high AUC), this might be an indicator of a monotone missing data pattern. That is, the missingness in one covariate is highly predictive of the missingness of another partially observed covariate. In this case it is advisable to exclude other partially observed covariates and run missingness diagnostics separately. This can be checked, e.g. with the
mice::md.pattern()
function (mice
package).
The group 3 diagnostic focuses on assessing the association between the missing indicator of the partially observed covar
and the outcome under study. This may reveal important covariate relationships with the outcome and could give additional pieces of information.
Currently, all main types of outcome regressions are supported, namely glm, linear (lm) and Cox proportional hazards (survival) regression models are supported and need to be specified using the model and form_lhs parameters. If glm is specified, any glm_family regression type can be used which includes binomial (default), gaussian, Gamma, inverse.gaussian, poisson, quasi, quasibinomial and quasipoisson (for more information see ?stats::family
). Further details on the smdi_outcome
function can be accessed via
?smdi::smdi_outcome()
The output of smdi_outcome
returns a table of the univariate and adjusted beta coefficients and 95% confidence intervals for all covar
.
smdi_outcome( data = smdi_data, model = "cox", form_lhs = "Surv(eventtime, status)" )
smdi_diagnose()
- one function to rule them allFinally, all of the functions above funnel into smdi_diagnose()
which outputs an smdi object including a summary table of all three smdi group diagnostics and little's global p-value statistic. If details of the individual functions above are needed, this method may not be preferable but is a convenient way to implement routine structural missing covariate diagnostics by calling a single function.
Note that all parameters of the individual functions that make up smdi_diagnose()
can be specified and will be passed on, but only the required parameter must be specified. A most minimal example could look like this.
diagnostics <- smdi_diagnose( data = smdi_data, covar = NULL, # NULL includes all covariates with at least one NA model = "cox", form_lhs = "Surv(eventtime, status)" )
The output returns two parts: the smdi_tbl
which can be called using the $
operator and looks like this
diagnostics$smdi_tbl
... and the p-value of the global Little's test statistic:
diagnostics$p_little
gt
-style tableFor a nicely formatted and publication-ready output we can subsequently use the smdi_diagnose
output and feed it into the smdi_gt_style()
function:
library(gt) smdi_style_gt(diagnostics)
To make this even more convenient, gt
offers a functionality to export the table in different formats, e.g. .docx
, .png
, .pdf
or .rtf
:
gtsave( data = smdi_style_gt(diagnostics), filename = "smdi_table.docx", # name of the final .docx file path = "." # path where the file should be stored )
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.