Firstly, we should ensure all variables of interest are in the correct
format. Variables should be either factors or numeric, as appropriate.
We will use the survival::colon dataset as the basis of our example.
library(tidyverse)
library(finalfit)
library(survival)

data <- tibble::as_tibble(survival::colon) %>%
  dplyr::filter(etype==2) %>% # Outcome of interest is death
  dplyr::filter(rx!="Obs") %>% # rx will be our binary treatment variable
  dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables
  # Convert into numeric and factor variables
  dplyr::mutate_at(vars(obstruct, perfor, adhere, node4),
                   function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>%
  dplyr::mutate(rx = factor(rx),
                mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")),
                sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")),
                differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")),
                adhere = as.character(adhere),
                extent = factor(extent, levels = c(1,2,3, 4),
                                labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")),
                surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>%
  # Logical value for outcome (for survival analysis)
  dplyr::mutate(status = mort365=="Yes")
Before considering imputation, we first need to determine whether there is missing data, and whether it is missing at random (MAR) or missing not at random (MNAR). In general, imputation should not be used when data is MNAR (e.g. there is clear bias in which data is missing).
The finalfit package documentation already provides an excellent outline
of how to explore missing data and assess this.
We will assume this has been done, and that the data is MAR.
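As a quick first check before that fuller exploration, the number of missing values per variable can be counted in base R. This is a minimal sketch run on the raw survival::colon data rather than the prepared data object above; the names colon_death, n_missing and pct_missing are illustrative only.

```r
library(survival)

# Raw colon data restricted to the death endpoint (one row per patient)
colon_death <- subset(survival::colon, etype == 2)

# Number and percentage of missing values in each variable
n_missing   <- colSums(is.na(colon_death))
pct_missing <- round(100 * n_missing / nrow(colon_death), 1)

# Show only the variables that have any missing data
data.frame(n_missing, pct_missing)[n_missing > 0, ]
```

Counts alone cannot distinguish MAR from MNAR; they only show which variables need a closer look.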
Firstly, we want to ensure the multiple imputation algorithm is based on the appropriate variables. You should define:
- impute (Essential): The variables you want to be imputed. There is
  significant debate over whether outcomes should be imputed; however,
  the outcome is often imputed as best practice, as shown below.
- fixed (Optional): Any variables you want to be present in the dataset
  AND used in the imputation algorithm, BUT to remain unimputed.
- ignore (Optional): Any variables you want to be present in the
  dataset, but NOT included in the algorithm AND to remain unimputed.

All other arguments from the mice::mice() function can be used to
customise imputation, for example the number of imputations to be
performed (m).
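To make the three roles concrete, the sketch below reproduces them with mice::mice() directly, through its method and predictorMatrix arguments. It illustrates the idea under stated assumptions and is not necessarily how finalpsm::impute() is implemented; the object names (df, ini, meth, pred, imp) are illustrative only.

```r
library(mice)
library(survival)

# Small subset of the colon data; nodes and differ are the variables
# with missing values here
df <- subset(survival::colon, etype == 2,
             select = c(age, sex, nodes, differ, perfor, adhere))

# Dry run (maxit = 0) to obtain the default methods and predictor matrix
ini  <- mice::mice(df, maxit = 0, printFlag = FALSE)
meth <- ini$method
pred <- ini$predictorMatrix

# "fixed": perfor stays unimputed but remains a predictor for other variables
meth["perfor"] <- ""

# "ignore": adhere stays unimputed AND is not used to predict anything
# (columns of the predictor matrix are the predictors)
meth["adhere"]   <- ""
pred[, "adhere"] <- 0

# m sets the number of imputed datasets; seed makes the run reproducible
imp <- mice::mice(df, m = 2, method = meth, predictorMatrix = pred,
                  printFlag = FALSE, seed = 1)
```

The "impute" role is simply every variable left with a non-empty method, which mice assigns by default to any column containing missing values.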
var_dep = "mort365"
var_exp = c("rx", "sex", "age", "obstruct", "nodes", "surg")
var_fixed = c("perfor")
var_ignore = c("adhere")

impdata <- data %>%
  finalpsm::impute(impute = c(var_dep, var_exp),
                   fixed = var_fixed,
                   ignore = var_ignore,
                   m = 5)
A nested dataset is returned containing the original (unimputed) data
alongside each imputed dataset. Each dataset is identified by a unique
number (m), and each patient within it by a row identifier (rowid).
Use tidyr::unnest() to convert this into a long format.
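The conversion to long format can be sketched on a toy nested tibble. The structure below is an assumption made for illustration (a list-column called data, with m = 0 standing in for the original data); the real output of finalpsm::impute() may differ in its column names and contents.

```r
library(tibble)
library(tidyr)

# Hypothetical nested structure: one row per dataset, identified by m
nested <- tibble::tibble(
  m = 0:2,
  data = list(
    tibble::tibble(rowid = 1:3, nodes = c(2, NA, 5)), # unimputed (assumed m = 0)
    tibble::tibble(rowid = 1:3, nodes = c(2, 3, 5)),  # imputed dataset 1
    tibble::tibble(rowid = 1:3, nodes = c(2, 4, 5))   # imputed dataset 2
  )
)

# Long format: one row per patient per dataset
long <- tidyr::unnest(nested, cols = data)
```

In the long format, m and rowid together identify a single patient within a single dataset.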
The finalpsm::finalimp() function works in the same way as the
finalfit::finalfit() function. The dependent, explanatory, and
random_effect variables can be defined, and the appropriate model is
chosen based on these.
output <- impdata %>%
  finalpsm::imputemutate(nodes5 = ifelse(nodes>=5, "Yes", "No")) %>%
  finalpsm::finalimp(dependent = var_dep, explanatory = c(var_exp, "nodes5"))
There are 2 outputs:

- $table, which provides a finalfit table ready for output, containing
  both the unimputed and imputed models.
- $model, which provides all the underlying data, models, and metrics
  for the original and imputed datasets.

All datasets have an extra column added (predict) containing the
predicted probability (0-1) of the outcome (based on the model).
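For orientation, a predicted probability of this kind can be obtained from a logistic regression with predict(type = "response"). The sketch below is a stand-alone illustration on the raw colon data, not the package's internal code; the outcome (status, death during follow-up) and the covariates are chosen only for the example.

```r
library(survival)

# Raw colon data restricted to the death endpoint
df <- subset(survival::colon, etype == 2)

# Logistic regression; na.exclude keeps predictions aligned with the rows of df
fit <- glm(status ~ age + sex + nodes, family = binomial,
           data = df, na.action = na.exclude)

# Predicted probability (0-1) of the outcome for each patient;
# rows with a missing covariate receive NA
df$predict <- predict(fit, type = "response")
```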
However, you may want to pool other models that are outwith the scope of
finalfit (or not use finalfit at all!). This is made easier in the
finalpsm package by how the imputed data has been structured.
1. Run your model across all the nested datasets (original and imputed) using purrr::map().
2. Retain only your model fits from the output (a single model nested per row of the tibble).
3. Pool the imputed models using pool2df() to get a finalfit::fit2df() style table for the unimputed and imputed models.
imp_match <- impdata %>%
  # Run your model across every nested dataset ("data")
  dplyr::mutate(model = purrr::map(data, function(x){
    x %>%
      dplyr::select(-rowid) %>%
      finalpsm::match(strata = "rx",
                      explanatory = c("sex", "age", "obstruct", "nodes", "surg"),
                      dependent = c("mort365", "time"),
                      method = "full") %>%
      finalpsm::finalpsm(dependent = "mort365", balance = FALSE)})) %>%
  # Extract only your model fit from your output (multiple outputs from finalpsm)
  tidyr::unnest(model) %>%
  dplyr::mutate(type = purrr::map_chr(model, function(x){paste0(class(x), collapse = ", ")})) %>%
  dplyr::filter(type == "glmerMod")

# Pool Output
finalpsm::pool2df(imp_match, model = "model")$joined %>% knitr::kable()
| explanatory | OR_multi_unimp            | OR_multi_imp             |
|:------------|:--------------------------|:-------------------------|
| rxLev+5FU   | 0.95 (0.51-1.78, p=0.872) | 0.90 (0.46-1.76 p=0.755) |
| sexMale     | 0.78 (0.42-1.48, p=0.451) | 0.76 (0.37-1.55 p=0.443) |
| age         | 0.98 (0.95-1.00, p=0.090) | 0.97 (0.94-1.01 p=0.148) |
| obstructYes | 0.35 (0.17-0.71, p=0.003) | 0.40 (0.16-0.99 p=0.047) |
| nodes       | 0.86 (0.79-0.93, p<0.001) | 0.86 (0.79-0.94 p=0.001) |
| surgLong    | 0.50 (0.26-0.98, p=0.045) | 0.72 (0.35-1.47 p=0.361) |
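For comparison, the same run-then-pool pattern can be sketched with mice alone, which combines the per-dataset estimates using Rubin's rules. This is a simplified illustration with a plain logistic regression on the raw colon data; it does not reproduce the matched models or the figures in the table above, and the object names are illustrative only.

```r
library(mice)
library(survival)
library(purrr)

# Raw colon data restricted to the death endpoint; nodes has missing values
df <- subset(survival::colon, etype == 2,
             select = c(status, age, sex, nodes, obstruct))

# Create 3 imputed datasets (kept small so the example runs quickly)
imp <- mice::mice(df, m = 3, printFlag = FALSE, seed = 1)

# 1. Run the model across every completed dataset
completed <- mice::complete(imp, action = "all")
fits <- purrr::map(completed, function(x) {
  glm(status ~ age + sex + nodes + obstruct, family = binomial, data = x)
})

# 2. Pool the fits using Rubin's rules
pooled <- mice::pool(mice::as.mira(fits))

# 3. Pooled odds ratios with confidence intervals
summary(pooled, conf.int = TRUE, exponentiate = TRUE)
```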