knitr::opts_chunk$set(echo = TRUE, message=F, warning=F)
Firstly we should ensure all variables of interest are in the correct formats we need. Variables should either be factors or numerical as appropriate. We will be using the survival::colon
dataset as the basis of our example.
library(tidyverse); library(finalfit); library(survival) data <- tibble::as_tibble(survival::colon) %>% dplyr::filter(etype==2) %>% # Outcome of interest is death dplyr::filter(rx!="Obs") %>% # rx will be our binary treatment variable dplyr::select(-etype,-study, -status) %>% # Remove superfluous variables # Convert into numeric and factor variables dplyr::mutate_at(vars(obstruct, perfor, adhere, node4), function(x){factor(x, levels=c(0,1), labels = c("No", "Yes"))}) %>% dplyr::mutate(rx = factor(rx), mort365 = cut(time, breaks = c(-Inf, 365, Inf), labels = c("Yes", "No")), sex = factor(sex, levels=c(0,1), labels = c("Female", "Male")), differ = factor(differ, levels = c(1,2,3), labels = c("Well", "Moderate", "Poor")), adhere = as.character(adhere), extent = factor(extent, levels = c(1,2,3, 4), labels = c("Submucosa", "Muscle", "Serosa", "Contiguous Structures")), surg = factor(surg, levels = c(0,1), labels = c("Short", "Long"))) %>% # Logical value for outcome (for survival analysis) dplyr::mutate(status = mort365=="Yes")
Before considering imputation, we first need to determine if there is missing data, and whether the data is missing at random (MAR) or missing not at random data (MNAR). In general, imputation cannot be used when data is MNAR (e.g. there is clear bias in what data is missing).
There are excellent outlines of how to explore and determine this using the finalfit
package already here. We will assume this has been done, and the data is MAR.
Firstly, we want to remove variables we want to ensure the multiple imputation algorithm is based on the appropriate variables. You should define:
impute
(Essential): The variables you want to be imputed. There is significant debate whether outcomes should be imputed or not, however outcome is often imputed as best practice as shown below.
fixed
(Optional): Any variables you want to be present in the dataset AND to be used in the imputation algorithm, BUT to remain unimputed.
ignore
(Optional): Any variables you want to be present in the dataset, but to NOT include in the algorithm AND remain unimputed.
All other arguments from the mice::mice()
function can be used to customise imputation. For example, the number of imputations to be performed (m
).
var_dep = "mort365" var_exp = c("rx", "sex", "age", "obstruct", "nodes", "surg") var_fixed = c("perfor") var_ignore = c("adhere") impdata <- data %>% finalpsm::impute(impute = c(var_dep, var_exp), fixed = var_fixed, ignore = var_ignore, m=5)
A nested dataset is returned containing the original (unimputed) data, with each imputed dataset nested (below). Each dataset has a unique number corresponding to it (m
) and the respective patient (rowid
). Use tidyr::unnest()
to convert into a long format.
impdata %>% mutate(data = "< tibble >") %>% knitr::kable()
The finalaux::finalimp()
function works functionally the same as the finalfit::finalfit()
function. The dependent, explanatory, and random_effect variables can be defined, and the appropriate model is chosen based on these.
output <- impdata %>% finalpsm::imputemutate(nodes5 = ifelse(nodes>=5, "Yes", "No")) %>% finalpsm::finalimp(dependent = var_dep, explanatory = c(var_exp, "nodes5"))
There are 2 outputs:
$table
which provides a finalfit table ready for output which contains both the unimputed and imputed models.knitr::kable(output$table)
$model
which provides all the underlying data, models, and metrics for the original and imputed datasets.
All datasets have an extra column added (predict
) containing the predicted probability (0-1) of the outcome (based on the model).
output$model %>% dplyr::mutate(data = "tibble", model = "S3:glm") %>% knitr::kable()
However, it may be you want to pool other models that are outwith the scope of finalfit (or to not use finalfit at all!). This is made easier in the finalaux
package by how the imputed data has been structured.
Firstly run your model across all the nested datasets (original and imputed) using purrr::map.
Retain only your model fits from the output (a single model nested per row of the tibble).
Pool the imputed models using pool2df
to get a finalfit::fit2df
style table for the unimputed and imputed models.
imp_match <- impdata %>% # Run your model across every nested dataset ("data") dplyr::mutate(model = purrr::map(data, function(x){x %>% dplyr::select(-rowid) %>% finalpsm::match(strata = "rx", explanatory = c("sex", "age", "obstruct", "nodes", "surg"), dependent = c("mort365", "time"), method = "full") %>% finalpsm::finalpsm(dependent = "mort365", balance = F)})) %>% # Extract only your model fit from your output (multiple outputs from finalpsm) tidyr::unnest(model) %>% dplyr::mutate(type = purrr::map_chr(model, function(x){paste0(class(x), collapse = ", ")})) %>% dplyr::filter(type == "glmerMod") # Pool Output finalpsm::pool2df(imp_match, model = "model")$joined %>% knitr::kable()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.