knitr::opts_chunk$set(echo = TRUE)
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
library(tidyverse) library(rms) library(mice) library(ggRandomForests) library(survival) library(missRanger) library(survminer) library(glmnet) library(pROC) library(rfSLAM) library(grid) library(gridExtra) library(ggsci) library(pROC) library(splines) library(viridis) library(rpart) library(compareGroups) library(tidyverse) library(lubridate) library(kableExtra) library(rpart.plot) setwd("~/workspace/SAFE/code/SLAM_ML/dshenker/RFSLAM_PACKAGE/RFSLAM") devtools::load_all()
Load in our example dataset which is in the correct final format
df <- readRDS("~/workspace/SAFE/code/SLAM_ML/dshenker/sample_modeling_df/rf_df_1_withvariantsplusimputed.rda") drop <- c("rt_1", "rt_7", "rt_7p", "i.sev_died.7") analysis_vec <- readRDS("~/workspace/SAFE/data/curated_data/SLAM_ML/for_templates/analysis_vector_Sep_22.rds") analysis_vars <- c(analysis_vec, "dominant_clade", "vaccinated", "remdesivir", "remdesivir_24", "remdesivir_since_admit", "systemic_steroid", "systemic_steroid_24", "systemic_steroid_since_admit", "tocilizumab", "tocilizumab_admit", "variant") all_vars_for_model <- c(analysis_vars, "rt_1", "rt_7", "rt_1p", "rt_7p", "i.sev_died.1", "i.sev_died.7", "pid", "int.n") var_key <- read_csv("~/workspace/SAFE/code/SLAM_ML/dshenker/var_key.csv") #example variable key
The initial step before using RF-SLAM once we have our data ready is to create our “risk time” information. These risk times tell RF-SLaM for what proportion of the interval of interest is the patient actually at risk. Here is a guide to the function calls: - First variable is the data frame being operated on. - Second variable is the column containing the time of the FINAL available data for the patient (could be the time of treatment end, death, loss to followup, etc.). - The third variable is the column name for where the currrent time is stored. - Fourth variable is the length of the time intervals. - Fifth variable is the value you want to normalize the times by. - Finally, the last variable is what you would like the new column to be named. Note that there will also be a second new column that is the adjusted version.
df <- as.data.frame(calc_risk_times(df, "t.sd", "q6", 6, 24, "rt_1")) %>% select(all_vars_for_model[all_vars_for_model %in% names(.)]) %>% mutate_if(sapply(.,is.character), as.factor)
We need a variable to use to split for the cross validation folds. Here, we will use a binary variable which tells us if a patient was EVER in a severe condition. Let’s create that column here.
severe_ever <- df %>% group_by(pid) %>% slice(which.max(i.sev_died.1)) %>% select(pid, i.sev_died.1) colnames(severe_ever) <- c("pid", "sev_ever") df <- df %>% left_join(severe_ever, by = "pid")
The bulk of the RF-SLAM code is strategically hidden in the “rf_functions.R” file. The user of the package themselves must make the following function calls. The purpose of each one is described here briefly: - create_model: runs the RFSLAM algorithm on the given dataframe - feature_importance_plot: plots the variable importance in the RF-SLAM model along with the variable information provided in the corresponding key - calibrate.model: calibrates the model predictions to align with the dataset - show_aucs: calculate AUC values at pre-specified time values (currently these are preset in “rf_functions” but will be soon customizable) - analysis_plots: create the calibration plots and rpart visualizations of the model - save_risk_vals: save the dataframe along with the corresponding predictions to an RDS file. ## Run RF-SLAM and the corresponding analysis
df$i.sev_died.1_final <- relevel(df$i.sev_died.1, "0") df$i.sev_died.1 <- NULL target <- "i.sev_died.1_final" drop <- c(drop, "i.sev_died.1", "sev_ever") best_params <- tune_rf_params(df, target, id_col = "pid", risk_time_col = "rt_1p", patient_count_col = "int.n", time_col = "q6", drop, ntree_trys = c(100, 200), nodedepth_trys = c("NULL", 3), nsplit_trys = c(5, 10), n.folds = 3, folds_stratifier = "sev_ever")
We can now look at the table saved in best_params to determine what parameters we’d like to use for modeling.
best_params
We can see that the maximum average AUC across folds comes wen we use 100 trees, a nodedepth of 3, and nsplit equal to 5. Thus, let’s train our final model using those parameter settings.
final_model <- create_model(df[,!(names(df) %in% drop[drop != sym(target)])], target, id_col = "pid", risk_time_col = "rt_1p", patient_count_col = "int.n", "q6", ntree = 100, nodedepth = 3, nsplit = 5) mymodel.1.full_1day <- final_model$model p.cpiu.be.ppl_1day <- final_model$preds feature_importance_plot(mymodel.1.full_1day, var_key) #plot feature importance df$p.hat <- calibrate.model(p.cpiu.be.ppl_1day, df, target, "q6") analysis_plots(df, "i.sev_died.1_final", "pid", "p.hat", "q6", analysis_vars)
Finally we can save the risk values.
save_risk_vals(df, "~/workspace/SAFE/code/SLAM_ML/dshenker/Risk_DFs_ForTrees/df_withall_generalframeworkexperiment.Rds")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.