```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```r
# make sure the package is installed --> devtools::install_github("hestiri/mlho")
library(mlho)
```
Install and load the required packages:
```r
if(!require(pacman)) install.packages("pacman")
pacman::p_load(data.table, devtools, backports, Hmisc, tidyr, dplyr, ggplot2,
               plyr, scales, readr, httr, DT, lubridate, tidyverse, reshape2,
               foreach, doParallel, caret, gbm, praznik)
```
Here we go through the vanilla implementation of MLHO using the synthetic data provided in the package.
The synthetic data `syntheticmass` was downloaded and prepared according to the MLHO input data model from SyntheticMass, generated by Synthea™, an open-source patient population simulation made available by The MITRE Corporation.
data("syntheticmass")
Here's how the `dbmart` table looks (there's an extra column that is the translation of the `phenx` concepts):
```r
head(dbmart)
```
Our demographic table contains: `r colnames(dems)`.
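For a quick look at the demographics table itself (a simple inspection step, assuming `dems` was loaded by `data("syntheticmass")` above):

```r
# peek at the demographic table
head(dems)
```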
```r
uniqpats <- c(as.character(unique(dbmart$patient_num)))

# using a 70-30 ratio
test_ind <- sample(uniqpats, round(.3*length(uniqpats)))

test_labels <- subset(labeldt, labeldt$patient_num %in% c(test_ind))
print("test set labels:")
table(test_labels$label)

train_labels <- subset(labeldt, !(labeldt$patient_num %in% c(test_ind)))
print("train set labels:")
table(train_labels$label)

# train and test sets
dat.train <- subset(dbmart, !(dbmart$patient_num %in% c(test_ind)))
dat.test <- subset(dbmart, dbmart$patient_num %in% c(test_ind))
```
Now, dimensionality reduction on the training set. Having split the data into train and test sets, we apply `MSMR.lite` to the training data:
```r
data.table::setDT(dat.train)
dat.train[, row := .I]
dat.train$value.var <- 1
uniqpats.train <- c(as.character(unique(dat.train$patient_num)))

# here is the application of MSMR.lite
dat.train <- MSMR.lite(MLHO.dat = dat.train,
                       patients = uniqpats.train,
                       sparsity = 0.005,
                       labels = labeldt,
                       topn = 200)
```
Notice that we are removing concepts that had a prevalence of less than 0.5% and then keeping only the top 200 after the JMI rankings. See the help page (`?mlho::MSMR.lite`) for the `MSMR.lite` parameters.
We now have the training data with the top 200 features, to which we can add demographic features. Next, we prep the test set:
```r
dat.test <- subset(dat.test, dat.test$phenx %in% colnames(dat.train))
setDT(dat.test)
dat.test[, row := .I]
dat.test$value.var <- 1
uniqpats.test <- c(as.character(unique(dat.test$patient_num)))

dat.test <- MSMR.lite(MLHO.dat = dat.test,
                      patients = uniqpats.test,
                      sparsity = NA,
                      jmi = FALSE,
                      labels = labeldt)
```
Notice that here we only used `MSMR.lite` to generate the wide table that matches the `dat.train` table.
We will use the `mlearn` function to do the modeling, which includes training the model and testing it on the test set:
```r
model.test <- mlearn(dat.train,
                     dat.test,
                     dems = dems,
                     save.model = FALSE,
                     classifier = "gbm",
                     note = "mlho_test_run",
                     cv = "cv",
                     nfold = 5,
                     aoi = "prediabetes",
                     multicore = FALSE)
```
Try `?mlho::mlearn` to learn more about the `mlearn` parameters.
You can select from a variety of models. For binary classification tasks, choose from `ORFlog`, `RRF`, `gbm`, `bayesglm`, `regLogistic`, `xgbDART`, `glmnet`, `nb`, `svmRadialWeights`, `avNNet`, and `ordinalRF`. See the list and description of all available models in the caret documentation (https://topepo.github.io/caret/available-models.html).
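For example, a minimal sketch of swapping in a different classifier, keeping every other argument from the call above unchanged (`note` is just a free-text label of our choosing):

```r
# same pipeline as above, but with regularized logistic regression via glmnet
model.glmnet <- mlearn(dat.train,
                       dat.test,
                       dems = dems,
                       save.model = FALSE,
                       classifier = "glmnet",
                       note = "mlho_glmnet_run",
                       cv = "cv",
                       nfold = 5,
                       aoi = "prediabetes",
                       multicore = FALSE)
```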
Outputs of `mlearn` include:

- `model.test$ROC` (e.g., the ROC object is available as `model.test$ROC$roc`)
- `model.test$features`
- `model.test$AE`
- `model.test$missing.features`
For instance, only the following `r nrow(model.test$features)` features were used in phenotyping prediabetes:
```r
model.test$features
```
We recommend iterating the training and testing and storing the coefficients in a directory, as sketched below.
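A minimal sketch of such an iteration loop, reusing the steps from above; the `mlho_iterations/` directory and the file naming are our own choices for illustration, not a package convention:

```r
# a sketch, assuming dbmart, labeldt, and dems are loaded as above
out.dir <- "mlho_iterations"               # hypothetical output directory
dir.create(out.dir, showWarnings = FALSE)
uniqpats <- c(as.character(unique(dbmart$patient_num)))

for (i in 1:5) {
  # a fresh random 70-30 split in each iteration
  test_ind  <- sample(uniqpats, round(.3 * length(uniqpats)))
  dat.train <- subset(dbmart, !(dbmart$patient_num %in% c(test_ind)))
  dat.test  <- subset(dbmart, dbmart$patient_num %in% c(test_ind))

  # dimensionality reduction on the training set
  data.table::setDT(dat.train)
  dat.train[, row := .I]
  dat.train$value.var <- 1
  dat.train <- MSMR.lite(MLHO.dat = dat.train,
                         patients = c(as.character(unique(dat.train$patient_num))),
                         sparsity = 0.005, labels = labeldt, topn = 200)

  # matching wide table for the test set
  dat.test <- subset(dat.test, dat.test$phenx %in% colnames(dat.train))
  data.table::setDT(dat.test)
  dat.test[, row := .I]
  dat.test$value.var <- 1
  dat.test <- MSMR.lite(MLHO.dat = dat.test,
                        patients = c(as.character(unique(dat.test$patient_num))),
                        sparsity = NA, jmi = FALSE, labels = labeldt)

  # train/test and store the selected features for this iteration
  model.i <- mlearn(dat.train, dat.test, dems = dems, save.model = FALSE,
                    classifier = "gbm", note = paste0("iter_", i),
                    cv = "cv", nfold = 5, aoi = "prediabetes", multicore = FALSE)
  write.csv(model.i$features,
            file.path(out.dir, paste0("features_iter_", i, ".csv")),
            row.names = FALSE)
}
```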
Here we create a plot of the feature importance scores for each of the top predictors (here we have `r nrow(model.test$coefficients)` of them) identified by MLHO.
To do so, let's map the concept codes to their "English" translation. That's why we kept that 4th column, `DESCRIPTION`, in `dbmart`.
```r
dbmart.concepts <- dbmart[!duplicated(paste0(dbmart$phenx)), c("phenx","DESCRIPTION")]
mlho.features <- data.frame(merge(model.test$features, dbmart.concepts,
                                  by.x = "features", by.y = "phenx"))
datatable(dplyr::select(mlho.features, features, DESCRIPTION,
                        `Feature importance` = Overall),
          options = list(pageLength = 5), filter = 'bottom')
```
Now, visualizing feature importance:
```r
(plot <- ggplot(mlho.features) +
    geom_segment(aes(y = 0,
                     x = reorder(DESCRIPTION, Overall),
                     yend = Overall,
                     xend = DESCRIPTION),
                 size = 0.5, alpha = 0.5) +
    geom_point(aes(x = reorder(DESCRIPTION, Overall), y = Overall),
               alpha = 0.5, size = 2, color = "red") +
    theme_minimal() +
    coord_flip() +
    labs(y = "Feature importance", x = ""))
```
As we recommended above, it's best to iterate the training process at least 5 times, in which case you'll have 5 files containing the important features from MLHO. We do this regularly. If you do so, you can compute the mean/median importance values for each feature, as in the sketch below.
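A sketch of that aggregation, assuming the per-iteration feature files were written as in the hypothetical loop above (columns `features` and `Overall` as in `model.test$features`):

```r
# read the per-iteration feature files and summarize importance per feature
files <- list.files("mlho_iterations", pattern = "^features_iter_.*\\.csv$",
                    full.names = TRUE)
all.iters <- do.call(rbind, lapply(files, read.csv))

feature.summary <- all.iters %>%
  dplyr::group_by(features) %>%
  dplyr::summarise(n.iterations      = dplyr::n(),
                   mean.importance   = mean(Overall),
                   median.importance = median(Overall)) %>%
  dplyr::arrange(dplyr::desc(n.iterations), dplyr::desc(mean.importance))

head(feature.summary)
```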
Now if you want to get the regression coefficients and compute odds ratios (ORs), you can select the top features that were picked most frequently across the MLHO iterations, run a regression model (pass `glm` to `mlearn`), and extract the regression coefficients to then transform into ORs.