In hestiri/mlho: Machine Learning for Health Outcomes (MLHO)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# make sure the package is installed --> devtools::install_github("hestiri/mlho")
library(mlho)

Install and load the required packages:

if(!require(pacman)) install.packages("pacman")

pacman::p_load(data.table, devtools, backports, Hmisc, tidyr,dplyr,ggplot2,plyr,scales,readr,
               httr, DT, lubridate, tidyverse,reshape2,foreach,doParallel,caret,gbm,lubridate,praznik)

MLHO Vanilla implementation

Here we go through the vanilla implementation of MLHO using the synthetic data provided in the package.

The synthetic data syntheticmass downloaded and prepared according to MLHO input data model from SyntheticMass, generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

data("syntheticmass")

here's how dbmart table looks (there's an extra column that is the translation of the phenx concepts):

head(dbmart)

our demographic table contains: r colnames(dems).

split data into train-test

uniqpats <- c(as.character(unique(dbmart$patient_num)))
#using a 70-30 ratio
test_ind <- sample(uniqpats,
                   round(.3*length(uniqpats)))

test_labels <- subset(labeldt,labeldt$patient_num %in% c(test_ind))
print("test set lables:")
table(test_labels$label)
train_labels <- subset(labeldt,!(labeldt$patient_num %in% c(test_ind)))
print("train set lables:")
table(train_labels$label)
# train and test sets 
dat.train  <- subset(dbmart,!(dbmart$patient_num %in% c(test_ind)))
dat.test <- subset(dbmart,dbmart$patient_num %in% c(test_ind))

now dimensionality reduction on training set

MSMR lite

from here, we will split the data into a train and a test set and apply MSMR.lite to the training data

data.table::setDT(dat.train)
dat.train[,row := .I]
dat.train$value.var <- 1
uniqpats.train <- c(as.character(unique(dat.train$patient_num)))

##here is the application of MSMR.lite
dat.train <- MSMSR.lite(MLHO.dat=dat.train,
                        patients = uniqpats.train,
                        sparsity=0.005, 
                        labels = labeldt,
                        topn=200)

Notice that we are removing concepts that had prevalence less than 0.5% and then only taking the top 200 after the JMI rankings. See help (?mlho::MSMSR.lite) for MSMR.lite parameters.

Now we have the training data with the top 200 features, to which we can add demographic features. Now on to prepping the test set:

dat.test <- subset(dat.test,dat.test$phenx %in% colnames(dat.train))
setDT(dat.test)
dat.test[,row := .I]
dat.test$value.var <- 1
uniqpats.test <- c(as.character(unique(dat.test$patient_num)))

dat.test <- MSMSR.lite(MLHO.dat=dat.test,patients = uniqpats.test,sparsity=NA,jmi = FALSE,labels = labeldt)

Here notice that we only used MSMR.lite to generate the wide table that matches to the dat.train table.

Modeling

we will use the mlearn function to do the modeling, which includes training the model and testing it on the test set.

model.test <- mlearn(dat.train,
                   dat.test,
                   dems=dems,
                   save.model=FALSE,
                   classifier="gbm",
                   note="mlho_terst_run",
                   cv="cv",
                   nfold=5,
                   aoi="prediabetes",
                   multicore=FALSE)

try ?mlho::mlearn to learn more about mlearn parameters.

you can select a variety of models. For binray classification tasks, chose from:ORFlog, RRF,gbm,bayesglm,regLogistic,xgbDART,regLogistic, glmnet, bayesglm, nb, svmRadialWeights,avNNet,and ordinalRF. see the list and description of all available models

`mlearn` outputs

outputs of mlearn include:

model performance metrics: model.test$ROC
in model.test AUROC is r model.test$ROC$roc
final important features, which can be pulled and save from model.test$features
absolute errors for each patients using model.test$AE
missing important features model.test$missing.features

For instance, only the following r nrow(model.test$features) features were used in phenotyping the prediabetes:

model.test$features

we recommend iterating the training and testing and storing the coefficients in a directory.

Visualizing the outputs

Here we create a plot of the feature importance scores for each of the top (here we have r nrow(model.test$coefficients)) predictors identified by MLHO.

To do so, let's map the concept codes to their "English" translation. That's why we kept that 4th column called description in dbmart.

dbmart.concepts <- dbmart[!duplicated(paste0(dbmart$phenx)), c("phenx","DESCRIPTION")]
mlho.features <- data.frame(merge(model.test$features,dbmart.concepts,by.x="features",by.y = "phenx"))
datatable(dplyr::select(mlho.features,features,DESCRIPTION,`Feature importance`=Overall), options = list(pageLength = 5), filter = 'bottom')

now visualizing feature importance

(plot<- ggplot(mlho.features) +
    geom_segment(
      aes(y = 0,
          x = reorder(DESCRIPTION,Overall),
          yend = Overall,
          xend = DESCRIPTION),
      size=0.5,alpha=0.5) +
    geom_point(
      aes(x=reorder(DESCRIPTION,Overall),y=Overall),
      alpha=0.5,size=2,color="red") +
    theme_minimal()+
   coord_flip()+
    labs(y="Feature importance",x=""))

Notes for more complex implementation

As we recommended above, it's best to at least iterate the training process 5 times, in which case you'll have 5 files containing the important features from MLHO. We do this regularly. If you do so, you can compute the mean/median values for each feature.

Now if you want to get the regression coefficients and compute Odds Ratios (ORs), you can then select the top feature that were selected most frequently in MLHO iterations to run a regression model (pass glm to mlearn) and extract regression coefficients to then transform to ORs.

hestiri/mlho documentation built on March 20, 2023, 11:04 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com