In hestiri/mlho: Machine Learning for Health Outcomes (MLHO)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

# make sure the package is installed --> devtools::install_github("hestiri/mlho")
library(mlho)

Install and load the required packages:

if(!require(pacman)) install.packages("pacman")

pacman::p_load(data.table, devtools, backports, Hmisc, tidyr,dplyr,ggplot2,plyr,scales,readr,Rmisc,
               httr, DT, lubridate, tidyverse,reshape2,foreach,doParallel,caret,gbm,lubridate,praznik)

Multivariate PheWAS analysis with MLHO

This document demonstrates how you can use MLHO in iterative modeling to perform phenome-wide association studies.

Again, I will use the syntheticmass data, which I downloaded and prepared according to MLHO input data model from SyntheticMass, generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.

data("syntheticmass")

Iterative MLHO implementation

To identify features that are useful for your prediction task, you can iteratively apply MLHO so the train-test sampling is repeated. In function mlho.it this is possible. The MSMR.lite and mlearn functionalities are all included in this function.

See the help to learn more about parameters.

data.table::setDT(dbmart)
dbmart[,row := .I]
dbmart$value.var <- 1


mlho.features <- mlho.it(dbmart,
                  labels = labeldt,
                  dems,
                  test.sample = 25,
                  MSMR.sparsity=0.001,
                  MSMR.topn=300,
                  mlearn.note="mlho phewas test run",
                  mlearn.aoi="prediabetes",
                  multicore=TRUE,
                  iterations=7)

here are your features:

datatable(dplyr::select(mlho.features,features,DESCRIPTION), options = list(pageLength = 5), filter = 'bottom')

Computing MLHO confidence scores (CS)

As an alternative to p-value adjustment, we have developed a method for computing confidence scores for features that MLHO identifies, using mlho.features file saved above.

mlho.scores <- mlho.cs(mlho.features)

the confidence score (cs) is based on the number of iterations and considers consensus in a feature identified as positive/negative in each iteration. For our PASC article, we used features with cs>80. For this demonstration, let's do cs >= 80. We will use these features to run a GLM model without any regularization to obtain multivariate coefficients.

mlho.features <- c(as.character(subset(mlho.scores$features,mlho.scores$cs >= 80)))

we got r length(mlho.features) features for our multivariate GLM model -- down from r length(unique(dbmart$phenx)).

Look at MLHO Vanilla implementation article to refresh these next steps. For the purpose of extracting the GLM coefficients, we implement mlearn slightly differently, where there's no test set and the model is developed on the entire population. To do this we set dat.train to NULL.

dbmart <- subset(dbmart,dbmart$phenx %in% mlho.features)
setDT(dbmart)
dbmart[,row := .I]
dbmart$value.var <- 1
uniqpats <- c(as.character(unique(dbmart$patient_num)))
GLM.data <- MSMSR.lite(MLHO.dat=dbmart,patients = uniqpats,sparsity=NA,jmi = FALSE,labels = labeldt)
GLM.output <- mlearn(dat.train=GLM.data,
                   dat.test=NULL,
                   dems=dems,
                   save.model=FALSE,
                   classifier="GLM",
                   note="extracting_coefficients",
                   aoi="prediabetes",
                   multicore=FALSE)
datatable(GLM.output, options = list(pageLength = 5), filter = 'bottom')

You can see that off all the r length(mlho.features) features that MLHO identified with confidence score >= 80, a GLM model found r nrow(subset(GLM.output,GLM.output$P <= 0.05))!

Now let's visualize the mean coefficients.

dbmart.concepts <- dbmart[!duplicated(paste0(dbmart$phenx)), c("phenx","DESCRIPTION")]
GLM.output <- data.frame(merge(GLM.output,dbmart.concepts,by.x="features",by.y ="phenx",all.x=T))
GLM.output$DESCRIPTION <- ifelse(is.na(GLM.output$DESCRIPTION),GLM.output$features,GLM.output$DESCRIPTION)

(plot<- ggplot(GLM.output) +
    geom_segment(
      aes(y = 1,
          x = reorder(DESCRIPTION,OR),
          yend = OR,
          xend = DESCRIPTION),
      size=0.5,alpha=0.5) +
    geom_point(
      aes(x=reorder(DESCRIPTION,OR),y=OR),
      alpha=0.5,size=2,color="red") +
    theme_minimal()+
    scale_y_continuous(limits=c(0,10))+
   coord_flip()+
    labs(y="Odds Ratio",x=""))