knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
# make sure the package is installed --> devtools::install_github("hestiri/mlho") library(mlho)
Install and load the required packages:
if(!require(pacman)) install.packages("pacman") pacman::p_load(data.table, devtools, backports, Hmisc, tidyr,dplyr,ggplot2,plyr,scales,readr,Rmisc, httr, DT, lubridate, tidyverse,reshape2,foreach,doParallel,caret,gbm,lubridate,praznik)
This document demonstrates how you can use MLHO in iterative modeling to perform phenome-wide association studies.
Again, I will use the syntheticmass
data, which I downloaded and prepared according to MLHO input data model from SyntheticMass, generated by SyntheaTM, an open-source patient population simulation made available by The MITRE Corporation.
data("syntheticmass")
To identify features that are useful for your prediction task, you can iteratively apply MLHO so the train-test sampling is repeated. In function mlho.it
this is possible. The MSMR.lite
and mlearn
functionalities are all included in this function.
See the help to learn more about parameters.
data.table::setDT(dbmart) dbmart[,row := .I] dbmart$value.var <- 1 mlho.features <- mlho.it(dbmart, labels = labeldt, dems, test.sample = 25, MSMR.sparsity=0.001, MSMR.topn=300, mlearn.note="mlho phewas test run", mlearn.aoi="prediabetes", multicore=TRUE, iterations=7)
here are your features:
datatable(dplyr::select(mlho.features,features,DESCRIPTION), options = list(pageLength = 5), filter = 'bottom')
As an alternative to p-value adjustment, we have developed a method for computing confidence scores for features that MLHO identifies, using mlho.features file saved above.
mlho.scores <- mlho.cs(mlho.features)
the confidence score (cs) is based on the number of iterations and considers consensus in a feature identified as positive/negative in each iteration. For our PASC article, we used features with cs>80
. For this demonstration, let's do cs >= 80
. We will use these features to run a GLM model without any regularization to obtain multivariate coefficients.
mlho.features <- c(as.character(subset(mlho.scores$features,mlho.scores$cs >= 80)))
we got r length(mlho.features)
features for our multivariate GLM model -- down from r length(unique(dbmart$phenx))
.
Look at MLHO Vanilla implementation article to refresh these next steps. For the purpose of extracting the GLM coefficients, we implement mlearn
slightly differently, where there's no test set and the model is developed on the entire population. To do this we set dat.train
to NULL
.
dbmart <- subset(dbmart,dbmart$phenx %in% mlho.features) setDT(dbmart) dbmart[,row := .I] dbmart$value.var <- 1 uniqpats <- c(as.character(unique(dbmart$patient_num))) GLM.data <- MSMSR.lite(MLHO.dat=dbmart,patients = uniqpats,sparsity=NA,jmi = FALSE,labels = labeldt) GLM.output <- mlearn(dat.train=GLM.data, dat.test=NULL, dems=dems, save.model=FALSE, classifier="GLM", note="extracting_coefficients", aoi="prediabetes", multicore=FALSE) datatable(GLM.output, options = list(pageLength = 5), filter = 'bottom')
You can see that off all the r length(mlho.features)
features that MLHO identified with confidence score >= 80, a GLM model found r nrow(subset(GLM.output,GLM.output$P <= 0.05))
!
Now let's visualize the mean coefficients.
dbmart.concepts <- dbmart[!duplicated(paste0(dbmart$phenx)), c("phenx","DESCRIPTION")] GLM.output <- data.frame(merge(GLM.output,dbmart.concepts,by.x="features",by.y ="phenx",all.x=T)) GLM.output$DESCRIPTION <- ifelse(is.na(GLM.output$DESCRIPTION),GLM.output$features,GLM.output$DESCRIPTION) (plot<- ggplot(GLM.output) + geom_segment( aes(y = 1, x = reorder(DESCRIPTION,OR), yend = OR, xend = DESCRIPTION), size=0.5,alpha=0.5) + geom_point( aes(x=reorder(DESCRIPTION,OR),y=OR), alpha=0.5,size=2,color="red") + theme_minimal()+ scale_y_continuous(limits=c(0,10))+ coord_flip()+ labs(y="Odds Ratio",x=""))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.