vjcitn/BiocAIML: AI and ML excursions with Bioconductor and tidymodels

Introduction

Machine learning methods have long been central to computational biology. The BiocAIML package aims to illustrate the use of new approaches to machine learning with R in the context of genome research.

Taster: classification in glioblastoma multiforme (GBM)

Data setup

An illustrative dataset derived from The Cancer Genome Atlas (TCGA) is available through the BiocAIML package. The data are retrieved from the cloud using the Bioconductor package curatedTCGAData, augmented with clinical data published in @Brennan2013, and cached for future use.

suppressPackageStartupMessages({
library(BiocAIML)
library(survival)
library(rpart)
library(SummarizedExperiment)
})
gbmse = build_gbm_se()
gbmse

Sanity check

As a sanity check, we show that MGMT methylation status is associated with longer survival times in this dataset.

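# keep samples with recorded MGMT status and vital status
xm = gbmse[, gbmse$mgmt_status !="" & gbmse$vital_status !=""]
# overall survival in days; the event indicator is 1 for deceased patients
ss = Surv(xm$os_days, 1*(xm$vital_status=="DECEASED"))
# Kaplan-Meier curves stratified by MGMT status
plot(survfit(ss~xm$mgmt_status), lty=1:2)
legend(900, .95, lty=c(1,2), legend=c("MGMT methylated", "unmethylated"))
title("Time-on-study/vital status for 123 GBM patients\nanalyzed in Brennan et al. PMID 24120142")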
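
The curves give a visual impression only. One way to attach a test statistic to such a comparison is a log-rank test via survdiff from the survival package. The sketch below runs on simulated data, not on the GBM samples: the group labels, rates, and censoring fraction are made up for illustration. On the vignette's objects the analogous call would presumably be survdiff(ss ~ xm$mgmt_status), which is an assumption and was not run here.

```r
library(survival)

set.seed(1)
n <- 60
# hypothetical two-group design; labels are illustrative only
group <- rep(c("methylated", "unmethylated"), each = n)
# simulated exponential survival times, longer on average in the first group
time <- c(rexp(n, rate = 1 / 500), rexp(n, rate = 1 / 250))
# 1 = event observed, 0 = censored (about 20% censoring, chosen arbitrarily)
status <- rbinom(2 * n, 1, 0.8)

# log-rank test comparing the two survival distributions
fit <- survdiff(Surv(time, status) ~ group)
# chi-squared p-value with (number of groups - 1) degrees of freedom
pval <- pchisq(fit$chisq, df = length(fit$n) - 1, lower.tail = FALSE)
fit
pval
```

A small p-value would indicate that the difference between the two curves is unlikely under the hypothesis of identical survival distributions.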

Classification with randomly chosen features

Let's pick a random sample of 100 genes and classify the 'expression-based' subtype of GBM using rpart from the CRAN package rpart.

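set.seed(1234)
# sample 100 rows (genes) of the SummarizedExperiment at random
xms = xm[sample(seq_len(nrow(xm)), size=100),]
# one row per patient: subtype label plus one column per sampled gene
xmsdf = data.frame(cl=xms$expression_subclass, t(assay(xms)))
# classification tree predicting subtype from the 100 sampled genes
rp1 = rpart(cl~., data=xmsdf)
# confusion table on the training data; rpart drops patients with missing subtype
tt = table(predicted=predict(rp1, type="class"), given=na.omit(xmsdf$cl))
tt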
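
In a table of predicted against given classes, correct predictions lie on the diagonal, so misclassifications are the off-diagonal total. A minimal sketch on a toy table follows; the labels A, B, C and the object names are hypothetical and unrelated to the GBM subtypes.

```r
# toy labels, invented for illustration; both factors share the same levels
# so that the table is square and its diagonal is meaningful
lev <- c("A", "B", "C")
given     <- factor(c("A", "A", "B", "B", "C", "C", "C"), levels = lev)
predicted <- factor(c("A", "B", "B", "B", "C", "A", "C"), levels = lev)

tt_toy <- table(predicted = predicted, given = given)

# off-diagonal total = all predictions minus the correct ones
n_errors <- sum(tt_toy) - sum(diag(tt_toy))
# training misclassification rate
error_rate <- n_errors / sum(tt_toy)
n_errors    # 2 of the 7 toy predictions disagree with the given label
```

Because such a table is computed on the data used to fit the tree, the resulting rate is an optimistic estimate of performance on new samples.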

The number of "errors" among the 122 predictions is the off-diagonal total of this table, sum(tt)-sum(diag(tt)). The associated tree is:

plot(rp1)
text(rp1)

and the cross-validated error profile is

plotcp(rp1)

This indicates that the best tree obtainable with these features (100 randomly sampled genes) has 8 nodes and a cross-validated relative error (relative to declaring all patients to have the majority class) of around 80%.
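
The numbers behind plotcp are stored in the cptable component of a fitted rpart object, and prune can cut the tree back to a chosen complexity. The sketch below uses the built-in iris data as a stand-in, since it runs without the GBM download; the cp value passed to rpart.control and the rule of picking the row with the smallest cross-validated error are illustrative choices, not the vignette's procedure.

```r
library(rpart)

set.seed(1234)
# grow a deliberately large tree on a stand-in dataset (iris, not GBM)
fit <- rpart(Species ~ ., data = iris,
             control = rpart.control(cp = 0.001))

# columns: CP, nsplit, rel error, xerror, xstd
# errors are scaled so that the root (majority-class) tree has rel error 1
cpt <- fit$cptable

# row with the smallest cross-validated relative error
best <- which.min(cpt[, "xerror"])
# number of terminal nodes of that subtree is its split count plus one
n_leaves <- unname(cpt[best, "nsplit"]) + 1

# prune back to the complexity parameter of the selected row
pruned <- prune(fit, cp = cpt[best, "CP"])
cpt
n_leaves
```

A common, more conservative alternative is the one-standard-error rule, which selects the smallest tree whose xerror lies within one xstd of the minimum.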