In bhklab/PharmacoGxML:

knitr::opts_chunk$set(echo = TRUE)

\title{Biomarker discovery from large pharmacogenomics datasets}

\author[1,2]{Zhaleh Safikhani} \author[1, 2]{Petr Smrinov} \author[1, 2]{Seyed Ali Madani Tonekaboni} \author[1,2,3]{Benjamin Haibe-Kains}

\affil[1]{Princess Margaret Cancer Centre, University Health Network, Toronto Canada} \affil[2]{Department of Medical Biophysics, University of Toronto, Toronto Canada} \affil[3]{Department of Computer Science, University of Toronto, Toronto Canada}

\maketitle \tableofcontents

\section{Abstract} This course will focus on the challenges encountered when applying machine learning techniques in complex, high dimensional biological data. In particular, we will focus on biomarker discovery from pharmacogenomic data, which consists of developing predictors of response of cancer cell lines to chemical compounds based on their genomic features. From a methodological viewpoint, biomarker discovery is strongly linked to variable selection, through methods such as Supervised Learning with sparsity inducing norms (e.g., ElasticNet) or techniques accounting for the complex correlation structure of biological features (e.g., mRMR). Yet, the main focus of this talk will be on sound use of such methods in a pharmacogenomics context, their validation and correct interpretation of the produced results. We will discuss how to assess the quality of both the input and output data. We will illustrate the importance of unified analytical platforms, data and code sharing in bioinformatics and biomedical research, as the data generation process becomes increasingly complex and requires high level of replication to achieve robust results. This is particularly relevant as our portfolio of machine learning techniques is ever enlarging, with its set of hyperparameters that can be tuning in a multitude of ways, increasing the risk of overfitting when developing multivariate predictors of drug response.

\section{Introduction}

Pharmacogenomics holds much potential to aid in discovering drug response biomarkers and developing novel targeted therapies, leading to development of precision medicine and working towards the goal of personalized therapy. Several large experiments have been conducted, both to molecularly characterize drug dose response across many cell lines, and to examine the molecular response to drug administration. However, the experiments lack a standardization of protocols and annotations, hindering meta-analysis across several experiments.

PharmacoGx was developed to address these challenges, by providing a unified framework for downloading and analyzing large pharmacogenomic datasets which are extensively curated to ensure maximum overlap and consistency.

PharmacoGx is based on a level of abstraction from the raw experimental data, and allows bioinformaticians and biologists to work with data at the level of genes, drugs and cell lines. This provides a more intuitive interface and, in combination with unified curation, simplifies analyses between multiple datasets.

\subsection{Installation and Settings} PharmacoGx requires that several packages are installed. However, all dependencies are available from CRAN or Bioconductor.

source('http://bioconductor.org/biocLite.R')
biocLite('PharmacoGx')

library(devtools)
devtools::install_github("bhklab/mci", ref="light")
devtools::install_github("bhklab/PharmacoGx-ML")

Load PharamacoGx into your current workspace:

library(PharmacoGx, verbose=FALSE)
library(mCI, verbose=FALSE)
library(PharmacoGxML, verbose=FALSE)
library(Biobase, verbose=FALSE)

\subsection{Downloading PharmacoSet objects} We have made the PharmacoSet objects of the curated datasets available for download using functions provided in the package. A table of available PharmacoSet objects can be obtained by using the availablePSets function. Any of the PharmacoSets in the table can then be downloaded by calling downloadPSet, which saves the datasets into a directory of the users choice, and returns the data into the R session.

  availablePSets()
  GDSC <- downloadPSet("GDSC") 
  CCLE <- downloadPSet("CCLE")

\section{Reproducibility} PharmacoGx can be used to process pharmacogenomic datasets. First we want to check the heterogenity of cell lines in one of the available psets, CCLE.

mycol <- c("#8dd3c7","#ffffb3","#bebada","#fb8072","#80b1d3","#fdb462",
           "#b3de69","#fccde5","#d9d9d9","#bc80bd","#ccebc5","#ffed6f",
           "#a6cee3","#1f78b4","#b2df8a","#33a02c","#fb9a99","#e31a1c",
           "#fdbf6f","#ff7f00","#cab2d6","#6a3d9a","#ffff99","#b15928")
pie(table(CCLE@cell[,"tissueid"]), 
    col=mycol, 
    main="Tissue types", 
    radius=1, 
    cex=0.8)

\subsection{Plotting Drug-Dose Response Data} Drug-Dose response data included in the PharmacoSet objects can be conviniently plotted using the drugDoseResponseCurve function. Given a list of PharmacoSets, a drug name and a cell name, it will plot the drug dose response curves for the given cell-drug combination in each dataset, allowing direct comparisons of data between datasets.

CCLE.auc <- summarizeSensitivityProfiles(
                pSet=CCLE,
                sensitivity.measure="auc_published", 
                summary.stat="median",
                verbose=FALSE)

lapatinib.aac <- CCLE.auc["lapatinib",]
cells <- names(lapatinib.aac)[
                c(which.min(lapatinib.aac), 
                  which((lapatinib.aac > 0.2) & (lapatinib.aac < 0.4))[1],
                  which.max(lapatinib.aac))]
par(mfrow=c(2, 2))
drugDoseResponseCurve(drug="lapatinib", cellline=cells[1], 
                      pSets=CCLE, plot.type="Fitted", 
                      legends.label="auc_published")
drugDoseResponseCurve(drug="lapatinib", cellline=cells[2], 
                      pSets=CCLE, plot.type="Fitted", 
                      legends.label="auc_published")
drugDoseResponseCurve(drug="lapatinib", cellline=cells[3], 
                      pSets=CCLE, plot.type="Fitted", 
                      legends.label="auc_published")

\subsection{Pharmacological profiles} In pharmacogenomic studies Cells were also tested for their response to increasing concentrations of various compounds, and form this the IC50 and AUC were computed. These pharmacological profiles are available for all the psets in PharmacoGx.

library(ggplot2)
library(reshape2)
melted_data <- melt(CCLE.auc)
NA_rows <- unique(which(is.na(melted_data), arr.ind=T)[,1])
melted_data <- melted_data[-NA_rows,]
ggplot(melted_data, aes(x=Var1,y=value)) +
  geom_boxplot(fill="gray") +
  theme(axis.text.x=element_text(angle=90,hjust=1)) +
  xlab("Drugs") +
  ylab("AAC")
#hist(CCLE.auc["lapatinib",], xlab="Cells response to lapatinib(AAC)", 
#     col="gray", main="")

\section{Replication} In this section we will investigate the consistency between the GDSC and CCLE datasets. In both CCLE and GDSC, the transcriptome of cells was profiled using an Affymatrix microarray chip. Cells were also tested for their response to increasing concentrations of various compounds, and form this the IC50 and AUC were computed. However, the cell and drugs names used between the two datasets were not consistent. Furthermore, two different microarray platforms were used. However, PharmacoGx allows us to overcome these differences to do a comparative study between these two datasets.

GDSC was profiled using the hgu133a platform, while CCLE was profiled with the expanded hgu133plus2 platform. While in this case the hgu133a is almost a strict subset of hgu133plus2 platform, the expression information in PharmacoSet objects is summarized by Ensemble Gene Ids, allowing datasets with different platforms to be directly compared. The probe to gene mapping is done using the BrainArray customCDF for each platform \cite{sabatti_thresholding_2002}.

To begin, you would load the datasets from disk or download them using the downloadPSet function above.

We want to investigate the consistency of the data between the two datasets. The common intersection between the datasets can then be found using intersectPSet. We create a summary of the gene expression and drug sensitivity measures for both datasets, so we are left with one gene expression profile and one sensitivity profile per cell line within each dataset. We can then compare the gene expression and sensitivity measures between the datasets using a standard correlation coefficient.

common <- intersectPSet(pSets = list("CCLE"=CCLE, "GDSC"=GDSC), 
                        intersectOn = c("cell.lines", "drugs"), 
                        strictIntersect = TRUE)
drugs <- drugNames(common$CCLE)

##Example of concordant and discordant drug curves
cases <- rbind(
  c("CAL-85-1", "17-AAG"),
  c("HT-29", "PLX4720"),
  c("COLO-320-HSR", "AZD6244"),
  c("HT-1080", "PD-0332991"))

par(mfrow=c(2, 2))
for (i in 1:nrow(cases)) {
  drugDoseResponseCurve(pSets=common, 
                        drug=cases[i,2], 
                        cellline=cases[i,1], 
                        legends.label="ic50_published", 
                        plot.type="Fitted", 
                        ylim=c(0,130))
}

\subsection{Consistency of pharmacological profiles}

  ##AAC scatter plot 
GDSC.aac <- summarizeSensitivityProfiles(
    pSet=common$GDSC,
    sensitivity.measure='auc_recomputed', 
    summary.stat="median",
    verbose=FALSE)
CCLE.aac <- summarizeSensitivityProfiles(
  pSet=common$CCLE,
  sensitivity.measure='auc_recomputed', 
  summary.stat="median",
  verbose=FALSE)

GDSC.ic50 <- summarizeSensitivityProfiles(
    pSet=common$GDSC, 
    sensitivity.measure='ic50_recomputed', 
    summary.stat="median",
    verbose=FALSE)
CCLE.ic50 <- summarizeSensitivityProfiles(
  pSet=common$CCLE, 
  sensitivity.measure='ic50_recomputed', 
  summary.stat="median",
  verbose=FALSE)

drug <- "lapatinib"
#par(mfrow=c(1, 2))
myScatterPlot(x=GDSC.aac[drug,], 
              y=CCLE.aac[drug,], 
              method=c("transparent"), 
              transparency=0.8, pch=16, minp=50, 
              xlim=c(0, max(max(GDSC.aac[drug,], na.rm=T), max(CCLE.aac[drug,], na.rm=T))), 
              ylim=c(0, max(max(GDSC.aac[drug,], na.rm=T), max(CCLE.aac[drug,], na.rm=T))), 
              main="cells response to lapatinib", 
              cex.sub=0.7, 
              xlab="AAC in GDSC", 
              ylab="AAC in CCLE")
legend("topright",
         legend=sprintf("r=%s\nrs=%s\nCI=%s", 
                 round(cor(GDSC.aac[drug,], 
                           CCLE.aac[drug,], 
                           method="pearson",
                           use="pairwise.complete.obs"), 
                       digits=2),
                 round(cor(GDSC.aac[drug,], 
                           CCLE.aac[drug,], 
                           method="spearman",
                           use="pairwise.complete.obs"), 
                       digits=2),
                 round(paired.concordance.index(GDSC.aac[drug,],
                                                CCLE.aac[drug,],
                                                delta.pred=0,
                                                delta.obs=0)$cindex, 
                       digits=2)), 
         bty="n")

\subsection{consistency assessment improved by Modified Concordance Index} To better assess the concordance of multiple pharmacogenomic studies we introduced the modified concordance index (mCI). Recognizing that the noise in the drug screening assays is high and may yield to inaccurate sensitive-based ranking of cell lines with close AAC values, the mCI only considers cell line pairs with drug sensitivity (AAC) difference greater than $\delta$ .

c_index <-  mc_index <- NULL
for(drug in drugs){
  tt <- mCI::paired.concordance.index(GDSC.aac[drug,], CCLE.aac[drug,], delta.pred=0, delta.obs=0, alternative="greater")
  c_index <- c(c_index, tt$cindex)
  tt <- mCI::paired.concordance.index(GDSC.aac[drug,], CCLE.aac[drug,], delta.pred=0.2, delta.obs=0.2, alternative="greater", logic.operator="or")
  mc_index <- c(mc_index, tt$cindex)
}
mp <- barplot(as.vector(rbind(c_index, mc_index)), beside=TRUE, col=c("blue", "red"), ylim=c(0, 1), ylab="concordance index", space=c(.15,.85), border=NA, main="mCI")
text(mp, par("usr")[3], labels=as.vector(rbind(drugs, rep("", 15))), srt=45, adj=c(1.1,1.1), xpd=TRUE, cex=.8)
abline(h=.7, lty=2)

\subsection{Known Biomarkers} The association between molecular features and response to a given drug is modelled using a linear regression model adjusted for tissue source: $$Y = \beta_{0} + \beta_{i}G_i + \beta_{t}T + \beta_{b}B$$ where $Y$ denotes the drug sensitivity variable, $G_i$, $T$ and $B$ denote the expression of gene $i$, the tissue source and the experimental batch respectively, and $\beta$s are the regression coefficients. The strength of gene-drug association is quantified by $\beta_i$, above and beyond the relationship between drug sensitivity and tissue source. The variables $Y$ and $G$ are scaled (standard deviation equals to 1) to estimate standardized coefficients from the linear model. Significance of the gene-drug association is estimated by the statistical significance of $\beta_i$ (two-sided t test). P-values are then corrected for multiple testing using the false discovery rate (FDR) approach.

As an example of the reproducibility of biomarker discovery across pharmacogenomic studies, we can model the significance of the association between two drugs and their known biomarkers in CCLE and GDSC. We examine the association between drug 17-AAG and gene NQO1, as well as drug PD-0325901 and gene BRAF:

``` {r biomarker_discovery, results='asis'} features <- PharmacoGx::fNames(CCLE, "rna")[ which(featureInfo(CCLE, "rna")$Symbol == "NQO1")] ccle.sig.rna <- drugSensitivitySig(pSet=CCLE, mDataType="rna", drugs=c("17-AAG"), features=features, sensitivity.measure="auc_published", molecular.summary.stat="median", sensitivity.summary.stat="median", verbose=FALSE) gdsc.sig.rna <- drugSensitivitySig(pSet=GDSC, mDataType="rna", drugs=c("17-AAG"), features=features, sensitivity.measure="auc_published", molecular.summary.stat="median", sensitivity.summary.stat="median", verbose=FALSE) ccle.sig.mut <- drugSensitivitySig(pSet=CCLE, mDataType="mutation", drugs=c("PD-0325901"), features="BRAF", sensitivity.measure="auc_published", molecular.summary.stat="and", sensitivity.summary.stat="median", verbose=FALSE) gdsc.sig.mut <- drugSensitivitySig(pSet=GDSC, mDataType="mutation", drugs=c("PD-0325901"), features="BRAF", sensitivity.measure="auc_published", molecular.summary.stat="and", sensitivity.summary.stat="median", verbose=FALSE) ccle.sig <- rbind(ccle.sig.rna, ccle.sig.mut) gdsc.sig <- rbind(gdsc.sig.rna, gdsc.sig.mut) known.biomarkers <- cbind("GDSC effect size"=gdsc.sig[,1], "GDSC pvalue"=gdsc.sig[,6], "CCLE effect size"=ccle.sig[,1], "CCLE pvalue"=ccle.sig[,6]) rownames(known.biomarkers) <- c("17-AAG + NQO1","PD-0325901 + BRAF") library(xtable) xtable(known.biomarkers, digits=c(0, 2, -1, 2, -1), caption='Concordance of biomarkers across stuudies') par(mfrow=c(2, 2)) CCLE_expr <- t(exprs(summarizeMolecularProfiles(CCLE, mDataType="rna", fill.missing=FALSE))) CCLE_cells <- intersect(rownames(CCLE_expr), colnames(CCLE.aac)) plot(CCLE.aac["17-AAG", CCLE_cells], CCLE_expr[CCLE_cells, features], main="CCLE + 17-AAG + NQO1", cex.main=1, ylab="Predictions", xlab="drug sensitivity", pch=20, col="gray40")

GDSC_expr <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="rna", fill.missing=FALSE)))
GDSC_cells <- intersect(rownames(GDSC_expr), colnames(GDSC.aac))
plot(GDSC.aac["17-AAG", GDSC_cells], GDSC_expr[GDSC_cells, features],
 main="GDSC + 17-AAG + NQO1",
 cex.main=1, ylab="Predictions", xlab="drug sensitivity", pch=20, col="gray40")

CCLE_mut <- t(exprs(summarizeMolecularProfiles(CCLE, mDataType="mutation", fill.missing=FALSE, summary.stat="or")))

CCLE_cells <- intersect(rownames(CCLE_mut), colnames(CCLE.aac)) boxplot(CCLE.aac["PD-0325901", CCLE_cells]~ CCLE_mut[CCLE_cells, "BRAF"], col="gray80", pch=20, main="CCLE + PD-0325901 + BRAF", cex.main=1, xlab="mutation", ylab="drug sensitivity")

GDSC_mut <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="mutation", fill.missing=FALSE, summary.stat="or"))) GDSC_cells <- intersect(rownames(GDSC_mut), colnames(GDSC.aac)) boxplot(GDSC.aac["PD-0325901", GDSC_cells]~ GDSC_mut[GDSC_cells, "BRAF"], col="gray80", pch=20, main="GDSC + PD-0325901 + BRAF", cex.main=1, xlab="mutation", ylab="drug sensitivity")

\subsection{Machne Learning and Biomarker Discovery}
Some of the widely used multivariate machine learning methods such as elastic net, Random Forest (RF) and Support Vector Machine (SVM) have been already implemented in the MLWorkshop. It optimizes hyperparameters of these methods in the training phase. To assess the performance of the predictive models, it implements *m* number of sampling with *n-fold* cross validations (CV). The performance will then be assessed by multiple metrics including pearson correlation coefficient, concordance index and modified concordance index. 


```r 
library(mRMRe, verbose=FALSE)
library(Biobase, verbose=FALSE)
library(Hmisc, verbose=FALSE)
library(glmnet, verbose=FALSE)
library(caret, verbose=FALSE)
library(randomForest, verbose=FALSE)

##Preparing trainig dataset
train_expr <- t(exprs(summarizeMolecularProfiles(GDSC, mDataType="rna", fill.missing=FALSE, verbose=FALSE)))
aac <- summarizeSensitivityProfiles(GDSC, sensitivity.measure="auc_recomputed", drug="lapatinib", fill.missing=FALSE, verbose=FALSE)
cells <- intersect(rownames(train_expr), names(aac))
df <- as.matrix(cbind(train_expr[cells,], "lapatinib"=aac[cells]))


##Preparing validation dataset
validation_expr <- summarizeMolecularProfiles(CCLE, mDataType="rna", fill.missing=FALSE, verbose=FALSE)
actual_labels <- summarizeSensitivityProfiles(CCLE, sensitivity.measure="auc_recomputed", drug="lapatinib", fill.missing=FALSE, verbose=FALSE)


for(method in c("ridge", "lasso", "random_forest", "svm")){
  par(mfrow=c(2, 1))
  res <- optimization(train=df[, -ncol(df), drop=F],
                      labels=t(df[, ncol(df), drop=F]),
                      method=method,
                      folds.no=5,
                      sampling.no=1,
                      features.no=100,
                      feature.selection="mRMR",
                      assessment=c("corr", "mCI"))

  validation_labels <- prediction(model=res$model$lapatinib,
                                  validation.set=t(exprs(validation_expr)),
                                  validation.labels=actual_labels,
                                  method=method,
                                  assessment="mCI")

}