In najha/CELLector: Genomics guided selection of cancer cell lines

title: "CELLector: Genomics Guided Selection of Cancer in vitro models" author: "Hanna Najgebauer and Francesco Iorio" subtitle: 'R package interactive vignette' date: "r Sys.Date()" output: rmarkdown::html_vignette: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Vignette Title} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")

CELLector R package interactive vignette

CELLector manuscript example study case

CELLector App online tutorial

Introduction

CELLector is a computational tool assisting experimental scientists in the selection of the most clinically relevant cancer cell lines to be included in a new in-vitro study (or to be considered in a retrospective study), in a patient-genomic guided fashion.

CELLector combines methods from graph theory and market basket analysis; it leverages tumour genomics data to explore, rank, and select optimal cell line models in a user-friendly way, through the CELLector Rshiny App. This enables making appropriate and informed choices about model inclusion/exclusion in retrospective analyses, future studies and it makes possible bridging cancer patient genomics with public available databases from cell line based functional/pharmacogenomic screens, such as CRISPR-cas9 dependency datasets and large-scale in-vitro drug screens.

Furthermore, CELLector includes interface functions to synchronise built-in cell line annotations and genomics data to their latest versions from the Cell Model Passports resource. Through this interface, bioinformaticians can quickly generate binary genomic event matrices (BEMs) accounting for hundreds of cancer cell lines, which can be used in systematic statistical inferences, associating patient-defined cell line subgroups with drug-response/gene-essentiality, for example through GDSC tools.

Additionally, CELLector allows the selection of models within user-defined contexts, for example, by focusing on genomic alterations occurring in biological pathways of interest or considering only predetermined sub-cohorts of cancer patients.

Finally, CELLector identifies combinations of molecular alterations underlying disease subtypes currently lacking representative cell lines, providing guidance for the future development of new cancer models.

License: GNU GPL v3

Najgebauer, H., Yang, M., Francies, H., Pacini, C., Stronach, E. A., Garnett, M. J., Saez-Rodriguez, J., & Iorio, F. CELLector: Genomics Guided Selection of Cancer in vitro Models. https://www.biorxiv.org/content/10.1101/275032v3

Running Modalities

CELLector can be used in three different modalities:

(i) as an R package (within R, code available at: https://github.com/francescojm/CELLector),
(ii) as an online R shiny App (available at: https://ot-cellector.shinyapps.io/CELLector_App/),
(iii) running the R shiny App locally (within Rstudio, code available at: https://github.com/francescojm/CELLector_App).

This page contains instruction to quickly try the package. User manual and package documentation are available at https://github.com/francescojm/CELLector/blob/master/CELLector.pdf.

A tutorial on how to use the online Rshiny app (containing also instructions on how to run it locally) is available here

R package: quick start

Package installation

To R package is available on github at https://github.com/francescojm/CELLector. We recommend to use it within Rstudio (https://www.rstudio.com/). To install it the following commands should be executed:

library(devtools)
install_github("Francescojm/CELLector")
library(CELLector)

This will install the following additional libraries

arules, dplyr, stringr, data.tree, sunburstR, igraph, collapsibleTree, methods

all publicly available on CRAN or Bioconductor. The package comes with built-in data objects containing genomic data for large cohorts of primary tumours and cancer cell lines (from Iorio et al, Cell 2016). Optionally, users can build their own genomic binary event matrices (BEMs) from their data or regenerating the built-in data objects using data from Iorio et al, Cell 2016 but using customised filters for cancer driver genes and variants. Last, cell line BEMs can be reassembled using annotations and data from the Cell Model Passports, through a dedicated module.

Selecting Colorectal Cancer Cell Lines that are representative of a defined patients' sub-group

In this simple case study scenario, we want to select the 5 most clinically relevant in vitro models that best represent the genomic diversity of TP53 mutated colorectal tumours. The models we wish to select should be microsatellite stable and harbour at least one alteration in the following signalling pathways: PI3K-AKT-MTOR signalling, RAS-RAF-MEK-ERK/JNK signalling and WNT signalling. Finally, we want the model selection to be guided based on somatic mutations and copy number alterations that are observed in at least 3% of the studied TP53 mutant tumour cohort.

After loading the package, we need to load few data objects

library(CELLector)

## Somatic mutations and copy number alterations found in primary tumours and cell lines
data(CELLector.PrimTum.BEMs)
data(CELLector.CellLine.BEMs)

## Sets of Cancer Functional Events (CFEs: somatic mutations and copy number alterations) involving
## genes in predefined key cancer pathways
data(CELLector.Pathway_CFEs)

## Objects used for decoding cna CFEs identifiers 
data(CELLector.CFEs.CNAid_mapping)
data(CELLector.CFEs.CNAid_decode)

Subsequently we specify the Colon/Rectal Adenocarcinoma (COREAD) TCGA label to select the cancer type to analyse. Other available cancer types in this version of the package are: BLCA, BRCA, COREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC. Finally, we remove possible sample identifier duplications from the primary tumour dataset.

### Change the following two lines to work with a different cancer type
tumours_BEM<-CELLector.PrimTum.BEMs$COREAD
CELLlineData<-CELLector.CellLine.BEMs$COREAD


### unicize the sample identifiers for the tumour data
tumours_BEM<-CELLector.unicizeSamples(tumours_BEM)

At this point, we are ready to build the CELLector searching space, which will contain the most recurrent patients subtypes with matched signatures of cancer functional events (CFEs, defined in Iorio et al, Cell 2016). The value of the arguments of the below function specify that we want to look only at TP53 mutant cancer patients, with alterations in genes belonging to three different cancer pathways.

### building a CELLector searching space focusing on three pathways
### and TP53 mutant patients only
CSS<-CELLector.Build_Search_Space(ctumours = t(tumours_BEM),
                                  verbose = FALSE,
                                  minGlobSupp = 0.03,
                                  cancerType = 'COREAD',
                                  pathwayFocused = c("RAS-RAF-MEK-ERK / JNK signaling",
                                                     "PI3K-AKT-MTOR signaling",
                                                     "WNT signaling"),
                                  pathway_CFEs = CELLector.Pathway_CFEs,
                                  cnaIdMap = CELLector.CFEs.CNAid_mapping,
                                  cnaIdDecode = CELLector.CFEs.CNAid_decode,
                                  cdg = CELLector.HCCancerDrivers,
                                  subCohortDefinition='TP53')

The searching space is stored in a list, whose element can be inspected as follows:

### visualising the CELLector searching space as a binary tree
CSS$TreeRoot

### visualising the first attributes of the tree nodes
CSS$navTable[,1:11]

### visualising the sub-cohort of patients whose genome satisfies the rule of the 4th node
str_split(CSS$navTable$positivePoints[4],',')

The searching space can be also interactively explored as a collapsible three or a sunburst.

### visualising the CELLector searching space as interactive collapsible binary tree
CELLector.visualiseSearchingSpace(searchSpace = CSS,CLdata = CELLlineData)

### visualising the CELLector searching space as interactive sunburst
CELLector.visualiseSearchingSpace_sunBurst(searchSpace = CSS)

Finally 10 of the most representative cell lines can be selected with the following command:

### take all the signatures from the searching space
Signatures <- CELLector.createAllSignatures(CSS$navTable)

### mapping the cell lines on the CELLector searching space
ModelMat<-CELLector.buildModelMatrix(Signatures$ES,CELLlineData,CSS$navTable)


### selecting 10 cell lines
selectedCellLines<-CELLector.makeSelection(modelMat = ModelMat,
                                           n=10,
                                           searchSpace = CSS$navTable)

knitr::kable(selectedCellLines,align = 'l')

Scoring the clinical relevance of cancer cell lines

With CELLector is possible to quantify the quality of each cell line in terms of its ability to represent an entire cohort of considered disease matching patients. This is quantified as a trade-off between two factors. The first factor is the length of the CELLector signatures (in terms of number of composing individual alterations) that are present in the cell line under consideration. This is proportional to the granularity of the representative ability of the cell line, i.e. the longest the signature the more precisely defined is the represented sub-cohort of patients. The second factor is the size of the patient subpopulation represented by the signatures that can be observed in the cell line under consideration, thus accounting for the prevalence of the sub-cohort modeled by that cell line. An input parameter allows for these two factors to be weighted equally or differently.

In this example, we want to score the quality of the cell lines in the colorectal cancer panel, based on the searching space assembled in the previous examples.

### Scoring colorectal cancer cell lines based on the searching space assembled in the previous examples
CSscores<-CELLector.Score(NavTab = CSS$navTable,CELLlineData = CELLlineData)

### Visualising the best 3 cell lines according to the considered searching space and criteria
knitr::kable(CSscores[1:3,],align = 'l')

In this other example, we want to recompute the cell line scores but prioritising the size of the sub-cohort represented by each cell line over how much it is precisely defined.

### Scoring colorectal cancer cell lines based on the searching space assembled in the previous examples
CSscores<-CELLector.Score(NavTab = CSS$navTable,CELLlineData = CELLlineData,alpha = 0.10)

### Visualising the best 3 cell lines according to the considered searching space and criteria
knitr::kable(CSscores[1:3,],align = 'l')

Assembling Genomic Binary Event Matrices (BEMs) and selecting cell line using customised genomic data

In the following example, we show how it is possible to run CELLector analyses and cell line selections from user defined genomic data. To this aim we will first generate genomic binary event matrices (BEMs) for Colorectal Cancer patients and cell lines using somatic variant catalogues derived from the TCGA (as presented in Iorio et al, Cell 2016) and from the latest installment of the Cell Model Passports, respectively for patients and cell lines, pre-selecting sets of cancer driver genes and variants to be considered in these catalogues. Fully user defined genomic variants catalogues can also be used. In this example we will particularly focus on human derived microsatellite stable diploid colorectal carcinoma cell lines resected from male patients.

First we build the BEMs as explained, with the following commands

## loading a set of high-confidence cancer driver genes from Iorio et al, Cell 2016
data(CELLector.HCCancerDrivers)

## loading a set of variants observed it at least two patients in COSMIC
data(CELLector.RecfiltVariants)

## Assembling a genomic binary event matrix (BEM) for human derived microsatellite stable diploid colorectal carcinoma cell lines,
## resected from male patients using genomic data from the Cell Model Passports, considering only variants observed in COSMIC in at lest two patients
## in high-confidence cancer driver genes
COREAD_cl_BEM <- CELLector.CELLline_buildBEM(
                  Tissue='Large Intestine',
                  Cancer_Type = 'Colorectal Carcinoma',
                  msi_status_select = 'MSI',
                  ploidy_th = c(2,2),
                  GenesToConsider = CELLector.HCCancerDrivers,
                  VariantsToConsider = CELLector.RecfiltVariants)

## showing first entries of the BEM
head(COREAD_cl_BEM)

## bar diagram with mutation frequencies in the BEM for 20 top frequently mutated genes
barplot(sort(colSums(COREAD_cl_BEM[,3:ncol(COREAD_cl_BEM)]),decreasing=TRUE)[1:20],
        las=2,ylab='n. mutated cell lines')

## Assembling BEM for Colorectal Adenocarcinoma (COAD/READ) primary tumours using data from the TCGA as presented in Iorio et al 2016,
## considering only variants observed in COSMIC in at lest two patients in high-confidence cancer driver genes
COREAD_tum_BEM<-
  CELLector.Tumours_buildBEM(Cancer_Type = 'COAD/READ',GenesToConsider = CELLector.HCCancerDrivers,
                             VariantsToConsider = CELLector.RecfiltVariants)

## showing first 25 entries of the BEM
COREAD_tum_BEM[1:5,1:5]

## showing a bar diagram with mutation frequency of 30 top frequently altered genes
barplot(100*sort(rowSums(COREAD_tum_BEM),
                 decreasing=TRUE)[1:30]/ncol(COREAD_tum_BEM),
                 las=2,ylab='% patients')

Next, we assemble a CELLector searching space and select representative cell lines as for the previous examples.

### Assembling the searching space and visualising it as interactive sunburst

COREAD_tum_BEM<-CELLector.unicizeSamples(COREAD_tum_BEM)
CSS<-CELLector.Build_Search_Space(ctumours = t(COREAD_tum_BEM),
                                  verbose = FALSE,
                                  minGlobSupp = 0.02,
                                  cancerType = 'COREAD')

CELLector.visualiseSearchingSpace_sunBurst(CSS)

### take all the signatures from the searching space
Signatures <- CELLector.createAllSignatures(CSS$navTable)

### mapping the cell lines on the CELLector searching space
ModelMat<-CELLector.buildModelMatrix(Signatures$ES,COREAD_cl_BEM,CSS$navTable)


### selecting 5 representative cell lines
selectedCellLines<-CELLector.makeSelection(modelMat = ModelMat,
                                           n=5,
                                           searchSpace = CSS$navTable)

knitr::kable(selectedCellLines,align = 'l')