knitr::opts_chunk$set(collapse = TRUE, comment = "#>")






CoRe R package interactive vignette



Introduction

The CoRe package implements two methods for identification of core fitness genes (at two level of stringency) from joint analyses of multiple genome-wide CRISPR-Cas9 screens:

1) The percentile ranking method builds on the starting basic intuition that if a gene is involved in essential processes in every cell, it should be among the top essential genes in all the analysed screens, even those performed on the cell lines whose viability is lowly impacted by the knock-out of that gene, i.e. the least dependent cell lines. The CoRe package implements four variants of this method, which verify if the starting assumption is satisfied for (considering each gene in turn) in different ways. The first variant accounts for the distribution of the rank position of the genes in their 90th percentile least dependent cell lines, obtained by sorting all genes based on their essentiality in that cell line. The second considers the distribution of the average gene rank positions in all the cell lines falling at least in 90th percentile cell lines of least dependent ones. The third version fits a linear model on the curve obtained when sorting all cell lines based on how much they are dependent on a given gene then considering the essentiality rank position of that gene across all cell lines (in the obtained order). The fourth version considers the area under curve (AUC) of gene ranks across ranked dependent cell lines. The density of the considered distributions is then estimated using a Gaussian kernel with width 0.1 and the point of minimum density identified. Genes falling below the point of minimum density are classified as common essential [1].

2) The Adaptive Daisy Model (ADaM) is a semi-supervised algorithm for computing a fuzzy-intersection of non-fuzzy sets by adaptively determining the minimal number of sets to which an element should belong in order to be a member of the fuzzy-intersection (the membership threshold). This threshold maximises the deviance from expectation of the cardinality of the resulting fuzzy-intersection, as well as the coverage of predefined elements. This method can be used to identify the minimal number of cell lines from a given tissue in which the inactivation of a gene (via CRISPR-Cas9 targeting) should exert a reduction of viability (or fitness effect) in order for that gene to be considered a core-fitness essential gene for the tissue under consideration. Iterating this method and considering cancer-type specific core fitness genes as sets to be fuzzy-intersected, then this method can be used to estimate a set of pan-cancer core fitness genes. This method is used to discriminate between core-fitness and context-specific essential genes in a study describing a large scale genome-wide CRISPR-Cas9 pooled drop-out screening [1] (a detailed description of the algorithm is included in the Supplemental Information of [2]). ADaM was inspired by the Daisy Model method introduced in [3]

License: GNU GPL v3

Contributors: Emre Karakoc, Clare Pacini, Alessandro Vinceti and Francesco Iorio

References

[1] J. M. Dempster et al., Agreement between two large pan-cancer CRISPR-Cas9 gene dependency data sets., Nat. Commun., vol. 10, no. 1, p. 5817, 2019, doi: 10.1038/s41467-019-13805-y.

[2] Behan FM & Iorio F & Picco G et al., Prioritisation of cancer therapeutic targets using CRISPR-Cas9 screens. Nature, In press.

[3] Hart T et al., High-Resolution CRISPR Screens Reveal Fitness Genes and Genotype-Specific Cancer Liabilities. Cell. 2015;163:1515–26.

Running Modalities

CoRe is an R package available at: https://github.com/DepMap-Analytics/CoRe.

This page contains instruction to quickly try the package. User manual and package documentation are available at https://github.com/DepMap-Analytics/CoRe/blob/master/Reference_Manual.pdf.

R package: quick start

Package installation

The R package is available on github at https://github.com/DepMap-Analytics/CoRe. We recommend to use it within Rstudio (https://www.rstudio.com/). To install it the following commands should be executed:

library(devtools)
install_github("DepMap-Analytics/CoRe")
library(CoRe)

This will install the following additional libraries:

magrittr, RCurl, readr, pheatmap, stringr

all publicly available on CRAN.

The package comes with built-in data objects containing predefined sets of BAGEL essential and non-essential genes (from Hart et al, Cell 2015), curated BAGEL genes (as defined in Behan et al, Nature 2019) other sets of a priori known essential genes (from http://www.gsea-msigdb.org/gsea/msigdb/index.jsp).

Execute ADaM pipeline in CoRe package step by step

In this case scenario, we are going to execute the ADaM method step by step. First, we have to download the Binary Event Matrix (BEM) from the Project Score website (https://score.depmap.sanger.ac.uk/). The matrix illustrates binarized CRISPR-Cas9 screen data performed on over 300 cell lines where 1 points out the essentiality of that gene for cell viability. In our case, we want to focus on cell lines belonging to Non-Small Cell Lung Carcinoma.

library(CoRe)

BinDepMat<-CoRe.download_BinaryDepMatrix()

LungDepMap<-CoRe.extract_tissueType_SubMatrix(BinDepMat,
                                              tissue_type="Non-Small Cell Lung Carcinoma")

However, the matrix can be subsetted to any tissues/cancer type of liking that can be found on the Cell Model Passport (https://cellmodelpassports.sanger.ac.uk/). We only recommend to focus on tissues/cancer types having at least 8 cell lines for a proper working of ADaM. Below are shown the available cancer tissues:

clannotation<-CoRe.download_AnnotationModel()

names(which(table(clannotation$tissue) >= 8))

Below are shown the available cancer types:

names(which(table(clannotation$cancer_type) >= 8))

Once we have the BEM, the first step in the ADaM pipeline will be to generate the profiles of the number of fitness genes across number of cell lines from observed data and corresponding cumulative sums. Next, ADaM will generate a set of random profiles by perturbing observed data. This allows to compute log10 odd ratios of observed/expected profiles of cumulative number of fitness genes in fixed number of cell lines.

pprofile<-CoRe.panessprofile(depMat=LungDepMap)

nullmodel<-CoRe.generateNullModel(depMat=LungDepMap,
                                  ntrials = 1000)

EO<-CoRe.empiricalOdds(observedCumSum = pprofile$CUMsums,
                       simulatedCumSum =nullmodel$nullCumSUM)

Calculate True Positive Rate (TPR) for fitness genes in at least n cell lines in the observed dependency matrix, with positive cases from a reference set of essential genes. We recommend using the curated BAGEL essential compared to BAGEL essential set in order to avoid any high-confidence cancer driver genes to have their status (essential or non-essential) being defined a priori. Others datasets are available by typing the option"??CoRe::datasets".

data(curated_BAGEL_essential)

TPR<-CoRe.truePositiveRate(LungDepMap,
                           curated_BAGEL_essential)

Given a cancer type made up of m cell lines, the crossover point is defined as the maximal value providing the best trade-off between TPR (inversely proportional to m) and the log10 ratio of observed/expected profiles (proportional to m). All the genes that are essential in a number of cell lines equals or greater than the crossover point are defined as core-fitness.

crossoverpoint<-CoRe.tradeoffEO_TPR(EO,
                                    TPR$TPR,
                                    test_set_name = 'curated BAGEL essential')

coreFitnessGenes<-CoRe.coreFitnessGenes(LungDepMap,
                                        crossoverpoint)

Execute previous pipeline using ad-hoc CoRe function for specific tissues/cancer types

The ADaM pipeline shown above is embedded in a single built-in function. In this case, we just need to download the dataset from the Cell Model Passport, choose the tissues/cancer type we want and run the function. Below, we are still considering the Non-Small Cell Lung Carcinoma.

clannotation<-CoRe.download_AnnotationModel()

SNCLC_CFgenes<-CoRe.CS_ADaM(BinDepMat,tissue_ctype = 'Non-Small Cell Lung Carcinoma',
                             clannotation = clannotation,
                             TruePositives = curated_BAGEL_essential)

However, the same holds true for the other options.

BRCA_CFgenes<-CoRe.CS_ADaM(BinDepMat,tissue_ctype = 'Breast',
                            clannotation = clannotation,
                            TruePositives = curated_BAGEL_essential)
CRC_CFgenes<-CoRe.CS_ADaM(BinDepMat,tissue_ctype = 'Large Intestine',
                           clannotation = clannotation,
                           TruePositives = curated_BAGEL_essential)

Execute ADaM method to compute Pancancer core-fitness genes

So far, we have focused on tissues/cancer types but ADaM can be also used for identifying pan-cancer core-fitness genes as described in Behan et al 2019, i.e. performing analyses at individual tissues/cancer-type level then collapsing results at pan-cancer level. The CoRe package already provides the user with a built-in function to automate the process. The user only needs to specify the tissues/cancer types to include in the analysis.

tissues_ctypes<-c("Haematopoietic and Lymphoid",
                  "Ovary",
                  "Peripheral Nervous System",
                  "Central Nervous System",
                  "Pancreas",
                  "Head and Neck",
                  "Bone",
                  "Lung",
                  "Large Intestine",
                  "Esophagus",
                  "Endometrium",
                  "Stomach",
                  "Breast")

PanCancer_CFgenes<-
  CoRe.PanCancer_ADaM(pancan_depMat = BinDepMat,
                      tissues_ctypes = tissues_ctypes,
                      clannotation = clannotation,
                      TruePositives = curated_BAGEL_essential,
                      display = FALSE)

head(PanCancer_CFgenes)

Visualization of CFness of a gene

The resulting pan-cancer core-fitness genes can be compared against positive and negative control genes. To do that, we have to download the quantitative dependency matrix from Project Score. This matrix contains the log fold changes associated with each gene in each screened cell lines. The BEM we used previously is derived from this matrix after certain post-processing steps illustrated in Behan et al., 2019.

depMat<-CoRe.download_DepMatrix(scaled = TRUE, 
                                ess = curated_BAGEL_essential, 
                                noness = curated_BAGEL_nonEssential)

PanCoreGene<-PanCancer_CFgenes[1]

CoRe.VisCFness(depMat,
               PanCoreGene,
               percentile=0.9,
               posControl='RPL12',
               negControl='MAP2K1')

Execute 90th Percentile method

As explained in the introduction, the 90th percentile method can be executed according to four different criteria:
- Fixed: it accounts for the distribution of the rank position of the genes in their 90th percentile least dependent cell lines.

CFgenes<-CoRe.PercentileCF(depMat,
                           method = 'fixed')
CFgenesAVG<-CoRe.PercentileCF(depMat,
                              method = 'average')
CFgenesSLOPE<-CoRe.PercentileCF(depMat,
                                method = 'slope')
CFgenesAUC<-CoRe.PercentileCF(depMat,
                               method = 'AUC')

Benchmarking of core-fitness genes

The core-fitness genes can be benchmarked against a priori known True and False Positives (FP) set of genes and other independent essential genes from the MsigDB. Particularly, the FPs set is obtained from the CCLE expression dataset of the DepMap portal (https://depmap.org/portal/download/). Genes that are lowly expressed consistently across cell lines are considered as false positive if misclassified as core-fitness.
IMPORTANT: the 90th percentile method can be executed also on a specific tissue or cancer type, please use the CoRe.extract_tissueType_SubMatrix function to extract the submatrix as shown for the ADaM method.

- Below the benchmarking of pan-cancer core-fitness genes computed by ADaM is shown:

data(EssGenes.DNA_REPLICATION_cons)
data(EssGenes.HISTONES)
data(EssGenes.KEGG_rna_polymerase)
data(EssGenes.PROTEASOME_cons)
data(EssGenes.SPLICEOSOME_cons)
data(EssGenes.ribosomalProteins)

signatures<-list(DNA_REPLICATION=EssGenes.DNA_REPLICATION_cons,
                 HISTONES=EssGenes.HISTONES,
                 RNA_POLYMERASE=EssGenes.KEGG_rna_polymerase,
                 PROTEASOME=EssGenes.PROTEASOME_cons,
                 SPLICEOSOME=EssGenes.SPLICEOSOME_cons,
                 RIBOSOMAL_PROTS=EssGenes.ribosomalProteins)

FPs<-CoRe.AssembleFPs()

ADaMperfClassic<-CoRe.CF_Benchmark(PanCancer_CFgenes,
                                    background = rownames(depMat),
                                    priorKnownSignatures = signatures,
                                    falsePositives=FPs)
ADaMperf<-CoRe.CF_Benchmark(CFgenes$cfgenes,
                            background = rownames(depMat),
                            priorKnownSignatures = signatures,
                            falsePositives=FPs)
ADaMperfAVG<-CoRe.CF_Benchmark(CFgenesAVG$cfgenes,
                                background = rownames(depMat),
                                priorKnownSignatures = signatures,
                                falsePositives=FPs)
ADaMperfSLOPE<-CoRe.CF_Benchmark(CFgenesSLOPE$cfgenes,
                                  background = rownames(depMat),
                                  priorKnownSignatures = signatures,
                                  falsePositives=FPs)
ADaMperfAUC<-CoRe.CF_Benchmark(CFgenesAUC$cfgenes,
                                background = rownames(depMat),
                                priorKnownSignatures = signatures,
                                falsePositives=FPs)


DepMap-Analytics/CoRe documentation built on July 6, 2022, 8:01 a.m.