CELLector.Build_Search_Space: CELLector search space construction
In najha/CELLector: Genomics guided selection of cancer cell lines

CELLector.Build_Search_Space

R Documentation

CELLector search space construction

Description

This function assembles a user defined CELLector search space analysing genomic data from a larg cohort of cancer patients (specified in input). It identifies recurrent subtypes with matched genomic signatures (as combination of cancer functional events (CFEs), defined in [1]), linking them into a hierarchical structure shaped as a a binary three with a corresponding navigable table, as detailed in [2].

Usage

CELLector.Build_Search_Space<-function(ctumours,
                                       cancerType,
                                       minlen=1,
                                       verbose=TRUE,
                                       mutOnly=FALSE,
                                       cnaOnly=FALSE,
                                       includeHMS=FALSE,
                                       minGlobSupp=0.01,
                                       FeatureToExclude=NULL,
                                       pathway_CFEs = NULL,
                                       pathwayFocused=NULL,
                                       subCohortDefinition=NULL,
                                       NegativeDefinition=FALSE,
                                       cnaIdMap=NULL,
                                       cnaIdDecode=NULL,
                                       hmsIdDecode=NULL,
                                       cdg=NULL,
                                       UD_genomics=FALSE)

Arguments

`ctumours`	A binary event matrix (BEM) modeling a cohort of cancer patients. With cancer functional events (CFEs) on the columns and sample identifers on the rows. See `CELLector.PrimTum.BEMs` for further details
`cancerType`	The cancer type under consideration (specified via a TCGA label): currently available types = BLCA, BRCA, COREAD, GBM, HNSC, KIRC, LAML, LGG, LUAD, LUSC, OV, PRAD, SKCM, STAD, THCA, UCEC
`minlen`	The minimal length of the genomic signatures (how many indivudal CFEs it is made of) in order to be considered in the analysis (1 by default)
`verbose`	A boolean argument specifying whether step-by-step information on the algorithm progression should be displayed run-time
`mutOnly`	A boolean argument specifying whether only CFEs involving somatic mutations should be considered in the analysis. If the `cnaOnly` argument is equal to `TRUE` then this must be `FALSE` (default value)
`cnaOnly`	A boolean argument specifying whether only CFEs involving copy number alterations (CNAs) of chromosomal segments that are recurrently CN altered should be considered in the analysis. If the `mutOnly` argument is equal to `TRUE` then this must be `FALSE` (default value)
`includeHMS`	A boolean argument specitying whether methylation data should be considered while building the searching space (`FALSE` by default).
`minGlobSupp`	Minimal size of the outpputted subtypes, as ratio of the number patients included in the whole cohort (1% by default).
`FeatureToExclude`	A string (or a vector of strings) with identifiers of CFEs that should be ignored
`pathway_CFEs`	A named list of string vectors, whose elements are CFEs involving genes in a biological pathway (specified by the name of the corresponding entry). A list for 14 key cancer pathways is contained in the `CELLector.Pathway_CFEs` data object (see corresponding help page for further deatails)
`pathwayFocused`	If different from `NULL` (default value), it should be a vector of strings. In this case the analysis will consider only CFEs involving genes in a set of pathways, whose names are contained in this argument and must be present as names of the `pathway_CFEs` argument
`subCohortDefinition`	If different from `NULL` (default value), it should be a string containing the identifier of a CFE. In this case the analysis will consider only the primary tumour samples harbouring (or not harbouring, depending on the `NegativeDefinition` argument) the specified CFE
`NegativeDefinition`	If the `subCohortDefinition` argument is not `NULL` then this paramenter determines whether to consider primary tumour samples that harbour (if equal to `FALSE`, default value) or not (if equal to `TRUE`) the specified CFE
`cnaIdMap`	A data frame mapping chromosomal regions of recurrent copy number amplifications/deletions in cancer (RACSs, as defined in [1]) identified via ADMIRE [3] in the context of specific cancer types to PanCancer RACSs. The built-in object `CELLector.CFEs.CNAid_mapping` (or an alternative data frame with the same format) should be used.
`cnaIdDecode`	A table with identifiers of cancer functional events (CFEs) involving chromosomal regions of recurrent copy number alterations (RACSs, as defined by [1], i.e. identified throgh ADMIRE [3]) and their annotation. The built-in object `CELLector.CFEs.CNAid_decode` (or an alternative data frame with the same format) should be used.
`hmsIdDecode`	Data frame containing annotation for the hypermethylated gene promoters CFEs. The format should be the same of the `CELLector.CFEs.HMSid_decode` object.
`UD_genomics`	A boolean argument specifying whether the analysis is performed on user defined genomic data (`TRUE`) or CELLector buit-in genomic data (`FALSE`, default value).
`cdg`	A list of genes that are used when decoding the identifiers of cancer functional events (CFEs) involving chromosomal regions of recurrent copy number alterations (RACSs, as defined by [1]). These will be visualised in the signatures containing the RACSs including them. A predefined list of high confidence cancer driver genes (from [1]) is provided as built-in data object (`CELLector.HCCancerDrivers`)

Details

Starting from an initial cohort of patients affected by a given cancer type and modeled by the inputted binary event matrix (BEM), the most frequent alteration or set of molecular alterations (depending on the minlen argument) with the largest support (the subpopulation of patients in which these alterations occur simultaneously) is identified using the eclat function of the arules R package.

Based on this, the cohort of patients is split into two subpopulations depending on the collective presence or absence of the identified alterations. This process is then executed recursively on the two resulting subpopulations and it continues until all the alteration sets (with a support of minimal size, as specified in the minGlobSupp argument) are identified.

Each of the alterations sets identified through this recursive process is stored in a tree node. Linking nodes identified in adjacent recursions yields a binary tree: the CELLector search space. Each individual path (from the root to a node) of this tree defines a rule (signature), represented as a logic AND of multiple terms (or their negation), one per each node in the path. If the genome of a given patient in the analysed cohort satisfies the rule then it is contained in the subpopulation represented by the terminal node of that path. Collectively, all the paths in the search space provide a representation of the spectrum of combinations of molecular alterations observed in a given cancer type, and their clinical prevalence in the analysed patient population.

Value

A named list with the CELLector search space stored as a data.tree object in the TreeRoot field and as a navigable table: a data frame with a row for each node of the tree and the following columns

Idx: A numerical index for the node
Item: The most supported CFE (or a combination of CFE), identified at the iteration in which the node has been added to the three, (i) in the whole cohort of patients (for the Root), (ii) in the sub population that satisfies the parent node rule (for Left.Child nodes) or (iii) its complement (for Right.Child nodes)
ItemsDecoded: Same as Item but with identifiers of RACSs decoded, i.e. with loci and included driver genes (inputted in the cdg argument), indicated among brackets
Type: The node type: Root (first node added), Right.Child (a node resulting from the analyses of the complementar population of patients with respect to that satisfifying the Parent node rule), Left.Child (a node resulting from refining the population of patients satisfifying the Parent node rule)
Parent.Idx: The numerical index of the parent node (0 for the Root)
AbsSupport: The number of patients satisfying the node rule
CurrentTotal: The number of patients included in the population under consideration at the iteration time of the node inclusion in the tree, this is the same of the parent's AbsSupport for Left.Child nodes
PercSupport: The ratio of patients collectively harbouring the combination of CFEs specified in Items within the subpopulation under consideration at the iteration time of the node inclusion in the tree (whose size is specified in CurrentTotal)
GlobalSupport: The ratio of patients satisfying the node rule with respect to the total number of patients in the whole cohort
Left.Child.Index: Numerical index of the left child node (0 indicates absence of a left child node)
Right.Child.Index: Numerical index of the right child node (0 indicates absence of a right child node)
currentPoints: The identifiers of the patients in the sub-population under consideration at the iteration time of the node inclusion in the tree
currentFeatures: The CFEs considered at the at the iteration time of the node inclusion in the tree
positivePoints: The identifiers of the patients satisfying the node rule
COLORS: A vector of strings containing hexadecimal color identifiers: one for each node. These are used by the visualisation functions (CELLector.visualiseSearchingSpace, and CELLector.visualiseSearchingSpace_sunBurst

, and can be changed using the CELLector.changeSScolors function.

Author(s)

Hanna Najgebauer and Francesco Iorio

References

[1] Iorio, F. et al. A Landscape of Pharmacogenomic Interactions in Cancer. Cell 166, 740–754 (2016).

[2] Najgebauer, H. et al. Genomics Guided Selection of Cancer in vitro Models.

https://doi.org/10.1101/275032

[3] van Dyk, E., Reinders, M. J. T. & Wessels, L. F. A. A scale-space method for detecting recurrent DNA copy number changes with analytical false discovery rate control. Nucleic Acids Res. 41, e100 (2013).

Examples

data(CELLector.PrimTum.BEMs)
data(CELLector.Pathway_CFEs)
data(CELLector.CFEs.CNAid_mapping)
data(CELLector.CFEs.CNAid_decode)
data(CELLector.HCCancerDrivers)

### Change the following two lines to work with a different cancer type
tumours_BEM<-CELLector.PrimTum.BEMs$COREAD

### unicize the sample identifiers for the tumour data
tumours_BEM<-CELLector.unicizeSamples(tumours_BEM)

### building a CELLector searching space focusing on three pathways
### and TP53 wild-type patients only
CSS<-CELLector.Build_Search_Space(ctumours = t(tumours_BEM),
                                  verbose = FALSE,
                                  minGlobSupp = 0.05,
                                  cancerType = 'COREAD',
                                  pathwayFocused = c("RAS-RAF-MEK-ERK / JNK signaling",
                                                     "PI3K-AKT-MTOR signaling",
                                                     "WNT signaling"),
                                  pathway_CFEs = CELLector.Pathway_CFEs,
                                  cnaIdMap = CELLector.CFEs.CNAid_mapping,
                                  cnaIdDecode = CELLector.CFEs.CNAid_decode,
                                  cdg = CELLector.HCCancerDrivers,
                                  subCohortDefinition='TP53',
                                  NegativeDefinition=TRUE)

### visualising the CELLector searching space as a binary tree
CSS$TreeRoot

### visualising the first attributes of the tree nodes
CSS$navTable[,1:11]

### visualising the sub-cohort of patients whose genome satisfies the rule of the 4th node
str_split(CSS$navTable$positivePoints[4],',')


######################################################################
### Rebuilding the search space but considering also methylation data

### important!!!: second version of primary tumours' genomic dataset
### (including methylation data should be loaded)
data(CELLector.PrimTum.BEMs_v2)

### Change the following two lines to work with a different cancer type
tumours_BEM<-CELLector.PrimTum.BEMs_v2$COREAD

### unicize the sample identifiers for the tumour data
tumours_BEM<-CELLector.unicizeSamples(tumours_BEM)

### loading decoding table for hypermethylation CFE identifiers
data(CELLector.CFEs.HMSid_decode)

### building a CELLector searching space
CSS<-CELLector.Build_Search_Space(ctumours = t(tumours_BEM),
                                  verbose = FALSE,
                                  minGlobSupp = 0.05,
                                  cancerType = 'COREAD',
                                  pathway_CFEs = CELLector.Pathway_CFEs,
                                  cnaIdMap = CELLector.CFEs.CNAid_mapping,
                                  cnaIdDecode = CELLector.CFEs.CNAid_decode,
                                  hmsIdDecode = CELLector.CFEs.HMSid_decode,
                                  cdg = CELLector.HCCancerDrivers,
                                  subCohortDefinition='TP53',
                                  NegativeDefinition=TRUE,
                                  includeHMS = TRUE)

### visualising the CELLector searching space as a binary tree
CSS$TreeRoot

### visualising the first attributes of the tree nodes
CSS$navTable[,1:11]

### visualising the sub-cohort of patients whose genome satisfies the rule of the 4th node
str_split(CSS$navTable$positivePoints[4],',')

najha/CELLector documentation built on Sept. 4, 2024, 10:17 p.m.