Package 'GeneAccord'"

knitr::opts_chunk$set(
  echo = TRUE,
  collapse = TRUE,
  comment = "#>"
)

Detection of clonally exclusive gene or pathway pairs in a cohort of cancer patients

GeneAccord is a statistical framework to examine the combinations of clones that co-exist in tumors [1]. More precisely, the algorithm finds pairs of genes that are mutated in the same tumor but in different clones, i.e. their subclonal mutation profiles are mutually exclusive. We refer to this as clonally exclusive. It means that the mutations occurred in different branches of the tumor phylogeny. The motivation to detect such gene pairs is to uncover potential mechanisms of synergies between co-existing clones in tumors. Our statistical framework assesses whether a pattern of clonal exclusivity occurs more often than expected by chance alone across a cohort of patients. It makes use of a likelihood ratio test to compare the number of observed clonal exclusivities of a gene pair of interest across the cohort to the background distribution of the expected frequencies of clonal exclusivity. The null hypothesis is that the gene pair of interest exhibits the pattern of clonal exclusivity as often as expected by chance when assuming independent occurrence of mutations. The alternative hypothesis is that the pattern occurs at a rate greater than random chance. Multiple testing correction is done using the Benjamini-Hochberg procedure [2]. If a pair is significantly clonally exclusive, it suggests that this specific clone configuration confers a selective advantage, possibly through synergies between the clones with these mutations. The required input data are the mutated gene-to-clone assignments from a cohort of cancer patients, which were obtained by running phylogenetic tree inference methods. Reconstructing the evolutionary history of a tumor and detecting the clones is challenging [3]. For nondeterministic algorithms, repeated tree inference runs may lead to slightly different mutation-to-clone assignments. Therefore, our algorithm was designed to allow the input of multiple gene-to-clone assignments per patient. They may have been generated by repeatedly performing the tree inference, or by sampling from the posterior distribution of trees. The tree inference methods designate the mutations to individual clones. The mutations can then be mapped to genes or pathways. Hence our statistical framework can be applied on the gene level, or on the pathway level to detect clonally exclusive pairs of pathways. If you use GeneAccord in your published work, please cite the paper in which it was introduced [1], where you can also find a more detailed description of the statistical framework.


Installation

You can install the package by typing the following commands in R:

install.packages("devtools")

# to install its dependencies
install.packages(c("biomaRt", "caTools", "dplyr", "ggplot2", "gtools", "ggpubr", "magrittr", "maxLik", "RColorBrewer", "reshape2", "tibble"))

library(devtools)

install_github("cbg-ethz/GeneAccord")
library(GeneAccord)

1. Prepare patient data

The first step consists of reading in and processing the patient data separately for each patient. If it takes too much time, this can be done for all patients in parallel on a cluster. In this example, we read in the data for patient "50". The example data is stored in the external data directory of this package. The phylogenetic tree inference was performed with the R package cloe [4], which was run 20 times with 20 different seeds (seed = 5, 10, 15, ..., 100) to capture the uncertainty in the phylogenetic tree inference. All 20 gene-to-clone assignments for patient 50 can be found as follows.

# defines the path to the external data directory
ext_data_dir <- system.file('extdata', package = 'GeneAccord')

# set the patient id
pat_id <- "50"

# find all csv-files from patient 50, which are in the clonal_genotypes subdirectory, with seeds 5, 10, 15, ..., 100
input_files_50 <- paste(ext_data_dir, "/clonal_genotypes/cloe_seed", seq(5, 100, by = 5), "/", pat_id, ".csv", sep = "")
# 'input_files_50' is now a vector with all 20 csv-files for this patient

So to analyze your own data, you need to set a vector with all gene-to-clone assignments in csv-format for each patient. Such a csv-file may look like this.

head(read.csv(input_files_50[1]))

The first column lists the altered genes or pathways. In this manual, we will use the term genes, even if the altered entities can also be pathways. The columns that come after the gene indicate in which clone the gene is mutated. In this package, data tables or csv-files are stored and used as tibbles. Tibbles are a novel way to represent data frames. Please refer to the R package tibble [5] for more information. Now we read in the patient's gene-to-clone assignments from the collection of tree inferences to obtain it in a tibble format.

clone_tbl_50_all_trees <- create_tbl_tree_collection(input_files_50)

Let's look at the tibble:

clone_tbl_50_all_trees

The column file_name indicates from which csv-file the information comes. The patient identifier is stored in patient_id, and the column altered_entity lists the mutated gene. The gene-to-clone assignments are in the columns clone1, clone2, ..., up to, by default clone7. Even if this patient had fewer clones in the assignment, the additional clone columns make sure that all data from all patients can be stored in the same tibble. The additional clone columns are just always zero, if a patient had fewer clones. If you know that you have more than seven clones at maximum across all patients and trees, the parameter max_num_clones of create_tbl_tree_collection can be set accordingly. The last column in the tibble tree_id indicates from which of the trees the respective gene-to-clone assignment comes. In this tibble, we have 20 different tree identifiers, because we use the gene-to-clone assignments from 20 tree inference runs.

# print all 20 tree id's
unique(as.character(clone_tbl_50_all_trees$tree_id))

The next step is to compute the rate of clonal exclusivity for this patient, and for each of the tree inferences separately.

rates_clon_excl_50 <- compute_rates_clon_excl(clone_tbl_50_all_trees)
head(rates_clon_excl_50)

# average rate for patient 50 across trees
m_50 <- mean(rates_clon_excl_50)

Now, for each gene pair separately, two values are computed: The number of times the gene pair occurs in the collection of trees, and the number of times it was clonally exclusive in these trees. A gene pair does not have to occur in all trees, because sometimes a gene is not assigned to a clone in a tree. Hence, the number of times a pair occurs may be, e.g. 19 of 20 times. And the number of times a pair is clonally exlcusive in the trees of course varies from gene pair to gene pair.

# compute the histogram of how often gene pairs occur in the trees, and how often they are clonally exclusive
hist_of_num_trees_clon_excl_50 <- get_hist_clon_excl(clone_tbl_50_all_trees)
hist_of_num_trees_clon_excl_50[[1]]
hist_of_num_trees_clon_excl_50[[2]]

We observe, for instance, that one gene pair occurs in 19 trees, and is clonally exclusive in 18 of these.

The three steps demonstrated here (create_tbl_tree_collection, compute_rates_clon_excl, and get_hist_clon_excl) for patient 50 have to be done for all patients in the cohort. If you have a high number of patients, you can run it in parallel on a computing cluster. If you choose to run it sequentially, the following chunk of code is an example of how this would need to be done.

# set the patient identifiers
all_pat_ids <- c()
for(i in seq(1, 89)){
  if(i %in% c(3, 34, 51, 53, 58, 65, 87)) {
    next
  } 
  all_pat_ids <- c(all_pat_ids, ifelse(i<10, paste0("0", i), as.character(i)))
}

# this will be the vector with the average rates of clonal exclusivity of each patient
avg_rates_m <- c()

# these will be the two lists with an entry for each patient with
# a histogram of how often pairs occur in the collection of trees
list_of_num_trees_all_pats <- list()
# a histogram of how ofthen pairs were clonally exclusive
list_of_clon_excl_all_pats <- list()
# this is the list of tibbles that will contain the gene-to-clone assignments from all patients from all trees
clone_tbl_all_pats_all_trees_list <- list()


# loop over all patients and prepare the input data
cnt <- 0
for (this_pat in all_pat_ids){
  cnt <- cnt + 1

  # find all csv-files from current patient, which are in the clonal_genotypes subdirectory, with seeds 5, 10, 15, ..., 100
  input_files_this_pat <- paste0(ext_data_dir, "/clonal_genotypes/cloe_seed", seq(5, 100, by = 5), "/", this_pat, ".csv")

  # here, there should be files from 20 different seeds
  stopifnot(length(input_files_this_pat) == 20)

  # generate the tibble containing the gene-to-clone-assignments
  clone_tbl_this_pat_all_trees <- create_tbl_tree_collection(input_files_this_pat)

  # compute the rates of clonal exclusivity
  rates_clon_excl_this_pat <- compute_rates_clon_excl(clone_tbl_this_pat_all_trees)

  # average rate for current patient across trees
  m_this_pat <- mean(rates_clon_excl_this_pat)

  # compute the histogram of how often gene pairs occur in the trees, and how often they are clonally exclusive
  hist_of_num_trees_clon_excl_this_pat <- get_hist_clon_excl(clone_tbl_this_pat_all_trees)

  # save this information into named vectors/lists
  avg_rates_m <- c(avg_rates_m, m_this_pat)
  list_of_num_trees_all_pats[[cnt]] <- hist_of_num_trees_clon_excl_this_pat[[1]]
  list_of_clon_excl_all_pats[[cnt]] <- hist_of_num_trees_clon_excl_this_pat[[2]]
  names(avg_rates_m)[cnt] <- names(list_of_num_trees_all_pats)[cnt] <- names(list_of_clon_excl_all_pats)[cnt] <- this_pat
  clone_tbl_all_pats_all_trees_list[[cnt]] <- clone_tbl_this_pat_all_trees 
}

# Now we aggregate the list of tibbles into one big tibble
clone_tbl_all_pats_all_trees <- do.call("rbind", clone_tbl_all_pats_all_trees_list)

The herewith generated avg_rates_m, list_of_num_trees_all_pats, list_of_clon_excl_all_pats, and clone_tbl_all_pats_all_trees will be used as input to the functions used in steps 2 and 3 of this vignette.


2. Generate null distribution of test statistic

In order to generate the null distribution of the test statistic, the empirical cumulative distribution function (ECDF) of the test statistic is computed. The null hypothesis assumes independence of mutations, and therefore states that the gene pairs occur across the cohort in the clonal exclusivity pattern as often as expected by chance according to the overall rates of clonal exclusivity. The function to compute the ECDF of the test statistic under the null generate_ecdf_test_stat takes as input from all patients the data computed in step 1, namely the average rate of clonal exclusivity, as well as the histogram of how often pairs occur in the tree, and how often they were clonally exclusive. Here, it is assumed that step 1 of this vignette was run for all patients, and that you have the parameters avg_rates_m, list_of_num_trees_all_pats, list_of_clon_excl_all_pats, and clone_tbl_all_pats_all_trees. If not, you can load the precomputed data.

# load the data as generated in step 1 from all 82 patients
data("clone_tbl_all_pats_all_trees")
data("avg_rates_m")
data("list_of_num_trees_all_pats")
data("list_of_clon_excl_all_pats")

Now we can have a look at this data.

# the tibble containing the gene-to-clone assignments from all patients and the whole collection of trees
print(clone_tbl_all_pats_all_trees)
# the average rates of clonal exclusivity from each patient
head(avg_rates_m)
# the histograms of how often pairs were mutated across the collection of trees - here printed for patient 02
list_of_num_trees_all_pats[[2]]
# the histograms of how often pairs were clonally exclusive across the collection of trees - here printed for patient 02
list_of_clon_excl_all_pats[[2]]

For each number of patients, a pair can be mutated in, an ECDF is built separately. With the parameter num_pat_pair_max, it can be decided up to which number to compute it. You can use the function pairs_in_patients_hist to find out what is the maximum number of patients that pairs are mutated in. All unique gene pairs are efficiently computed using functionalities from the R package caTools [6].

pairs_in_patients_hist(clone_tbl_all_pats_all_trees)

In this data set, a total of 5,945 pairs are mutated in just one patient, and 274 pairs are mutated in two patients. GeneAccord will only consider pairs that occur in more than one patient. One pair is mutated in 15 patients, which is the maximum number. So, in practice, num_pat_pair_max = 15 should be chosen. For the vignette, to demonstrate the functionality of the package, we set it to three such that it runs faster.

num_pat_pair_max <- 3

The following command will generate the ECDF of the test statistic under the null for this data set. The number of pairs that are simulated is set to 100. This step may take quite some time.

set.seed(1234)

ecdf_list <- generate_ecdf_test_stat(avg_rates_m, list_of_num_trees_all_pats, list_of_clon_excl_all_pats, num_pat_pair_max, num_pairs_sim = 100)

In practice, it is recommended that the parameter num_pairs_sim is set to a higher value, to get most accurate results. Here, as a demonstration, it was set to 100 such that it runs faster. However, we advice to choose rather a higher number in your analysis, e.g. 100,000. The function then takes much more time, but it could be run over night on a computing cluster. For the purpose of this vignette, the pre-computed ECDF can be loaded, which was run with num_pairs_sim = 100000.

data("ecdf_list")

The distribution of the test statistic under the null will later be used in the clonal exclusivity test to evaluate how unlikely an actually observed test statistic is.


3. Clonal exclusivity test

Now, we're ready to perform the clonal exclusivity test to identify pairs of genes that are mutated in different clones more often than expected. To this end, we call the function GeneAccord. The required input parameters are the tibble containing the gene-to-clone assignments from all patients and all trees, the average rates of clonal exclusivity, and the ECDF's as computed in step 2 of this vignette. This function makes use of functionalities from the R packages gTools [7], and maxLik [8].

# perform the clonal exclusivity test
res_pairs <- GeneAccord(clone_tbl_all_pats_all_trees, avg_rates_m, ecdf_list)

4. Examining and visualizing results

The results tibble now contains all pairs of genes that are mutated together in at least two patients, and whose delta parameter was greater zero, indicating that they tend to be clonally exclusive. See [1] for more details about the parameter delta. We can extract the significant pairs using the pipe command %>% from the R package magrittr [9], and using the function filter from the R package dplyr [10].

suppressMessages(library(magrittr))
suppressMessages(library(dplyr))

# extract all pairs with an adjusted p-value less than 0.05
sig_pairs <- res_pairs %>% filter(qval < 0.05)

# print the top pairs
sig_pairs[order(sig_pairs$pval),]

We notice that the gene identifiers are ensembl gene id's. In order to find out the corresponding gene names, you can use the function map_pairs_to_hgnc_symbols. First, a tibble containing gene id conversions needs to be created. The function to do so, create_ensembl_gene_tbl_hg, makes use of functionalities of the Bioconductor R package biomaRt [11, 12].

# this will create a tibble with ensembl gene id's and corresponding hgnc gene symbols of the human genes
all_genes_tbl <- create_ensembl_gene_tbl_hg()

You can also load it from the pre-computed data of this package.

data("all_genes_tbl")
# get the hgnc symbols of the significant pairs
sig_pairs <- map_pairs_to_hgnc_symbols(sig_pairs, all_genes_tbl)

# print the top pairs - here just showing the hgnc gene symbols and not the ensembl gene id's 
sig_pairs[order(sig_pairs$pval),] %>% select(-entity_A, -entity_B, -num_patients)

The HGNC gene symbols are now displayed in two additional columns. We are also interested in which patients these pairs are mutated and clonally exclusive.

# find out in which patients they occur and are clonally exclusive
sig_pairs <- take_pairs_and_get_patients(clone_tbl_all_pats_all_trees, sig_pairs)

# print the top pairs
sig_pairs[order(sig_pairs$pval),] %>% select(-entity_A, -entity_B, -mle_delta, -test_statistic, -num_patients)

The top pair is mutated in patients 05 and 81, and is clonally exclusive in both of them. The results tibble can be saved as a tab-separated file.

# define output file
sig_pairs_tsv <- "GeneAccord_sig_pairs.tsv"

# save the results into the tsv-file
sig_pairs_table <- write_res_pairs_to_disk(sig_pairs, avg_rates_m, sig_pairs_tsv)
rownames(sig_pairs_table) <- NULL
knitr::kable(sig_pairs_table)
if(file.exists(sig_pairs_tsv)){
  suppressMessages(file.remove(sig_pairs_tsv))
}

The ECDF's of the test statistic under the null hypothesis can also be visualized.

# plot the ECDF of the test statistic under the null
plot_ecdf_test_stat(ecdf_list)

Using functions from the R package dplyr [10], we can extract information such as how many pairs are mutated in how many patients.

res_pairs %>% select(num_patients) %>% group_by(num_patients) %>% tally()

We observe that most pairs are just mutated in two patients, and a few are mutated in three patients. These pairs in the results tibble were those mutated in more than one patient and who tend to be clonally exclusive.

The average rates of clonal exclusivity can also be visualized with the function plot_rates_clon_excl, which makes use of functionalities from the R package ggplot2 [13].

plot_rates_clon_excl(avg_rates_m, clone_tbl_all_pats_all_trees)

The function also returns the average number of clones across the collection of trees for each patient. A clone is only counted if at least one gene is assigned to it.

Finally, the pairs of interest can also be visualized in a heatmap that shows in the respective patient in which clone the gene is mutated. The function heatmap_clones_gene_pat makes use of functionalities from the R packages ggpubr [14], and reshape2 [15].

# extract the patient id's in which the pairs are mutated
pairs_of_interest <- sig_pairs %>% filter(qval < 0.02)
pat_ids_of_interest <- unique(unlist(strsplit(as.character(pairs_of_interest$mutated_in), ";")))

# from the gene-to-clone tibble extract the data for one tree
this_tree_id <- 1
clone_tbl_all_pats_tree1 <- clone_tbl_all_pats_all_trees %>% 
                   filter(tree_id == this_tree_id, patient_id %in% pat_ids_of_interest) %>% 
                   select(-tree_id)

# plot the heatmap of clones and genes for the patients and genes of interest
heatmap_clones_gene_pat(pairs_of_interest, clone_tbl_all_pats_tree1, all_genes_tbl, first_clone_is_N = TRUE)

Pathway level analysis

Especially if you have whole-exome sequencing data, the GeneAccord analysis can also be done on the pathway level. In that case, the mutated genes that are assigned to clones would be mapped to pathways. As an example, we demonstrate here how to convert a tibble that contains the gene-to-clone assignments into a tibble that contains pathway-to-clone assignments using the Reactome pathway data base [16].

data("ensg_reactome_path_map")
clone_tbl_pat01_tree1 <- clone_tbl_all_pats_all_trees %>%
                            filter(patient_id == "01") %>%
                            filter(tree_id == 1) %>%
                            select(-tree_id)

clone_tbl_pat01_tree1_pw <- convert_ensembl_to_reactome_pw_tbl(clone_tbl_pat01_tree1, ensg_reactome_path_map)


clone_tbl_pat01_tree1_pw

Other alternative

The GeneAccord framework also offers the option of testing two-sided, meaning that you test pairs for significance that tend to be in different clones and also pairs that tend to be in the same clones more often than expected. The number of pairs to test with option alternative = 'two.sided' increases a lot, therefore, GeneAccord includes the additional option of testing only specific pre-defined genes of interest.

# this tells GeneAccord that both cases should be tested: 
# pairs that tend to co-occur together in the same clone more often than expected,
# and pairs that tend to be clonally exclusive.
alternative <- "two.sided"

# to limit the number of tests done, the genes that should be
# tested can be pre-specified
# here, the genes VHL, PXDN
genes_of_interest <- c("ENSG00000134086", "ENSG00000130508")

# this tells GeneAccord that we only want to test pairs where both gene A AND gene B
# are from the 'genes of interest', so we're only testing the pair {VHL, PXDN}
AND_OR <- "AND"

# run GeneAccord with these options
res_pairs_ts <- suppressMessages(GeneAccord(clone_tbl_all_pats_all_trees, avg_rates_m, ecdf_list, alternative, genes_of_interest, AND_OR))

res_pairs_ts

The parameter AND_OR can also be set to OR, meaning that all pairs will be tested where at least one of the two genes is in the set of genes of interest.

AND_OR <- "OR"

# Then, to limit the number of tests performed, it is recommended to only specify a small
# number of genes
# here, only SART3
genes_of_interest <- c("ENSG00000075856")

# run GeneAccord testing all pairs that include the gene SART3
res_pairs_ts <- suppressMessages(GeneAccord(clone_tbl_all_pats_all_trees, avg_rates_m, ecdf_list, alternative, genes_of_interest, AND_OR))

res_pairs_ts

Session information

sessionInfo()

References:

[1] Moore AL, Kuipers J, Singer J, Burcklen E, Schraml P, Beisel C, Moch H, Beerenwinkel N. Intra-tumor heterogeneity and clonal exclusivity in renal cell carcinoma. (2018). bioRxiv 305623. doi: https://doi.org/10.1101/305623

[2] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and 26 powerful approach to multiple testing. Journal of the royal statistical society Series B 27 (Methodological). 1995:289-300 %@ 0035-9246.

[3] Beerenwinkel N, Schwarz RF, Gerstung M, Markowetz F. Cancer evolution: mathematical models and computational inference. Syst Biol. 2015;64(1):e1-25.

[4] Marass F, Mouliere F, Yuan K, Rosenfeld N, Markowetz F. A phylogenetic latent feature model for clonal deconvolution. The Annals of Applied Statistics. 2016;10(4):2377-404 %@ 1932-6157.

[5] Kirill Mueller and Hadley Wickham (2018). tibble: Simple Data Frames. R package version 1.4.2. https://CRAN.R-project.org/package=tibble

[6] Jarek Tuszynski (2014). caTools: Tools: moving window statistics, GIF, Base64, ROC AUC, etc.. R package version 1.17.1. https://CRAN.R-project.org/package=caTools

[7] Gregory R. Warnes, Ben Bolker and Thomas Lumley (2015). gtools: Various R Programming Tools. R package version 3.5.0. https://CRAN.R-project.org/package=gtools

[8] Arne Henningsen and Ott Toomet (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.

[9] Stefan Milton Bache and Hadley Wickham (2014). magrittr: A Forward-Pipe Operator for R. R package version 1.5. https://CRAN.R-project.org/package=magrittr

[10] Hadley Wickham, Romain Francois, Lionel Henry and Kirill Mueller (2017). dplyr: A Grammar of Data Manipulation. R package version 0.7.4. https://CRAN.R-project.org/package=dplyr

[11] Durinck S, Moreau Y, Kasprzyk A, Davis S, De Moor B, Brazma A and Huber W (2005). "BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis." Bioinformatics, 21, pp. 3439-3440.

[12] Durinck S, Spellman P, Birney E and Huber W (2009). "Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt." Nature Protocols, 4, pp. 1184-1191.

[13] H. Wickham. ggplot2: Elegant Graphics form Data Analysis. Springer-Verlag New York, 2009.

[14] Alboukadel Kassambara (2017). ggpubr: 'ggplot2' Based Publication Ready Plots. R package version 0.1.6. https://CRAN.R-project.org/package=ggpubr

[15] Hadley Wickham (2007). Reshaping Data with the reshape Package. Journal of Statistical Software, 21(12), 1-20. URL http://www.jstatsoft.org/v21/i12/.

[16] Matthews L, Gopinath G, Gillespie M, Caudy M, Croft D, de Bono B, et al. Reactome knowledgebase of human biological pathways and processes. Nucleic acids research. 2009;37(Database issue):D619-22.



Try the GeneAccord package in your browser

Any scripts or data that you put into this service are public.

GeneAccord documentation built on Nov. 8, 2020, 8:04 p.m.