knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # Load library library(dplyr) library(forcats) library(ggplot2) library(GRETTA)
Genetic inteRaction and EssenTiality mApper (GRETTA) is an R package that leverages data generated by the Cancer Dependency Map (DepMap) project to perform in-silico genetic knockout screens and map essentiality networks. A manuscript describing this tool is available at bioinformatics (Takemon, Y. and Marra, MA., 2023).
The DepMap data used in this tutorial is version 22Q2. This version along with all versions provided in this repository were downloaded through the DepMap data portal, which was distributed and used under the terms and conditions of CC Attribution 4.0 license.
This repository is maintained by Yuka Takemon, a PhD candidate in Dr. Marco Marra's laboratory at Canada's Michael Smith Genome Sciences Centre.
Please cite the manuscript describing GRETTA on bioinformatics (Takemon, Y. and Marra, MA., 2023)
Yuka Takemon, Marco A Marra, GRETTA: an R package for mapping in silico genetic interaction and essentiality networks, Bioinformatics, Volume 39, Issue 6, June 2023, btad381, https://doi.org/10.1093/bioinformatics/btad381
Please check the FAQ section for additional information and if you cannot find your answer there or have a request please submit an issue.
You can install the GRETTA package from GitHub with:
install.packages(c("devtools", "dplyr","forcats","ggplot2")) devtools::install_github("ytakemon/GRETTA")
DepMap 22Q2 data and the data documentation files are provided above and can be extracted directly in terminal using the following bash code (not in R/RStudio). For other DepMap data versions please refer to the FAQ section.
# Make a new directory/folder called GRETTA_project and go into directory mkdir GRETTA_project cd GRETTA_project # Download data from the web wget https://www.bcgsc.ca/downloads/ytakemon/GRETTA/22Q2/GRETTA_DepMap_22Q2_data.tar.gz # Extract data and data documentation tar -zxvf GRETTA_DepMap_22Q2_data.tar.gz
A singularity container has also been provided and instructions can be found here.
In this example we use DepMap's 2022 data release (22Q2). However, we also provide previous data released in 2020 (v20Q1) and 2021 (v21Q4), which are available at :https://www.bcgsc.ca/downloads/ytakemon/GRETTA/
. We are hoping to make new data sets available as the are released by DepMap.
GRETTA
and download accompanying data.GRETTA
and download accompanying data.ARID1A encodes a member of the chromatin remodeling SWItch/Sucrose Non-Fermentable (SWI/SNF) complex and is a frequently mutated gene in cancer. It is known that ARID1A and its homolog, ARID1B, are synthetic lethal to one another: The dual loss of ARID1A and its homolog, ARID1B, in a cell is lethal; however, the loss of either gene alone is not (Helming et al., 2014). This example will demonstrate how we can identify synthetic lethal interactors of ARID1A using GRETTA
and predict this known interaction.
For this example you will need to call the following libraries. If you they are not installed yet use install.packages()
(eg. install.packages("dplyr")
).
library(dplyr) library(forcats) library(ggplot2) library(GRETTA)
A small data set has been created for this tutorial and can be downloaded using the following code.
path <- getwd() download_example_data(path)
Then, assign variable that point to where the .rda
files are stored and where result files should go.
gretta_data_dir <- paste0(path,"/GRETTA_example/") gretta_output_dir <- paste0(path,"/GRETTA_example_output/")
One way to explore cell lines that are available in DepMap is through their portal. However, there are some simple built-in methods in GRETTA to provide users with a way to glimpse the data using the series of list_available
functions: list_mutations()
, list_cancer_types()
, list_cancer_subtypes()
Current DepMap data used by default is version 22Q2, which contains whole-genome sequencing or whole-exome sequencing annotations for 1771
cancer cell lines (1406
cell lines with RNA-seq data, 375
cell lines with quantitative proteomics data, and 1086
cell lines with CRISPR-Cas9 knockout screen data)
## Find ARID1A hotspot mutations detected in all cell lines list_mutations(gene = "ARID1A", is_hotspot = TRUE, data_dir = gretta_data_dir)
## List all available cancer types list_cancer_types(data_dir = gretta_data_dir) ## List all available cancer subtypes list_cancer_subtypes(input_disease = "Lung Cancer", data_dir = gretta_data_dir)
As default select_cell_lines()
will identify cancer cell lines with loss-of-function alterations in the gene specified and group them into six different groups.
Loss-of-function alterations include variants that are annotated as: "Nonsense_Mutation", "Frame_Shift_Ins", "Splice_Site", "De_novo_Start_OutOfFrame", "Frame_Shift_Del", "Start_Codon_SNP", "Start_Codon_Del",
and "Start_Codon_Ins"
. Copy number alterations are also taken into consideration and group as "Deep_del", "Loss", "Neutral",
or "Amplified"
.
The cell line groups assigned by default are:
Control
cell lines do not harbor any single nucleotide variations (SNVs) or insertions and deletions (InDels) with a neutral copy number (CN).HomDel
cell lines harbor one or more homozygous deleterious SNVs or have deep CN loss.T-HetDel
cell lines harbor two or more heterozygous deleterious SNVs/InDels with neutral or CN loss.HetDel
cell lines harbor one heterozygous deleterious SNV/InDel with neutral CN, or no SNV/InDel with CN loss.Amplified
cell lines harbor no SNVs/InDels with increased CN.Others
cell lines harbor deleterious SNVs with increased CN.ARID1A_groups <- select_cell_lines(input_gene = "ARID1A", data_dir = gretta_data_dir) ## Show number of cell lines in each group count(ARID1A_groups, Group)
There are several additional filters that can be combined together to narrow down your search. These
input_aa_change
- by amino acid change (eg. "p.Q515*").input_disease
- by disease type (eg. "Pancreatic Cancer")input_disease_subtype
- by disease subtype (eg. "Ductal Adenosquamous Carcinoma")## Find pancreatic cancer cell lines with ARID1A mutations ARID1A_pancr_groups <- select_cell_lines(input_gene = "ARID1A", input_disease = "Pancreatic Cancer", data_dir = gretta_data_dir) ## Show number of cell lines in each group count(ARID1A_pancr_groups, Group)
Of the three mutant cancer cell line groups ARID1A_HomDel
, ARID1A_T-HetDel
, and ARID1A_HetDel
, cancer cell lines with ARID1A_HomDel
mutations are most likely to result in a loss or reduced expression of ARID1A. Therefore, we want to check whether cell lines in ARID1A_HomDel
mutant group have significantly less ARID1A RNA or protein expression compared to control cell lines.
## Select only HomDel and Control cell lines ARID1A_groups_subset <- ARID1A_groups %>% filter(Group %in% c("ARID1A_HomDel", "Control")) ## Get RNA expression ARID1A_rna_expr <- extract_rna( input_samples = ARID1A_groups_subset$DepMap_ID, input_genes = "ARID1A", data_dir = gretta_data_dir)
Not all cell lines contain RNA and/or protein expression profiles, and not all proteins were detected by mass spectrometer. (Details on data generation can be found on the DepMap site.)
## Get protein expression ARID1A_prot_expr <- extract_prot( input_samples = ARID1A_groups_subset$DepMap_ID, input_genes = "ARID1A", data_dir = gretta_data_dir) ## Produces an error message since ARID1A protein data is not available
Using Welch's t-test, we can check to see whether ARID1A RNA expression (in TPM) is significantly reduced in ARID1A_HomDel
cell lines compared to Controls
.
## Append groups and test differential expression ARID1A_rna_expr <- left_join( ARID1A_rna_expr, ARID1A_groups_subset %>% select(DepMap_ID, Group)) %>% mutate(Group = fct_relevel(Group,"Control")) # show Control group first ## T-test t.test(ARID1A_8289 ~ Group, ARID1A_rna_expr) ## plot ggplot(ARID1A_rna_expr, aes(x = Group, y = ARID1A_8289)) + geom_boxplot()
After determining cell lines in the ARID1A_HomDel
group has statistically significant reduction in RNA expression compared to Control
cell lines, the next step is to perform a in silico genetic screen using screen_results()
. This uses the dependency probabilities (or "lethality probabilities") generated from DepMap's genome-wide CRISPR-Cas9 knockout screen.
Lethality probabilities range from 0.0 to 1.0 and is quantified for each gene knock out in every cancer cell line screened (There are 18,334 genes targeted in 739 cancer cell lines). A gene knock out with a lethality probability of 0.0 indicates a non-essential for the cell line, and a gene knock out with a 1.0 indicates an essential gene (ie. very lethal). Details can be found in Meyers, R., et al., 2017
At its core, screen_results()
performs multiple Mann-Whitney U tests, comparing lethality probabilities of each targeted gene between mutant and control groups. This generates a data frame with the following columns:
GeneName_ID
- Hugo symbol with NCBI gene IDGeneNames
- Hugo symbol_median, _mean, _sd, _iqr
- Control and mutant group's median, mean, standard deviation (sd), and interquartile range (iqr) of dependency probabilities. Dependency probabilities range from zero to one, where one indicates a essential gene (ie. KO of gene was lethal) and zero indicates a non-essential gene (KO of gene was not lethal)Pval
- P-value from Mann Whitney U test between control and mutant groups.Adj_pval
- BH-adjusted P-value.log2FC_by_median
- Log2 normalized median fold change of dependency probabilities (mutant / control).log2FC_by_mean
- Log2 normalized mean fold change of dependency probabilities (mutant / control).CliffDelta
- Cliff's delta non-parametric effect size between mutant and control dependency probabilities. Ranges between -1 to 1.dip_pval
- Hartigan's dip test p-value. Tests whether distribution of mutant dependency probability is unimodel. If dip test is rejected (p-value < 0.05), this indicates that there is a multimodel dependency probability distribution and that there may be another factor contributing to this separation.Interaction_score
- Combined value generated from signed p-values: -log10(Pval) * sign(log2FC_by_median). Negative scores indicate lethal genetic interaction, and positive scores indicate alleviating genetic interaction.Warning This process may take a few hours depending on the number of cores assigned. Our example below
GI_screen()
took ~2 hours to process. To save time, we have preprocessed this step for you.
ARID1A_mutant_id <- ARID1A_groups %>% filter(Group %in% c("ARID1A_HomDel")) %>% pull(DepMap_ID) ARID1A_control_id <- ARID1A_groups %>% filter(Group %in% c("Control")) %>% pull(DepMap_ID) ## See warning above. ## This can take several hours depending on number of lines/cores used. screen_results <- GI_screen( control_id = ARID1A_control_id, mutant_id = ARID1A_mutant_id, core_num = 5, # depends on how many cores you have output_dir = gretta_output_dir, # Will save your results here as well as in the variable data_dir = gretta_data_dir, test = FALSE) # use TRUE to run a short test to make sure all will run overnight.
## Load prepared ARID1A screen result load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_KO_screen.rda"), envir = environment())
We can quickly determine whether any lethal genetic interactions were predicted by GRETTA
. We use a Pval
cut off of 0.05 and rank based on the Interaction_score
.
screen_results %>% filter(Pval < 0.05) %>% arrange(-Interaction_score) %>% select(GeneNames:Mutant_median, Pval, Interaction_score) %>% head
We immediately see that ARID1B, a known synthetic lethal interaction of ARID1A, was a the top of this list.
To perform a small in silico screen, a list of genes can be provided in the gene_list =
argument.
small_screen_results <- GI_screen( control_id = ARID1A_control_id, mutant_id = ARID1A_mutant_id, gene_list = c("ARID1A", "ARID1B", "SMARCA2", "GAPDH", "SMARCC2"), core_num = 5, # depends on how many cores you have output_dir = gretta_output_dir, # Will save your results here as well as in the variable data_dir = gretta_data_dir)
Finally once the in silico screen is complete, results can be quickly visualized using plot_screen()
. Positive genetic interaction scores indicate potential synthetic lethal genetic interactors, and negative scores indicate potential alleviating genetic interactors. As expected, we identified ARID1B as a synthetic lethal interactor of ARID1A.
## Visualize results, turn on gene labels, ## and label three genes each that are predicted to have ## lethal and alleviating genetic interactions, respectively plot_screen(result_df = screen_results, label_genes = TRUE, label_n = 3)
Perturbing genes that function in same/synergistic pathways or in the same complex are said to show similar fitness effects, and these that show effects are considered to be "co-essential". The strategy of mapping co-essential gene have been used by several studies to attribute functions to previously annotated genes as well as to identify a novel subunit of a large complex (Wainberg et al. 2021; Pan et al. 2018).
Given that ARID1A is known subunit of the mammalian SWI/SNF complex (Mashtalir et al. 2018), we expect that members of the SWI/SNF complex would share co-essentiality with ARID1A. This example will demonstrate how we can map ARID1A's co-essential gene network using GRETTA
.
To determine co-essential genes, we will perform multiple Pearson correlation coefficient analyses between ARID1A KO effects and the KO effects of all 18,333 genes. A cut off will be determined by calculating the inflection point of the ranked coefficient curve. As expected find SWI/SNF subunit encoding genes, SMARCE1 and SMARCB1, as the top two co-essential genes.
Warning This process may take several minutes. Our example below
coessential_map()
+get_inflection_points()
took ~17 minutes to process. To save time we have pre-processed this setp for you.
## Map co-essential genes coess_df <- coessential_map( input_gene = "ARID1A", data_dir = gretta_data_dir, output_dir = gretta_output_dir) ## Calculate inflection points of positive and negative curve using co-essential gene results. coess_inflection_df <- get_inflection_points(coess_df)
# Load pre-processed results load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_coessential_result.rda"), envir = environment()) load(paste0(gretta_data_dir,"/sample_22Q2_ARID1A_coessential_inflection.rda"), envir = environment())
Next, we annotate the data frame containing the co-essential network data and visualize.
## Combine and annotate data frame containing co-essential genes coess_annotated_df <- annotate_df(coess_df, coess_inflection_df) plot_coess( result_df = coess_annotated_df, inflection_df = coess_inflection_df, label_genes = TRUE, # Should gene names be labeled? label_n = 3) # Number of genes to display from each end
We also see that the top ten ARID1A co-essential genes include eight known SWI/SNF subunits, namely ARID1A, SMARCB1, SMARCE1, SMARCC1, SS18, DPF2, SMARCC2, and SMARCD2.
## Show top 10 co-essential genes. coess_annotated_df %>% arrange(Rank) %>% head(10)
Instead of mapping for essentiality across all available cell lines, users can also subset by disease type using the option input_disease = ""
, or within a pre-selected group of cell lines using the option input_cell_lines = c()
. Below we provide an example of how ARID1A essential genes are mapped for pancreatic cancers.
Warning Depending on the number of cell lines that are available after the subsetting step, the inflection point calculation and thresholds may not be optimal. Please use caution when interpreting these results.
## Map co-essential genes in pancreatic cancers only coess_df <- coessential_map( input_gene = "ARID1A", input_disease = "Pancreatic Cancer", core_num = 5, ## Depending on how many cores you have access to, increase this value to shorten processing time. data_dir = gretta_data_dir, output_dir = gretta_output_dir, test = FALSE)
We can also map essentiality across a manually defined list of cell lines using the input_cell_lines = c()
option.
Warning Depending on the number of cell lines provided, the inflection point may not be calculated. Please use caution when interpreting these results.
custom_lines <- c("ACH-000001", "ACH-000002", "ACH-000003",...) coess_df <- coessential_map( input_gene = "ARID1A", input_cell_lines = custom_lines, core_num = 5, ## Depending on how many cores you have access to, increase this value to shorten processing time. data_dir = gretta_data_dir, output_dir = gretta_output_dir, test = FALSE)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.