#boilerplate stuff, sets options for what gets included in the html doc knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
This notebook involves loading several DepMap datasets, finding cell lines which have TP53 mutations, and then finding genes that have the most differential expression and differential dependency between TP53 mutant vs other cell lines.
NB: Here we're only working with small subsets of genes from these datasets to keep the file sizes smaller. The files you'll need are in the data
directory of the project folder, and have names appended with subset
. You can download the corresponding full files from depmap.org/portal/download
As before, it's best to create a project folder that contains this Rmd notebook, along with a 'data' folder where you can put the needed data files.
The notebook provides practice loading files, working with data frames and matrices, joining tables, and making plots.
To start, let's load the required packages. For the exercise, we will just need the tidyverse
, ggrepel
, here
, data.table
, plotly
and DT
packages.
Remember you can do e.g. install.packages('DT')
if you need to install any missing packages (recall that some packages need to be installed by other means, Google is often the fastest way to figure it out).
library(tidyverse) library(ggrepel) library(DT) library(here) library(data.table) library(plotly)
First load the CCLE sample info table sample_info.csv
, which is in the data
directory of the project folder. Inspect the data after loading it either by looking at the first few rows (head
) or with a summary function (like glimpse
)
Remember that the here
function is useful for specifying where data files are when a project.
sample_info <-
Now let's restrict this sample_info table to just a few columns we're interested using select
. Let's say we only want to retain the following columns: DepMap_ID
, cell_line_name
, lineage
, and lineage_subtype
sample_info <- sample_info %>%
Let's include a table of the sample_info in our report. It's easy to make interactive tables in the reports you generate from your markdown files. The function datatable()
from the DT
package takes a data frame as input and produces a nice interactive table. Try it out to create an interactive table from the restricted sample_info table you have.
Now load the table of mutation info (the file CCLE_mutations_subset.csv
in the data folder). In this table each row represents a detected mutation, and there are lots of columns with different types of info (such as which cell line it was detected in, what type of mutation it was, etc.). Check out the file after loading it (always good practice)
CCLE_mutations <-
Now use filter
and pull
to extract the list of cell lines which have a TP53 mutation. For these purposes, let's define TP53 mutant lines as any line that has a mutation which is either marked as deleterious (isDeleterious
column) or is marked as a COSMIC hotspot mutation (isCOSMIChotspot
column).
TP53_mutant_lines <- CCLE_mutations %>%
Add a new column to the sample_info table is_TP53_mutant
which is TRUE
if that cell line had a detected TP53 mutation and FALSE
otherwise.
NOTE: we're ignoring the fact that we don't have mutation data for some lines.
Hint: the %in%
function is helpful here.
sample_info <- sample_info %>%
Make a table called mutant_by_lineage
which has a row for each lineage in the sample_info
table, and contains the following values for each lineage:
num_TP53_mut
: number of TP53 mutant linestot
: total linesfrac_TP53_mut
: fraction of lines that are TP53 mutantHint: If TP53_mutant
is a logical variable, sum(TP53_mutant)
will calculate the number of cases where it's value is TRUE.
Hint2: The n()
function is a helpful to use with group_by
and summarise
to compute the total number of rows from each group.
mutant_by_lineage <- sample_info %>%
Now sort this table from highest to lowest in terms of the fraction of cell lines which are P53 mutant. Only include lineages with at least 5 cell lines total. You might as well pipe the output into the DT::datatable
function to make it a nice interactive table in the report.
Now load the file CCLE_expression_subset.csv
This is a matrix with gene expression values as log2(TPM+1), with genes as columns and cell lines as rows.
We want to make it a matrix with DepMap_IDs as the rownames. Make the first column (with the DepMap_IDs of the cell lines) into the rownames.
HINT: remember the functions column_to_rownames
. It's also good practice to explicitly convert this into a matrix using as.matrix
.
#note the 'V1' here is just an arbitrary placeholder name that R adds in this case since the file doesn't have a name for the first column containing cell line IDs RNAseq_expr <-
Use the provided get_gene_symbols
function to convert the gene names
#extract HUGO symbol from gene names get_gene_symbols <- function(gene_vec) { str_match(gene_vec, '(.+) \\([0-9]+\\)')[,2] }
The function colMeans
will compute the average across rows for each column of a matrix (returning a vector of column-wise averages).
Use this to compute the average expression of each gene.
Then make a new vector which 'slices out' only the avg. expression values of the following genes: TP53
, MDM2
and DNAJC7
Now we want to compute average gene expression values for TP53 mutant vs other cell lines. Let's do this by creating a function (which we can reuse later on the dependency data).
Name this function get_group_mean_diff
, and have it take a matrix and a vector of cell line IDs as input. We then want it to return a tibble with 4 columns:
gene
(gene names)avg_in_group
avg expression for cell lines in the listavg_out_group
avg expression for cell lines not in the listavg_group_diff
the difference in average expression (in-group minus out-group)The initial sketch of this function is given below, and you'll need to fill in the missing pieces.
NOTE: Use the colMeans
function and make sure to include the parameter na.rm=TRUE
so that we ignore any missing data when computing the averages over rows.
Use your function to compute these differential expression stats and name the resulting table exp_group_stats
. This should return a table with the above fields, and a row for each gene in the expression matrix.
get_group_mean_diff <- function(mat, in_group_cell_lines) { in_group_rows <- rownames(mat) %in% in_group_cell_lines df <- tibble(gene = colnames(mat), ) return(df) }
Find the top 5 genes with the largest magnitude expression difference between TP53 mutant and other lines.
Make a scatterplot comparing the average expression of each geen in TP53 mutant lines vs the difference in average expression between these groups.
Extra You can use the geom_label_repel
function to add gene labels for the top differentially expressed genes you identified above.
The plotly
package can be used to easily convert ggplot
plots into interactive vizs. You can add text info to the tooltip by adding an aesthetic called text
. Make text = gene
so that the gene names are shown when you hover over them.
g <- ggplot() ggplotly(g)
Load the Achilles gene effect data (file Achilles_gene_effect_Chronos_submatrix.csv
) and make it a matrix with DepMap_IDs as the rownames, and convert the column names to HUGO gene symbols using get_gene_symbols
.
This matrix represents estimates of the viability effects of knocking out each gene in each cell line using the Chronos algorithm (more negative values represent stronger dependencies)
Achilles_gene_effect <-
Use your get_group_mean_diff
function to compute a table of stats characterizing the differences in average gene dependencies between TP53 mutant and other lines.
Use inner_join
to merge the tables of differential expression and differential dependency stats. You should include only the avg_group_diff
stat per gene, and rename the values to avg_exp_diff
and avg_dep_diff
before combining the tables.
Remember that the inner_join
will restrict the results to genes that appear in both tables (which is all we need if we're going to plot them against each other)
Make a scatterplot comparing the average difference in gene expression and gene dependency between TP53 mutant and other cell lines using your combined table.
bonus if you can use the geom_label_repel function to label the top 4 most differentially dependent genes (by magnitude)
Here's a challenge/bonus problem:
Write a function get_gene_ttest
which takes a gene name, a data matrix, and a list of 'in-group' cell lines, and returns a tibble with one row containing the stats from a two-sample t-test comparing the matrix's data for that gene between the in-group and non-in-group cell lines.
The tidy
function from the broom
package is helpful for converting the outputs of statistical tests like t.test
and lm
into tibbles.
Then apply your function across all genes and compile the results in a table where each row contains the t-test results for a different gene. Try to do this using the map_dfr
function.
Finally, make a volcano plot of the results, comparing the group mean difference for each gene (will be the estimate
column if you use tidy
) against the q-value
, which you can compute from the p-values
using the p.adjust
function.
library(broom) #useful tidy function for converting stats objects into tidy data frames get_gene_ttest <- function(target_gene, mat, in_group_cell_lines) { return() } all_ach_genes <- colnames(Achilles_gene_effect) all_tres <- map_dfr() ggplot(all_tres, )
The table 'Bailey_2018.xlsx' contains supplementary tables from this paper (https://doi.org/10.1016/j.cell.2018.02.060). Table S1 contains a list of 299 cancer driver genes they identified.
Use the function read_excel()
from the readxl
package to load this data into R. Note: the read_excel
function has a parameter sheet
that lets you specify which sheet of the Excel table you want to load.
library(readxl) TableS1 <-
Use the sample_info table above to find cell lines whose lineage is either pancreas
, gastric
or colorectal
.
Find which gene among the list of cancer drive genes in S1 has the highest average expression among this set of cancer cell lines (the lineages identified above).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.