#Boilerplate stuff knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
This tutorial covers basics of differential expression analysis with RNA-Seq data. It is adapted from the wonderful tutorial paper by Law et al. RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR
The dataset used is from Sheridan et al. and consists of 3 mouse mammary gland cell populations (basal, luminal progenitor: LP, and mature luminal: ML), each RNA-Seq profiled in triplicate across three batches (different sequencing lanes).
Our goal here will be to identify genes that are differentially expressed in the basal cells compared to the other two cell types. NOTE: we will use a simplified version of this analysis pipeline for practice purposes. Using the 'real deal' isn't too much more difficult. See [OTHER TUTORIAL] for more details
Additional key refs:
We will be using the package edgeR
. To install this, you need to use a slightly different approach than install.packages
. Run the following code to install.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("edgeR")
Create a folder for this project, with a 'data' folder to store the data files you'll use.
Create an Rstudio Project
There are three data files we'll need
rnaseq_counts.csv
: this is a matrix of gene-level read counts. Rows are genes and columns are samples. The genes are labeled by their Entrez IDs, which are unique numeric identifiers for each gene.
rnaseq_sample_info.csv
: contains metadata for each sample.
gene_info
: table with additional information for each gene in the read counts matrix.
Download the files and put them in your project data folder
You'll want to 'fill in the blanks' in each code chunk, and run them in sequence. Remember that you can run all the code in a given chunk by clicking on the green arrow in the top right of the code chunk.
Run this code chunk to load all the packages we'll be using.
library(tidyverse) library(edgeR) #also loads limma library(useful) #helper function corner library(data.table) #for fread library(here)
Use the fread
function to load rnaseq_counts.csv
, and the read_csv
function to load rnaseq_sample_info.csv
and gene_info.csv
.
Remember that if you create an Rstudio 'Project' you can use the here
function to specify the file locations like: here('data', 'my_file.csv')
counts_mat <- fread(here('data', 'rnaseq_counts.csv')) sample_info <- read_csv(here('data', 'rnaseq_sample_info.csv')) gene_info <- read_csv(here('data', 'gene_info.csv'))
Inspect these datasets. Notice that the counts_mat data is loaded as a 'data.table', and the gene IDs are being read as numeric values. Also, R has given this gene ID column a weird name because it was blank in the input file.
head(sample_info) head(gene_info) corner(counts_mat)
We want to fix counts_mat into a matrix of read counts, with the gene IDs as rownames.
First, use the as_tibble
function to convert it into a tibble (rather than data.table, you can ignore these distinctions and just treat this as part of the matrix-loading recipe)
Then, use the column_to_rownames
function to make this Gene ID column into the rownames
Finally use as.matrix
to convert it into a matrix (everything here will work if you dont do this, but it's best practice)
counts_mat <- as_tibble(counts_mat) counts_mat <- column_to_rownames(counts_mat, var = 'V1') counts_mat <- as.matrix(counts_mat)
Let's make sure the data in these 3 variables are properly aligned. The sample names in counts_mat
should be identical to the sample names in the sample_info
table, and should appear in the same order.
Use the all.equal
function to ensure test that this is true.
#make sure that the samples are ordered in the counts mat (by column) in the same order they appear in the rows of the sample info matrix. all.equal(colnames(counts_mat), sample_info$sample)
Same for the gene info. The rownames of the counts matrix should be identical to the entrez gene ids in the gene info table.
Test this with the all.equal
function and note that something isn't quite right. which you can fix.
Hint: the as.character
function will be helpful.
#same for the gene info all.equal(rownames(counts_mat), gene_info$entrezgene_id) #need to convert entrez IDs to strings so they agree with counts_mat rows gene_info$entrezgene_id <- as.character(gene_info$entrezgene_id) all.equal(rownames(counts_mat), gene_info$entrezgene_id)
For genes with very few reads in most/all samples we won't be able to detect any significant differences, and we'll just be hurting our statistical power by performing unnecessary tests, so it's best to exclude them.
First find out how many genes have 0 reads across all samples.
Use the rowSums
function to create a vector that contains the total reads across samples for each gene. Then calculate how many values (genes) are 0.
HINT: remember that R treats logical TRUE as 1 and FALSE as 0 for the purposes of math operations..
tot_reads_per_gene <- rowSums(counts_mat) sum(tot_reads_per_gene == 0)
Let's make a function called get_usable_genes
that identifies the genes in a read count matrix that are sufficiently expressed to include in the analysis. The function should take as input the counts matrix and return a vector of boolean values which indicates whether each gene in the dataset shows sufficient expression.
This function will also have two additional parameters:
- min_cpm
: which is the minimum counts-per-million (cpm) expression level for a gene to be considered expressed in a given sample. A typical threshold for min_cpm
is 1.
- min_expressed_samples
which is the minimum number of samples a gene must be 'expressed' in to be included in analysis.
Make your function take these two input parameters min_cpm
and min_expressed_samples
as inputs, and include default values of 1 and 3 respectively.
get_usable_genes <- function(input_mat, min_cpm = 1, min_expressed_samples = 3) { input_cpm <- cpm(input_mat) #normalize by total counts (to give units of 'counts-per-million') num_expressed_samples <- rowSums(input_cpm >= min_cpm) #number of samples with expression above threshold CPM usable_genes <- num_expressed_samples>=min_expressed_samples #these are genes to include return(usable_genes) }
Use your function to calculate how many genes are expressed (with the default values)?
usable_genes <- get_usable_genes(counts_mat) sum(usable_genes)
How much would the number of expressed genes change if we increased the min_expressed_samples
parameter to 6?
sum(get_usable_genes(counts_mat, min_cpm = 1, min_expressed_samples = 6))
Use your function to find which genes are expressed above min_cpm
in at least 3 but fewer than 6 samples
expressed_in_3 <- get_usable_genes(counts_mat, min_expressed_samples = 3) expressed_in_6 <- get_usable_genes(counts_mat, min_expressed_samples = 6) g3_l6_genes <- rownames(counts_mat)[expressed_in_3 & !expressed_in_6] head(g3_l6_genes)
Now let's use the default values to determine the list of genes we'll include in analysis.
Use this boolean vector returned from your function to create filtered versions of the counts_mat
and gene_info
table called used_counts_mat
and used_gene_info
respectively.
usable_genes <- get_usable_genes(counts_mat) used_counts_mat <- counts_mat[usable_genes,] used_gene_info <- gene_info[usable_genes,]
Now we want to test which genes are differentially expressed in basal samples compared to other cell types.
First, let's make a new column in the sample info table called is_basal
which is true if the sample is group Basal, and FALSE otherwise
sample_info$is_basal <- sample_info$group == 'Basal'
Now let's group all our data into a single compact data object. The DGEList
is a useful tool (from the edgeR
package) for storing RNA-seq datasets. It will take the counts matrix, sample info, and gene info tables as inputs and package them up in a single object.
Make an object called dge
that stores all our data, using the DGEList
function. Remember to provide the used_counts_mat
and used_gene_info
as inputs.
dge <- DGEList(counts = used_counts_mat, samples = sample_info, genes = used_gene_info) print(dim(dge)) # We see that the data has 27k rows and 9 columns head(dge$counts)
(Treating as boilerplate. You'll learn more about this later)
The efit object will contain the info needed to explore our differential expression results.
dge <- calcNormFactors(dge, method = "TMM") #compute size normalization factors mod_matrix <- model.matrix(~is_basal, dge$samples) #define the comparisons we want to make. colnames(mod_matrix)[2] <- 'is_basal' v <- voom(dge, mod_matrix) #this operates on the normalized counts data vfit <- lmFit(v, mod_matrix) #fits the model efit <- eBayes(vfit) #This step allows the model to 'pool' information across genes and gain statistical power
Now we can start looking at the results of the model. This shows the overall numbers of significantly up- and down-regulated genes for each comparison. By default it uses FDR adjusted p-value of 0.05 as the threshold.
Note that the 'intercept' column represents the average expression among non-basal (reference) samples. So, the p-values are not meaningful (they're testing whether the average expression is different from 0)
summary(decideTests(efit))
The topTable
function generates a table of top differentially expressed genes in basal vs non-basal samples. Inspect the table
basal_de <- topTable(efit, coef = 'is_basal', n=Inf, genelist = dge$genes) head(basal_de)
How many genes have logFC magnitude greater than 1 and adjusted p-value less than 0.01? (HINT: you should get 3964 genes)
basal_de_genes <- abs(basal_de$logFC) > 1 & basal_de$adj.P.Val < 0.01 print(table(basal_de_genes))
Make a new column in the basal_de
results table called neg_log_p
which gives the negative log10 p-value.
basal_de$neg_log10_p <- -log10(basal_de$P.Value)
Now make a simple 'volcano' plot of the logFC vs neg-log-p value across genes using the plot
function
plot(basal_de$logFC, basal_de$neg_log10_p)
Create a vector with the names of the top 10 overexpressed genes (highest magnitude logFC value) in basal compared to other cell types.
HINT: the order
function is useful for this.
rank_order <- order(abs(basal_de$logFC), decreasing = TRUE)[1:10] top_oe_genes <- basal_de[rank_order,'mgi_symbol']
Save your table of differential expression results to a 'comma-separated' (csv) file called 'basal_DE_practice_res.csv' in the 'results' folder in your project.
write_csv(basal_de, here('results', 'basal_DE_practice_res.csv'))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.