A user-friendly grammar of bulk RNA sequencing data exploration and processing
library(knitr) #library(kableExtra) knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, cache.lazy = FALSE) #options(width = 120) options(pillar.min_title_chars = Inf) library(tibble) library(dplyr) library(magrittr) library(tidyr) library(ggplot2) library(readr) library(widyr) library(foreach) library(rlang) library(purrr) library(tidysc) my_theme = theme_bw() + theme( panel.border = element_blank(), axis.line = element_line(), panel.grid.major = element_line(size = 0.2), panel.grid.minor = element_line(size = 0.1), text = element_text(size=12), legend.position="bottom", aspect.ratio=1, strip.background = element_blank(), axis.title.x = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)), axis.title.y = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)) ) # counts = # tidysc::counts %>% # filter(transcript %in% (tidysc::X_cibersort %>% rownames)) %>% # filter(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058", "SRR1740043", "SRR1740067")) %>% # mutate(condition = ifelse(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058"), T, F))
tidysc is a collection of wrapper functions for bulk tanscriptomic analyses that follows the "tidy" paradigm. The data structure is a tibble with columns for
counts = tidysc::counts counts
In brief you can: + Going from BAM/SAM to a tidy data frame of counts (FeatureCounts) + Adding gene symbols from ensembl IDs + Aggregating duplicated gene symbols + Adding normalised counts + Adding principal .dims + Adding MDS .dims + Rotating principal component or MDS dimensions + Running differential transcript abunance analyses (edgeR) + Adding batch adjusted counts (Combat) + Eliminating redunant samples and/or genes + Clustering samples and/or genes with kmeans + Adding tissue composition (Cibersort)
transcripts
tidysc provide the aggregate_duplicates
function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. aggregate_duplicates
takes a tibble and column names (as symbols; for sample
, transcript
and count
) as arguments and returns a tibble with aggregate transcript with the same name. All the rest of the column are appended, and factors and boolean are appended as characters.
counts.aggr = counts %>% aggregate_duplicates( .sample = sample, .cell = cell, .transcript = transcript, .abundance = `count`, aggregation_function = sum ) counts.aggr
tt
objecttt = tidysc_long( counts.aggr, .sample = sample, .cell = cell, .transcript = transcript, .abundance = `count`, species = "Human" ) tt
By default, the trabscript abundance is not shown (in order to save memory), but can be extracted for plotting or further analysis
tt %>% extract_abundance( ) %>% select(sample, cell, transcript, count_RNA, everything()) tt
counts
We may want to calculate the normalised counts for library size (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). scale_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a method as arguments and returns a tibble with additional columns with normalised data as <NAME OF COUNT COLUMN> normalised
.
tt.norm = tt %>% scale_abundance(verbose = F) tt.norm %>% extract_abundance(all=T) %>% select(sample, cell, transcript, `count_RNA`, `count_normalised`, everything())
We can easily plot the normalised density to check the normalisation outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.
tt.norm %>% extract_abundance(all=T) %>% gather(normalisation, abundance, c(count_RNA, count_normalised)) %>% ggplot(aes(`abundance` + 1, group=cell, color=sample)) + geom_density(alpha=0.5) + scale_x_log10() + facet_grid(normalisation~sample) + my_theme
dimensions
PCA
tt.norm.PCA = tt.norm %>% reduce_dimensions(method="PCA", .dims = 3) tt.norm.PCA %>% select(sample, contains("PC"), tech ) %>% distinct()
On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.
tt.norm.PCA %>% select(contains("PC"), sample, tech) %>% distinct() %>% GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=tech))
tSNE
tt.norm.tSNE = tt.norm %>% reduce_dimensions(method = "tSNE" ) tt.norm.tSNE %>% select(contains("tSNE", ignore.case = F), sample, everything()) %>% distinct() tt.norm.tSNE %>% select(contains("tSNE", ignore.case = F), sample, tech) %>% distinct() %>% ggplot(aes(x = `tSNE 1`, y = `tSNE 2`, color=tech)) + geom_point() + my_theme
UMAP
tt.norm.UMAP = tt.norm %>% reduce_dimensions(method = "UMAP" ) tt.norm.UMAP %>% select(contains("UMAP", ignore.case = F), sample, everything()) %>% distinct() tt.norm.UMAP %>% select(contains("UMAP", ignore.case = F), sample, tech) %>% distinct() %>% ggplot(aes(x = `UMAP 1`, y = `UMAP 2`, color=tech)) + geom_point() + my_theme
dimensions
We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. rotate_dimensions
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as <NAME OF DIMENSION> rotated <ANGLE>
by default, or as specified in the input arguments.
{r rotate, cache=TRUE} tt.norm.UMAP.rotated = tt.norm.UMAP %>% rotate_dimensions( `UMAP 1`, `UMAP 2`, rotation_degrees = 45 )
Original On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.
{r plot_rotate_1, cache=TRUE} tt.norm.UMAP.rotated %>% distinct(sample, `UMAP 1`,`UMAP 2`, `Cell type`) %>% ggplot(aes(x=`UMAP 1`, y=`UMAP 2`, color=`Cell type` )) + geom_point() + my_theme
Rotated On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.
{r plot_rotate_2, cache=TRUE} tt.norm.UMAP.rotated %>% distinct(sample, `UMAP 1 rotated 45`,`UMAP 2 rotated 45`, `Cell type`) %>% ggplot(aes(x=`UMAP 1 rotated 45`, y=`UMAP 2 rotated 45`, color=`Cell type` )) + geom_point() + my_theme
differential transcription
We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). test_differential_abundance
takes a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).
{r de, cache=TRUE} counts %>% test_differential_abundance( ~ condition, action="get")
counts
We may want to adjust counts
for (known) unwanted variation. adjust_abundance
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as <COUNT COLUMN> adjusted
. At the moment just an unwanted covariated is allowed at a time.
counts.norm.adj = tt.norm %>% adjust_abundance( ~ sample, verbose=F ) counts.norm.adj.UMAP = counts.norm.adj %>% reduce_dimensions(method = "UMAP" ) counts.norm.adj.UMAP %>% select(contains("UMAP", ignore.case = F), sample, tech) %>% distinct() %>% ggplot(aes(x = `UMAP 1`, y = `UMAP 2`, color=tech)) + geom_point() + my_theme
clusters
We may want to cluster our data (e.g., using SNN sample-wise). cluster_elements
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the cluster annotation. At the moment only SNN clustering is supported, the plan is to introduce more clustering methods.
SNN
counts.norm.adj.UMAP.cluster = counts.norm.adj.UMAP %>% cluster_elements() counts.norm.adj.UMAP.cluster
We can add cluster annotation to the MDS dimesion reduced data set and plot.
counts.norm.adj.UMAP.cluster %>% distinct(sample, `UMAP 1`, `UMAP 2`, `cluster`) %>% ggplot(aes(x=`UMAP 1`, y=`UMAP 2`, color=`cluster`)) + geom_point() + my_theme
Cell type composition
We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). deconvolve_cellularity
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble with additional columns for the adjusted cell type proportions.
columns truncated
counts.norm.adj.UMAP.cluster.ct = counts.norm.adj.UMAP.cluster %>% deconvolve_cellularity() counts.norm.adj.UMAP.cluster.ct %>% select(cell, `Cell type Blueprint_Encode`, everything())
With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting).
counts.norm.adj.UMAP.cluster.ct %>% distinct(sample, `UMAP 1`, `UMAP 2`, `Cell type Blueprint_Encode`) %>% ggplot(aes(x=`UMAP 1`, y=`UMAP 2`, color=`Cell type Blueprint_Encode`)) + geom_point() + my_theme
redundant
We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. remove_redundancy
takes as arguments a tibble, column names (as symbols; for sample
, transcript
and count
) and returns a tibble dropped recundant elements (e.g., samples). Two redundancy estimation approaches are supported:
Approach 1
{r drop, cache=TRUE} counts.norm.non_redundant = counts.norm.MDS %>% remove_redundancy( method = "correlation", .element = sample, .feature = transcript, .abundance = `count normalised` )
We can visualise how the reduced redundancy with the reduced dimentions look like
{r plot_drop, cache=TRUE} counts.norm.non_redundant %>% distinct(sample, `Dim 1`, `Dim 2`, `Cell type`) %>% ggplot(aes(x=`Dim 1`, y=`Dim 2`, color=`Cell type`)) + geom_point() + my_theme
Approach 2
{r drop2, cache=TRUE} counts.norm.non_redundant = counts.norm.MDS %>% remove_redundancy( method = "reduced_dimensions", .element = sample, .feature = transcript, Dim_a_column = `Dim 1`, Dim_b_column = `Dim 2` )
We can visualise MDS reduced dimensions of the samples with the closest pair removed.
{r plot_drop2, cache=TRUE} counts.norm.non_redundant %>% distinct(sample, `Dim 1`, `Dim 2`, `Cell type`) %>% ggplot(aes(x=`Dim 1`, y=`Dim 2`, color=`Cell type`)) + geom_point() + my_theme
The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above.
We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts).
counts = tidysc_cell_ranger( dir_names = "=filtered_feature_bc_matrix/", species = "Human" )
Every function takes this structure as input, and outputs either (i) the new information joint to the original input data frame (default), or (ii) just the new information, setting action="add" or action="get" respectively. For example, from this data set
tt.norm
action="add" (Default) We can add the MDS dimensions to the original data set
tt.norm %>% reduce_dimensions( method="PCA" , action="add" )
action="get" We can get just the MDS dimensions relative to each sample
tt.norm %>% reduce_dimensions( method="PCA" , action="get" )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.