Overview of the tidybulk package

Brings transcriptomics to the tidyverse!

Lifecycle:maturing

Functions/utilities available

Function | Description ------------ | ------------- identify_abundant | identify the abundant genes aggregate_duplicates | Aggregate abundance and annotation of duplicated transcripts in a robust way scale_abundance | Scale (normalise) abundance for RNA sequencing depth reduce_dimensions | Perform dimensionality reduction (PCA, MDS, tSNE) cluster_elements | Labels elements with cluster identity (kmeans, SNN) remove_redundancy | Filter out elements with highly correlated features adjust_abundance | Remove known unwanted variation (Combat) test_differential_abundance | Differential transcript abundance testing (DE) deconvolve_cellularity | Estimated tissue composition (Cibersort or llsr) test_differential_cellularity | Differential cell-type abundance testing keep_variable | Filter for top variable features keep_abundant | Filter out lowly abundant transcripts test_gene_enrichment | Gene enrichment analyses (EGSEA) test_gene_overrepresentation | Gene enrichment on list of transcript names (no rank)

Utilities | Description ------------ | ------------- get_bibliography | Get the bibliography of your workflow tidybulk | add tidybulk attributes to a tibble object tidybulk_SAM_BAM | Convert SAM BAM files into tidybulk tibble pivot_sample | Select sample-wise columns/information pivot_transcript | Select transcript-wise columns/information rotate_dimensions | Rotate two dimensions of a degree ensembl_to_symbol | Add gene symbol from ensembl IDs symbol_to_entrez | Add entrez ID from gene symbol describe_transcript | Add gene description from gene symbol impute_missing_abundance | Impute abundance for missing data points using sample groupings fill_missing_abundance | Fill abundance for missing data points using an arbitrary value

Minimal input data frame

sample | transcript | abundance | annotation ------------ | ------------- | ------------- | ------------- chr or fctr | chr or fctr | integer | ...

Output data frame

sample | transcript | abundance | annotation | new information ------------ | ------------- | ------------- | ------------- | ------------- chr or fctr | chr or fctr | integer | ... | ...

library(knitr)
#library(kableExtra)
knitr::opts_chunk$set(cache = TRUE, warning = FALSE,
                      message = FALSE, cache.lazy = FALSE)
#options(width = 120)
options(pillar.min_title_chars = Inf)


library(tibble)
library(dplyr)
library(magrittr)
library(tidyr)
library(ggplot2)
# library(widyr)
library(rlang)
library(purrr)
library(tidybulk)

my_theme =  
    theme_bw() +
    theme(
        panel.border = element_blank(),
        axis.line = element_line(),
        panel.grid.major = element_line(size = 0.2),
        panel.grid.minor = element_line(size = 0.1),
        text = element_text(size=12),
        legend.position="bottom",
        aspect.ratio=1,
        strip.background = element_blank(),
        axis.title.x  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10)),
        axis.title.y  = element_text(margin = margin(t = 10, r = 10, b = 10, l = 10))
    )

# counts_mini =
#   tidybulk::counts %>%
#   filter(transcript %in% (tidybulk::X_cibersort %>% rownames)) %>%
#   filter(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058", "SRR1740043", "SRR1740067")) %>%
#   mutate(condition = ifelse(sample %in% c("SRR1740034", "SRR1740035", "SRR1740058"), TRUE, FALSE))

# se_mini
se_mini = tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts_mini, sample, transcript, count)
se_breast_tcga_mini = tidybulk:::tidybulk_to_SummarizedExperiment( tidybulk::breast_tcga_mini, sample, ens, `count`)
se.cibersort =
    tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts,  sample ,  transcript, count)
se.norm.batch =
    tidybulk:::tidybulk_to_SummarizedExperiment(tidybulk::counts,  sample ,  transcript, count) %>%
    scale_abundance()

Create tidybulk tibble. It memorises key column names

tt = counts_mini %>% tidybulk(sample, transcript, count)

All tidybulk methods are directly compatible with SummarizedExperiment as well.

Aggregate transcripts

tidybulk provide the aggregate_duplicates function to aggregate duplicated transcripts (e.g., isoforms, ensembl). For example, we often have to convert ensembl symbols to gene/transcript symbol, but in doing so we have to deal with duplicates. aggregate_duplicates takes a tibble and column names (as symbols; for sample, transcript and count) as arguments and returns a tibble with aggregate transcript with the same name. All the rest of the column are appended, and factors and boolean are appended as characters.

tt.aggr =  tt %>% aggregate_duplicates(     aggregation_function = sum )

tt.aggr

All functions are also directly compatible with SummarizedExperiment.

se.aggr =  se_mini %>% aggregate_duplicates(    aggregation_function = sum )

se.aggr

Scale counts

We may want to compensate for sequencing depth, scaling the transcript abundance (e.g., with TMM algorithm, Robinson and Oshlack doi.org/10.1186/gb-2010-11-3-r25). scale_abundance takes a tibble, column names (as symbols; for sample, transcript and count) and a method as arguments and returns a tibble with additional columns with scaled data as <NAME OF COUNT COLUMN>_scaled.

tt.norm =  tt.aggr %>% identify_abundant(factor_of_interest = condition) %>% scale_abundance(method="TMM")

tt.norm %>% select(`count`, count_scaled, .abundant, everything())

We can easily plot the scaled density to check the scaling outcome. On the x axis we have the log scaled counts, on the y axes we have the density, data is grouped by sample and coloured by cell type.

tt.norm %>%
    ggplot(aes(count_scaled + 1, group=sample, color=`Cell type`)) +
    geom_density() +
    scale_x_log10() +
    my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm =  se.aggr %>% identify_abundant(factor_of_interest = condition) %>% scale_abundance(method="TMM")

se.norm

Filter variable transcripts

We may want to identify and filter variable transcripts.

tt.norm.variable = tt.norm %>% keep_variable()

Reduce dimensions

We may want to reduce the dimensions of our data, for example using PCA or MDS algorithms. reduce_dimensions takes a tibble, column names (as symbols; for sample, transcript and count) and a method (e.g., MDS or PCA) as arguments and returns a tibble with additional columns for the reduced dimensions.

MDS (Robinson et al., 10.1093/bioinformatics/btp616)

tt.norm.MDS =  tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3)

tt.norm.MDS %>% select(sample, contains("Dim"), `Cell type`, time ) %>% distinct()

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

tt.norm.MDS %>%
    select(contains("Dim"), sample, `Cell type`) %>%
  distinct() %>%
  GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`))

All functions are also directly compatible with SummarizedExperiment.

se.norm.MDS =  se.norm %>% reduce_dimensions(.abundance = count_scaled, method="MDS", .dims = 3)

se.norm.MDS

PCA

tt.norm.PCA = tt.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" ,  .dims = 3)

tt.norm.PCA %>% select(sample, contains("PC"), `Cell type`, time ) %>% distinct()

On the x and y axes axis we have the reduced dimensions 1 to 3, data is coloured by cell type.

tt.norm.PCA %>%
    select(contains("PC"), sample, `Cell type`) %>%
  distinct() %>%
  GGally::ggpairs(columns = 1:3, ggplot2::aes(colour=`Cell type`))

All functions are also directly compatible with SummarizedExperiment.

se.norm.PCA = se.norm %>% reduce_dimensions(.abundance = count_scaled, method="PCA" ,  .dims = 3)

se.norm.PCA

tSNE

tt_tcga_breast =
    tidybulk::breast_tcga_mini %>%
    tidybulk(sample, ens, `count`)
tt.norm.tSNE =
    tt_tcga_breast %>%
    identify_abundant() %>%
    reduce_dimensions(
        .abundance = count_scaled,
        method = "tSNE",
        top = 500,
        perplexity=10,
        pca_scale =TRUE
    )

tt.norm.tSNE %>%
    select(contains("tSNE", ignore.case = FALSE), sample, Call) %>%
    distinct()

tt.norm.tSNE %>%
    pivot_sample() %>%
    ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm.tSNE =
    se_breast_tcga_mini %>%
    identify_abundant() %>%
    reduce_dimensions(
        .abundance = count_scaled,
        method = "tSNE",
        top = 500,
        perplexity=10,
        pca_scale =TRUE
    )
se.norm.tSNE

Rotate dimensions

We may want to rotate the reduced dimensions (or any two numeric columns really) of our data, of a set angle. rotate_dimensions takes a tibble, column names (as symbols; for sample, transcript and count) and an angle as arguments and returns a tibble with additional columns for the rotated dimensions. The rotated dimensions will be added to the original data set as <NAME OF DIMENSION> rotated <ANGLE> by default, or as specified in the input arguments.

tt.norm.MDS.rotated =
  tt.norm.MDS %>%
    rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample)

Original On the x and y axes axis we have the first two reduced dimensions, data is coloured by cell type.

tt.norm.MDS.rotated %>%
    pivot_sample() %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type` )) +
  geom_point() +
  my_theme

Rotated On the x and y axes axis we have the first two reduced dimensions rotated of 45 degrees, data is coloured by cell type.

tt.norm.MDS.rotated %>%
    pivot_sample() %>%
    ggplot(aes(x=`Dim1 rotated 45`, y=`Dim2 rotated 45`, color=`Cell type` )) +
  geom_point() +
  my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm.MDS %>%
rotate_dimensions(`Dim1`, `Dim2`, rotation_degrees = 45, .element = sample)

Test differential abundance

We may want to test for differential transcription between sample-wise factors of interest (e.g., with edgeR). test_differential_abundance takes a tibble, column names (as symbols; for sample, transcript and count) and a formula representing the desired linear model as arguments and returns a tibble with additional columns for the statistics from the hypothesis test (e.g., log fold change, p-value and false discovery rate).

tt %>% identify_abundant(factor_of_interest = condition) %>%    test_differential_abundance(  ~ condition,  action="only")

All functions are also directly compatible with SummarizedExperiment.

se_mini %>% test_differential_abundance(  ~ condition)

Adjust counts

We may want to adjust counts for (known) unwanted variation. adjust_abundance takes as arguments a tibble, column names (as symbols; for sample, transcript and count) and a formula representing the desired linear model where the first covariate is the factor of interest and the second covariate is the unwanted variation, and returns a tibble with additional columns for the adjusted counts as <COUNT COLUMN>_adjusted. At the moment just an unwanted covariated is allowed at a time.

tt.norm.batch =
    tt.norm %>%

      # Add fake batch and factor of interest
      left_join(
        (.) %>%
            distinct(sample) %>%
            mutate(batch = c(0,1,0,1,1))
      ) %>%
        mutate(factor_of_interest = `Cell type` == "b_cell")
tt.norm.adj =
    tt.norm.batch %>%
      adjust_abundance(
        ~ factor_of_interest + batch,
        .abundance = count_scaled,
        action = "only"
      )

tt.norm.adj

All functions are also directly compatible with SummarizedExperiment.

se.norm.batch %>%
  adjust_abundance(
    ~ factor_of_interest + batch,
    .abundance = count_scaled
  )

Deconvolve Cell type composition

We may want to infer the cell type composition of our samples (with the algorithm Cibersort; Newman et al., 10.1038/nmeth.3337). deconvolve_cellularity takes as arguments a tibble, column names (as symbols; for sample, transcript and count) and returns a tibble with additional columns for the adjusted cell type proportions.

columns truncated

tt.cibersort =
    tt %>%
    deconvolve_cellularity(action="get", cores=1)

tt.cibersort %>% select(sample, contains("cibersort:")) 

With the new annotated data frame, we can plot the distributions of cell types across samples, and compare them with the nominal cell type labels to check for the purity of isolation. On the x axis we have the cell types inferred by Cibersort, on the y axis we have the inferred proportions. The data is facetted and coloured by nominal cell types (annotation given by the researcher after FACS sorting).

tt.cibersort %>%
    gather(`Cell type inferred`, `proportion`, 5:26) %>%
  distinct(sample, `Cell type`, `Cell type inferred`, proportion) %>%
  ggplot(aes(x=`Cell type inferred`, y=proportion, fill=`Cell type`)) +
  geom_boxplot() +
  facet_wrap(~`Cell type`) +
  my_theme +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5), aspect.ratio=1/5)

All functions are also directly compatible with SummarizedExperiment.

se.cibersort %>% deconvolve_cellularity(cores=1)

Cluster samples

We may want to cluster our data (e.g., using k-means sample-wise). cluster_elements takes as arguments a tibble, column names (as symbols; for sample, transcript and count) and returns a tibble with additional columns for the cluster annotation. At the moment only k-means clustering is supported, the plan is to introduce more clustering methods.

k-means

tt.norm.cluster = tt.norm %>%
  cluster_elements(.abundance = count_scaled, method="kmeans",  centers = 2 )

tt.norm.cluster

We can add cluster annotation to the MDS dimesion reduced data set and plot.

 tt.norm.MDS %>%
  cluster_elements(
    .abundance = count_scaled,
    method="kmeans",
    centers = 2,
    action="get"
  ) %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=`cluster kmeans`)) +
  geom_point() +
  my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm %>%
  cluster_elements(.abundance = count_scaled, method="kmeans",  centers = 2 )

SNN

tt.norm.SNN =   tt.norm.tSNE %>%    cluster_elements(.abundance= count_scaled, method = "SNN")

tt.norm.SNN %>%
    pivot_sample()

tt.norm.SNN %>%
    select(contains("tSNE", ignore.case = FALSE), `cluster SNN`, sample, Call) %>%
    gather(source, Call, c("cluster SNN", "Call")) %>%
    distinct() %>%
    ggplot(aes(x = `tSNE1`, y = `tSNE2`, color=Call)) + geom_point() + facet_grid(~source) + my_theme


# Do differential transcription between clusters
tt.norm.SNN %>%
    mutate(factor_of_interest = `cluster SNN` == 3) %>%
    test_differential_abundance(
    ~ factor_of_interest,
    action="only"
   )

All functions are also directly compatible with SummarizedExperiment.

se.norm.tSNE %>%    cluster_elements(.abundance= count_scaled, method = "SNN")

Drop redundant

We may want to remove redundant elements from the original data set (e.g., samples or transcripts), for example if we want to define cell-type specific signatures with low sample redundancy. remove_redundancy takes as arguments a tibble, column names (as symbols; for sample, transcript and count) and returns a tibble dropped recundant elements (e.g., samples). Two redundancy estimation approaches are supported:

Approach 1

tt.norm.non_redundant = tt.norm.MDS %>%  remove_redundancy( method = "correlation" )

We can visualise how the reduced redundancy with the reduced dimentions look like

tt.norm.non_redundant %>%
    pivot_sample() %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) +
  geom_point() +
  my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm.MDS %>%  remove_redundancy( method = "correlation" )

Approach 2

tt.norm.non_redundant =
    tt.norm.MDS %>%
  remove_redundancy(
    method = "reduced_dimensions",
    .element = sample,
    .feature = transcript,
    Dim_a_column = `Dim1`,
    Dim_b_column = `Dim2`
  )

We can visualise MDS reduced dimensions of the samples with the closest pair removed.

tt.norm.non_redundant %>%
    pivot_sample() %>%
    ggplot(aes(x=`Dim1`, y=`Dim2`, color=`Cell type`)) +
  geom_point() +
  my_theme

All functions are also directly compatible with SummarizedExperiment.

se.norm.MDS %>%
remove_redundancy(
    method = "reduced_dimensions",
    .element = sample,
    .feature = transcript,
    Dim_a_column = `Dim1`,
    Dim_b_column = `Dim2`
)

Other useful wrappers

The above wrapper streamline the most common processing of bulk RNA sequencing data. Other useful wrappers are listed above.

From BAM/SAM to tibble of gene counts

We can calculate gene counts (using FeatureCounts; Liao Y et al., 10.1093/nar/gkz114) from a list of BAM/SAM files and format them into a tidy structure (similar to counts).

counts = tidybulk_SAM_BAM(
    file_names,
    genome = "hg38",
    isPairedEnd = TRUE,
    requireBothEndsMapped = TRUE,
    checkFragLength = FALSE,
    useMetaFeatures = TRUE
)

From ensembl IDs to gene symbol IDs

We can add gene symbols from ensembl identifiers. This is useful since different resources use ensembl IDs while others use gene symbol IDs.

counts_ensembl %>% ensembl_to_symbol(ens)

ADD versus GET versus ONLY modes

Every function takes a tidytranscriptomics structured data as input, and (i) with action="add" outputs the new information joint to the original input data frame (default), (ii) with action="get" the new information with the sample or transcript relative informatin depending on what the analysis is about, or (iii) with action="only" just the new information. For example, from this data set

  tt.norm

action="add" (Default) We can add the MDS dimensions to the original data set

  tt.norm %>%
    reduce_dimensions(
        .abundance = count_scaled,
        method="MDS" ,
        .element = sample,
        .feature = transcript,
        .dims = 3,
        action="add"
    )

action="get" We can add the MDS dimensions to the original data set selecting just the sample-wise column

  tt.norm %>%
    reduce_dimensions(
        .abundance = count_scaled,
        method="MDS" ,
        .element = sample,
        .feature = transcript,
        .dims = 3,
        action="get"
    )

action="only" We can get just the MDS dimensions relative to each sample

  tt.norm %>%
    reduce_dimensions(
        .abundance = count_scaled,
        method="MDS" ,
        .element = sample,
        .feature = transcript,
        .dims = 3,
        action="only"
    )

Appendix

sessionInfo()


Try the tidybulk package in your browser

Any scripts or data that you put into this service are public.

tidybulk documentation built on April 7, 2021, 6 p.m.