suppressPackageStartupMessages(library(tidyverse))

tidyomics/tidyomicsBlog_logo

In this first post, we want to introduce the concept of tidy transcriptomics and its principles.

What is tidy transcriptomics?

Tidy transcriptomics is a specific approach to transcriptomic data analysis in R that uses the "tidy" principles proposed by Wickham et al.. Tidy transcriptomics introduces both a tidy data representation and manipulation (for single-cell and bulk) with the packages tidySummarizedExperiment, tidySingleCellExperiment, and tidyseurat and a tidy analysis workflow (for bulk data) with the package tidybulk.

The tidy transcriptomics manifesto

This manifesto lays down the principles of tidy transcriptomics that tidybulk and tidySummarizedExperiment, tidySingleCellExperiment, and tidyseurat are based on. These principles are in line with the tidy tools manifesto.

Integration map

All packages are under active development and are in a maturing lifecycle.

roadmap

There are two parts to the tidy transcriptomics ecosystem: data and analysis framework. So far, for bulk RNA sequencing, both data (tidySummarizedExperiment) and analysis (tidybulk) frameworks are available. In contrast, for single-cell, only data frameworks are available (tidySingleCellExperiment and tidyseurat).

What data frameworks are and what they are not

Data frameworks are not data containers. Data frameworks are data abstractions that display and manipulate your existing containers (i.e. SummarizedExperiment, SingleCellExperiment and Seurat object) in a tidy manner. Therefore there is no such thing as a tidy* object. This has the advantage of allowing you to use tidyverse on transcriptomics data without compromising your existing pipelines. That is if a method works for SummarizedExperiment it works for its tidy representation.

Therefore, the question "can we go from tidySummarizedExperiment to SummarizedExperiment and vice versa" is not relevant, as we never leave SummarizedExperiment in the first place.

Giving a consistent interface despite different data containers

With tidy transcriptomics, we differentiate the data container (tibble, S4 objects) from the user interface (tidy). When we display, manipulate and visualise transcriptomic data, we might not care how the data is stored (tibble or hierarchical-rectangular structures) as long as the interface is consistent.

Which analysis framework can interface with which data framework

Bulk RNA sequencing data

Single-cell RNA sequencing data

Differences between tidy and base transcriptomics syntax

Here we provide few examples of the differences in programming transcriptomics analyses with tidy transcriptomics and the base R alternative. An important aspect is that tidy transcriptomics is complementary to standard workflows. Everything that works with SummarizedExperiment, SingleCellExperiment and Seurat works with their tidy representations.

Tidybulk and tidySummarizedExperiment

This example is taken from the workshop BioC2021.

As tidybulk smoothly integrates with ggplot2 and other tidyverse packages it can save on typing and make plots easier to generate. Compare the code for creating density plots with tidybulk versus standard base R below (standard code adapted from (Law et al. 2016)).

base R using edgeR

# Example code, no need to run
# Prepare data set
library(edgeR)
dgList <- SE2DGEList(airway)
group <- factor(dgList$samples$dex)
keep.exprs <- filterByExpr(dgList, group = group)
dgList <- dgList[keep.exprs, , keep.lib.sizes = FALSE]
nsamples <- ncol(dgList)
logcounts <- log2(dgList$counts)
# Setup graphics
col <- RColorBrewer::brewer.pal(nsamples, "Paired")
par(mfrow = c(1, 2))
# Plot raw counts
plot(density(logcounts[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title(main = "Counts")
for (i in 2:nsamples) {
  den <- density(logcounts[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList$samples$Run, text.col = col, bty = "n")
# Plot scaled counts
dgList_norm <- calcNormFactors(dgList)
lcpm_n <- cpm(dgList_norm, log = TRUE)
plot(density(lcpm_n[, 1]), col = col[1], lwd = 2, ylim = c(0, 0.26), las = 2, main = "", xlab = "")
title("Counts scaled")
for (i in 2:nsamples) {
  den <- density(lcpm_n[, i])
  lines(den$x, den$y, col = col[i], lwd = 2)
}
legend("topright", legend = dgList_norm$samples$Run, text.col = col, bty = "n")

tidybulk

# tidybulk
airway %>%
  keep_abundant(factor_of_interest = dex) %>%
  scale_abundance() %>%
  pivot_longer(cols = c("counts", "counts_scaled"), names_to = "source", values_to = "abundance") %>%
  ggplot(aes(x = abundance + 1, color = sample)) +
  geom_density() +
  facet_wrap(~source) +
  scale_x_log10() +
  custom_theme

SingleCellExperiment and tidySingleCellExperiment

suppressPackageStartupMessages(library(tidySingleCellExperiment))

These examples are taken from the article Interfacing Seurat with the R tidy universe.

Calculating a transcriptional signature.

Base

# Load data
pbmc_SCE = tidySingleCellExperiment::pbmc_small

signature_score_1 =
  pbmc_SCE[c("CD79A", "CD79B"),] %>%
  logcounts() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_2 = 
  pbmc_SCE[c("CD3D", "CCR7"),] %>%
  logcounts() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_1 - signature_score_2

tidySingleCellExperiment

# Load data
pbmc_SCE = tidySingleCellExperiment::pbmc_small

pbmc_SCE %>%
  join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), assay = "logcounts", shape = "wide") %>%
  mutate(signature_score = 
           scales::rescale(CD79A + CD79B, to = c(0,1)) - 
           scales::rescale(CD3D + CCR7, to = c(0,1))
  ) %>%
  select(signature_score, everything())

Cell subsampling for mulisample balanced UMAP representation

Base

splits = 
  colnames(pbmc_SCE) %>%
  split(pbmc_SCE$file  )

min_size = splits %>%
  sapply(length) %>% min()

cell_subset = splits %>%
  lapply(function(x) sample(x, min_size)) %>%
  unlist()

pbmc_SCE[, cell_subset]

tidySingleCellExperiment

pbmc_SCE %>%
  add_count(file  , name = "tot_cells") %>%
  mutate(min_cells = min(tot_cells)) %>%
  group_by(file  ) %>%
  sample_n(min_cells)

tidyseurat

 detach(package:tidySingleCellExperiment, unload=TRUE)

Calculating a transcriptional signature.

Base

suppressPackageStartupMessages(library(Seurat))

# Load data
pbmc_Seurat = tidySingleCellExperiment::pbmc_small %>% as.Seurat() %>% NormalizeData()

signature_score_1 =
  pbmc_Seurat[c("CD79A", "CD79B"),] %>%
  Seurat::GetAssayData() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_2 = 
  pbmc_Seurat[c("CD3D", "CCR7"),] %>%
  Seurat::GetAssayData() %>%
  colSums() %>%
  scales::rescale(to = c(0,1))

signature_score_1 - signature_score_2

tidyseurat

It is the same as for tidySingleCellExperiment

suppressPackageStartupMessages(library(tidyseurat))

pbmc_Seurat %>%
  join_features(c("CD79A", "CD79B", "CD3D", "CCR7"), shape = "wide") %>%
  mutate(signature_score = 
           scales::rescale(CD79A + CD79B, to = c(0,1)) - 
           scales::rescale(CD3D + CCR7, to = c(0,1))
  ) %>%
  select(signature_score, everything())

Cell subsampling for multisample balanced UMAP representation

Base

splits = 
  colnames(pbmc_Seurat) %>%
  split(pbmc_Seurat$file  )

min_size = splits %>%
  sapply(length) %>% min()

cell_subset = splits %>%
  lapply(function(x) sample(x, min_size)) %>%
  unlist()

pbmc_Seurat[, cell_subset]

tidyseurat

It is the same as for tidySingleCellExperiment

pbmc_Seurat %>%
  add_count(file  , name = "tot_cells") %>%
  mutate(min_cells = min(tot_cells)) %>%
  group_by(file  ) %>%
  sample_n(min_cells)

Learning material

We delivered a series of workshops on tidy transcriptomics, which tracks the progress of the ecosystem.

Bioc 2021

ISMB/ECCB 2021

R-Ladies Tunis 2021

Euro Bioc 2020

ABACBS 2020

Bioc Asia 2020

R/Pharma 2020

Bioc 2020

How to cite tidy transcriptomics

tidybulk

Mangiola, S., Molania, R., Dong, R. et al. tidybulk: an R tidy framework for modular transcriptomic data analysis. Genome Biol 22, 42 (2021). https://doi.org/10.1186/s13059-020-02233-7

tidyseurat

Stefano Mangiola, Maria A Doyle, Anthony T Papenfuss, Interfacing Seurat with the R tidy universe, Bioinformatics, 2021, https://doi.org/10.1093/bioinformatics/btab404

tidySummarizedExperiment and tidySingleCellExperiment

You can use tidyseurat citation as introduces the concepts of data abstraction.

{=html} <script src="https://utteranc.es/client.js" repo="stemangiola/stemangiola.github.io" issue-term="pathname" theme="github-light" crossorigin="anonymous" async> </script>



stemangiola/tidyTranscriptomics documentation built on March 14, 2024, 4:24 p.m.