knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)

Introduction

Palatine tonsils are under constant exposure to antigens via the upper respiratory tract, which makes them a compelling model secondary lymphoid organ (SLO) to study the interplay between innate and adaptive immune cells [@ruddle2009secondary]. Tonsils have a non-keratinizing stratified squamous epithelium, organized into tubular, branched crypts that enlarge the tonsillar surface. Within the crypts, microfold cells (or M cells) sample antigens at their apical membrane. Subsequently, antigen presenting cells (APC), such as dendritic cells (DC), process and present antigens to T cells in the interfollicular or T cell zone. Alternatively, antigens are kept intact by follicular dendritic cells (FDC) in lymphoid follicles, where they are recognized by B cells [@nave2001morphology]. Such recognition triggers the GC reaction, whereby naive B cells (NBC) undergo clonal selection, proliferation, somatic hypermutation, class switch recombination (CSR) and differentiation into long-lived plasma cells (PC) or memory B cells (MBC) [@de2015dynamics].

In the context of the Human Cell Atlas (HCA) [@regev2018human], we have created a taxonomy of 121 cell types and states in a human tonsil. Since the transcriptome is just a snapshot of a cell's state [@wagner2016revealing], we have added other layers to define cell identity: single-cell resolved open chromatin epigenomic landscapes (scATAC-seq and scRNA/ATAC-seq; i.e. Multiome) as well as protein (CITE-seq), adaptive repertoire (single-cell B and T cell receptor sequencing; i.e. scBCR-seq and scTCR-seq) and spatial transcriptomics (ST) profiles.

The HCATonsilData package aims to provide programmatic and modular access to the datasets of the different modalities and cell types of the tonsil atlas. HCATonsilData also documents how the dataset was generated and archived in different repositories, from raw fastq files to processed datasets. It also explains in detail the cell- and sample-level metadata, and allows users to traceback the annotation of each cell type through detailed Glossary.

Installation

HCATonsilData is available in BioConductor and can be installed as follows:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HCATonsilData")

Load the necessary packages in R:

library("HCATonsilData")
library("SingleCellExperiment")
library("ExperimentHub")
library("scater")
library("ggplot2")
library("knitr")
library("kableExtra")
library("htmltools")

Overview of the dataset

We obtained a total of 17 human tonsils, which covered three age groups: children (n=6, 3-5 years), young adults (n=8, 19-35 years), and old adults (n=3, 56-65 years). We collected them in a discovery cohort (n=10), which we used to cluster and annotate cell types; and a validation cohort (n=7), which we used to validate the presence, annotation and markers of the discovered cell types. The following table corresponds to the donor-level metadata:

data("donor_metadata")

kable(
  donor_metadata,
  format = "markdown",
  caption = "Donor Metadata",
  align = "c"
) |> kable_styling(full_width = FALSE)

These tonsil samples were processed with different data modalities:

The following heatmap informs about which samples were sequenced with which technology and cohort type:

knitr::include_graphics("tonsil_atlas_cohort.png")

For each data modality, we mapped all raw fastq files with the appropriate flavor of cellranger (e.g. cellranger-atac, spaceranger, etc), and analyzed all datasets within the Seurat ecosystem [@hao2021integrated]. All fastq files are deposited at EGA under accession number EGAS00001006375. The expression and accessibility matrices are deposited in Zenodo under record 6678331. The Seurat objects from which the data of this repository is derived are archived in Zenodo under record 8373756. Finally, the GitHub repository Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas documents the full end-to-end analysis, from raw data to final Seurat objects and figures in the paper.

To define the tonsillar cell types, we first aimed to cluster cells using the combined expression data from scRNA-seq, Multiome and CITE-seq datasets. However, we noticed that with CITE-seq we detected 2.75X and 2.98X fewer genes than with scRNA-seq and Multiome, respectively:

knitr::include_graphics("tonsil_atlas_n_detected_genes.png")

Because we did not want to bias the clustering towards the modality that provides less information, we first analyzed scRNA-seq and Multiome together, and used CITE-seq to validate the annotation (see below). However, a recent benchmarking effort showed that scRNA-seq and single-nuclei RNA-seq (such as Multiome) mix very poorly [@mereu2020benchmarking]. Indeed, we observed massive batch effects between both modalities (see manuscript). To overcome it, we found highly variable genes (HVG) for each modality independently. We then intersected both sets of HVG to remove modality-specific variation. Following principal component analysis (PCA), we integrated scRNA-seq and Multiome with Harmony [@korsunsky2019fast]. Harmony scales well to atlas-level datasets and ranks among the best-performing tools in different benchmarks.

These steps are performed using SLOcatoR (see methods of the paper), which allowed 1. to remove batch effects between scRNA-seq and Multiome, 2. to transfer cell type labels from Multiome to scATAC-seq, and from scRNA-seq to CITE-seq. Thus, this strategy allowed us to annotate clusters using information across multiple modalities (RNA expression, protein expression, TF motifs, etc.).

Note: we provide SLOcatoR as an R package to promote reproducibility and best practices in scientific computing, but we refer the users to properly benchmarked tools to annotate their datasets such as Azimuth, scArches, or Symphony.

Within the discovery cohort, we first defined 9 main compartments: (1) NBC and MBC, (2) GC B cells (GCBC), (3) PC, (4) CD4 T cells, (5) cytotoxic cells [CD8 T cells, NK, ILC, double negative T cells (DN)], (6) myeloid cells (DC, macrophages, monocytes, granulocytes, mast cells), (7) FDC, (8) epithelial cells and (9) plasmacytoid dendritic cells (PDC):

knitr::include_graphics("tonsil_atlas_umap_9_compartments.png")

Subsequently, we clustered and annotated cell subtypes within each of the main compartments, giving rise to the 121 cell types and states in the atlas. The following UMAPs inform about the main cell populations and number of cells per assay in the discovery cohort:

knitr::include_graphics("tonsil_atlas_umaps.png")

We then applied SLOcatoR to annotate cells from the validation cohort (1) to confirm the presence, annotation and markers of cell types; (2) to extend the atlas through an integrated validation dataset, and (3) to chart compositional changes in the tonsil during aging. The integrated tonsil atlas represented over 462,000 single-cell transcriptomes:

knitr::include_graphics("tonsil_atlas_umap_discovery_validation_cohort.png")

Finally, we focused again on each of the 9 compartments and retransferred the final annotation level. As an example, here we can see the transferred annotation for CD4 T cells, from discovery to validation cohort:

knitr::include_graphics("tonsil_atlas_umap_discovery_validation_cohort_CD4.png")

Overall, we could validate most of the cell types, as they preserved their marker genes and had a high annotation confidence in the validation cohort. However, (1) transient cell states, (2) underrepresented cell types, or (3) clusters that are at the intersection between two main compartments (and might represent doublets) had low annotation confidences. We provide this interpretation in the Glossary (see below).

Assay types

HCATonsilData provides access to 5 main types of assays: RNA, ATAC, Multiome, CITE-seq and Spatial.

scRNA-seq

We can obtain the SingleCellExperiment object with gene expression (RNA) data as follows:

(sce <- HCATonsilData(assayType = "RNA", cellType = "All"))
table(sce$assay)

This object consists of 377,988 profiled with scRNA-seq (3P) and 84,364 cells profiled with multiome, for a total of 462,352 cells (37,378 genes were quantified across all of these).

We can dowload a SingleCellExperiment object specific to each of the main subpopulations defined at level 1 as follows:

listCellTypes(assayType = "RNA")
(epithelial <- HCATonsilData(assayType = "RNA", cellType = "epithelial"))
data("colors_20230508")
scater::plotUMAP(epithelial, colour_by = "annotation_20230508") +
  ggplot2::scale_color_manual(values = colors_20230508$epithelial) +
  ggplot2::theme(legend.title = ggplot2::element_blank())

As we can see, we can use the same colors as in the publication using the colors_20230508 variable.

We can also download datasets from the preprint (discovery cohort, version 1.0) - we first load a list of matched annotations used throughout the course of the project, used internally by the main HCATonsilData() function:

data("annotations_dictionary")
annotations_dictionary[["dict_20220619_to_20230508"]]

# load also a predefined palette of colors, to match the ones used in the manuscript 
data("colors_20230508")

(epithelial_discovery <- HCATonsilData("RNA", "epithelial", version = "1.0"))
scater::plotUMAP(epithelial, colour_by = "annotation_20230508") +
  ggplot2::scale_color_manual(values = colors_20230508$epithelial) +
  ggplot2::theme(legend.title = ggplot2::element_blank())

Here's a brief explanation of all the variables in the colData slot of the SingleCellExperiment objects:

scATAC-seq and Multiome

Since there is not a popular Bioconductor package to analyze or store scATAC-seq data, we point users to the scATAC-seq and Multiome Seurat objects that we created using Signac [@stuart2021single].

Here are the instructions to download the scATAC-seq object (approximately ~9.3 Gb in size):

library("Seurat")
library("Signac")

download_dir = tempdir()

options(timeout = 10000000)
atac_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratATAC.tar.gz"
download.file(
  url = atac_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratATAC.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratATAC.tar.gz"), 
  exdir = download_dir
)
atac_seurat <- readRDS(
  file.path(download_dir, "scATAC-seq/20230911_tonsil_atlas_atac_seurat_obj.rds")
)
atac_seurat

The multiome object contains 68,749 cells that passed both RNA and ATAC QC filters. Here are the instructions to download the Multiome object (approximately ~5.7 Gb in size):

library("Seurat")
library("Signac")

download_dir = tempdir()

options(timeout = 10000000)
multiome_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratMultiome.tar.gz"
download.file(
  url = multiome_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratMultiome.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratMultiome.tar.gz"), 
  exdir = download_dir
)
multiome_seurat <- readRDS(
  file.path(download_dir, "/multiome/20230911_tonsil_atlas_multiome_seurat_obj.rds")
)
multiome_seurat

Note: we may version these two objects in Zenodo to polish the annotation. Therefore, we recommend users to check that this is the most updated version of the object in Zenodo.

To recall peaks or visualize chromatin tracks, it is essential to have access to the fragments file. We modified the original fragments files generated by cellranger-atac or cellranger-arc to include the gem_id as prefix. These files can be downloaded as follows (approximate size of ~26.0 Gb, feel free to grab a coffee in the meanwhile):

download_dir = tempdir()

fragments_url <- "https://zenodo.org/record/8373756/files/fragments_files.tar.gz"
download.file(
  url = fragments_url, 
  destfile = file.path(download_dir, "fragments_files.tar.gz")
)
untar(
  tarfile = file.path(download_dir, "fragments_files.tar.gz"),
  exdir = download_dir
)

Note: it is paramount to update the paths to the fragments files in each ChromatinAssay object using the UpdatePath function.

The chromatin accessibility side of these objects was processed using the general pipeline from Signac. Briefly:

For Multiome, we additionally computed a weighted KNN graph with the FindMultiModalNeighbors function.

Many variables in the metadata slot are shared with scRNA-seq. Here, we explain the meaning of the variables that are specific to Multiome or scATAC-seq:

CITE-seq

The CITE-seq object can be downloaded as follows (approximate size of ~0.4 Gb):

library("Seurat")

download_dir = tempdir()

options(timeout = 10000000)
cite_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratCITE.tar.gz"
download.file(
  url = cite_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratCITE.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratCITE.tar.gz"), 
  exdir = download_dir
)

cite_seurat <- readRDS(
  file.path(download_dir, "CITE-seq/20220215_tonsil_atlas_cite_seurat_obj.rds")
)
cite_seurat

Together with CITE-seq, we also applied scBCR-seq and scTCR-seq. This data was analyzed with scirpy, [@sturm2020scirpy] and we provide the main output of this analysis (for cytotoxic cells) as a dataframe, which can be imported as follows:

scirpy_df <- read.csv(
  file = file.path(download_dir, "CITE-seq/scirpy_tcr_output.tsv"),
  header = TRUE
)

head(scirpy_df)

Spatial transcriptomics

A r BiocStyle::Biocpkg("SpatialExperiment") of the spatial transcriptomics dataset may be retrieved via assayType="Spatial". The dataset contains 8 tissue slices with ~1,000-3,000 cells each that were profiled on two separated slides, as well as a low-resolution (H&E staining) image for each slice.

library("SpatialExperiment")
(spe <- HCATonsilData(assayType = "Spatial"))

To plot gene expression you can use the ggspavis package:

library("ggspavis")
sub <- spe[, spe$sample_id == "esvq52_nluss5"]
plt <- plotVisium(sub, fill="SELENOP") + 
  scale_fill_gradientn(colors=rev(hcl.colors(9, "Spectral")))
plt$layers[[2]]$aes_params$size <- 1.5
plt$layers[[2]]$aes_params$alpha <- 1
plt$layers[[2]]$aes_params$stroke <- NA
plt

Annotations of the Tonsil Data Atlas

To allow users to traceback the rationale behind each and every of our annotations, we provide a detailed glossary of 121 cell types and states and related functions to get the explanation, markers and references of every annotation. You can access the glossary as a dataframe as follows:

glossary_df <- TonsilData_glossary()
head(glossary_df)

To get the glossary for each cell type with nice printing formatting you can use the TonsilData_cellinfo() function:

TonsilData_cellinfo("Tfr")

Alternatively, you can get a static html with links to markers and articles with TonsilData_cellinfo_html

htmltools::includeHTML(
  TonsilData_cellinfo_html("Tfr", display_plot = TRUE)
)

Although we have put massive effort in annotating tonsillar cell types, cell type annotations are dynamic by nature. New literature or other interpretations of the data can challenge and refine our annotations. To accommodate this, we have developed the updateAnnotation function, which allows to periodically provide newer annotations as extra columns in the colData slot of the SingleCellExperiment objects. If you want to contribute in one of these versions of the upcoming annotations, please open an issue and describe your annotation.

Interoperability with other frameworks

While we provide data in the form of SingleCellExperiment objects, you may want to analyze your data using a different single-cell data container. In future releases, we will strive to make the SingleCellExperiment objects compatible with Seurat v5, which will come out after this release of BioConductor. Alternatively, you may want to obtain AnnData objects to analyze your data in scanpy ecosystem. You can convert and save the data as follows:

library("zellkonverter")
epithelial <- HCATonsilData(assayType = "RNA", cellType = "epithelial")

epi_anndatafile <- file.path(tempdir(), "epithelial.h5ad")

writeH5AD(sce = epithelial, file = epi_anndatafile)

The SingleCellExperiment objects obtained via HCATonsilData() can be explored in detail using e.g. additional Bioconductor packages, such as the iSEE package.

This can be as simple as executing this chunk:

if (require(iSEE)) {
  iSEE(epithelial)
}

Session information {-}

sessionInfo()

References



massonix/HCATonsilData documentation built on May 7, 2024, 8:33 a.m.