In massonix/HCATonsilData: Provide programmatic access to the tonsil cell atlas datasets

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = TRUE
)

Introduction

Palatine tonsils are under constant exposure to antigens via the upper respiratory tract, which makes them a compelling model secondary lymphoid organ (SLO) to study the interplay between innate and adaptive immune cells [@ruddle2009secondary]. Tonsils have a non-keratinizing stratified squamous epithelium, organized into tubular, branched crypts that enlarge the tonsillar surface. Within the crypts, microfold cells (or M cells) sample antigens at their apical membrane. Subsequently, antigen presenting cells (APC), such as dendritic cells (DC), process and present antigens to T cells in the interfollicular or T cell zone. Alternatively, antigens are kept intact by follicular dendritic cells (FDC) in lymphoid follicles, where they are recognized by B cells [@nave2001morphology]. Such recognition triggers the GC reaction, whereby naive B cells (NBC) undergo clonal selection, proliferation, somatic hypermutation, class switch recombination (CSR) and differentiation into long-lived plasma cells (PC) or memory B cells (MBC) [@de2015dynamics].

In the context of the Human Cell Atlas (HCA) [@regev2018human], we have created a taxonomy of 121 cell types and states in a human tonsil. Since the transcriptome is just a snapshot of a cell's state [@wagner2016revealing], we have added other layers to define cell identity: single-cell resolved open chromatin epigenomic landscapes (scATAC-seq and scRNA/ATAC-seq; i.e. Multiome) as well as protein (CITE-seq), adaptive repertoire (single-cell B and T cell receptor sequencing; i.e. scBCR-seq and scTCR-seq) and spatial transcriptomics (ST) profiles.

The HCATonsilData package aims to provide programmatic and modular access to the datasets of the different modalities and cell types of the tonsil atlas. HCATonsilData also documents how the dataset was generated and archived in different repositories, from raw fastq files to processed datasets. It also explains in detail the cell- and sample-level metadata, and allows users to traceback the annotation of each cell type through detailed Glossary.

Installation

HCATonsilData is available in BioConductor and can be installed as follows:

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("HCATonsilData")

Load the necessary packages in R:

library("HCATonsilData")
library("SingleCellExperiment")
library("ExperimentHub")
library("scater")
library("ggplot2")
library("knitr")
library("kableExtra")
library("htmltools")

Overview of the dataset

We obtained a total of 17 human tonsils, which covered three age groups: children (n=6, 3-5 years), young adults (n=8, 19-35 years), and old adults (n=3, 56-65 years). We collected them in a discovery cohort (n=10), which we used to cluster and annotate cell types; and a validation cohort (n=7), which we used to validate the presence, annotation and markers of the discovered cell types. The following table corresponds to the donor-level metadata:

data("donor_metadata")

kable(
  donor_metadata,
  format = "markdown",
  caption = "Donor Metadata",
  align = "c"
) |> kable_styling(full_width = FALSE)

These tonsil samples were processed with different data modalities:

scRNA-seq (10X Chromium v3)
scATAC-seq (10X Chromium)
10X Multiome: joint RNA and ATAC for each cell.
CITE-seq: joint transcriptome + ~200 protein surface markers for each cell.
Spatial transcriptomics

The following heatmap informs about which samples were sequenced with which technology and cohort type:

knitr::include_graphics("tonsil_atlas_cohort.png")

For each data modality, we mapped all raw fastq files with the appropriate flavor of cellranger (e.g. cellranger-atac, spaceranger, etc), and analyzed all datasets within the Seurat ecosystem [@hao2021integrated]. All fastq files are deposited at EGA under accession number EGAS00001006375. The expression and accessibility matrices are deposited in Zenodo under record 6678331. The Seurat objects from which the data of this repository is derived are archived in Zenodo under record 8373756. Finally, the GitHub repository Single-Cell-Genomics-Group-CNAG-CRG/TonsilAtlas documents the full end-to-end analysis, from raw data to final Seurat objects and figures in the paper.

To define the tonsillar cell types, we first aimed to cluster cells using the combined expression data from scRNA-seq, Multiome and CITE-seq datasets. However, we noticed that with CITE-seq we detected 2.75X and 2.98X fewer genes than with scRNA-seq and Multiome, respectively:

knitr::include_graphics("tonsil_atlas_n_detected_genes.png")

Because we did not want to bias the clustering towards the modality that provides less information, we first analyzed scRNA-seq and Multiome together, and used CITE-seq to validate the annotation (see below). However, a recent benchmarking effort showed that scRNA-seq and single-nuclei RNA-seq (such as Multiome) mix very poorly [@mereu2020benchmarking]. Indeed, we observed massive batch effects between both modalities (see manuscript). To overcome it, we found highly variable genes (HVG) for each modality independently. We then intersected both sets of HVG to remove modality-specific variation. Following principal component analysis (PCA), we integrated scRNA-seq and Multiome with Harmony [@korsunsky2019fast]. Harmony scales well to atlas-level datasets and ranks among the best-performing tools in different benchmarks.

These steps are performed using SLOcatoR (see methods of the paper), which allowed 1. to remove batch effects between scRNA-seq and Multiome, 2. to transfer cell type labels from Multiome to scATAC-seq, and from scRNA-seq to CITE-seq. Thus, this strategy allowed us to annotate clusters using information across multiple modalities (RNA expression, protein expression, TF motifs, etc.).

Note: we provide SLOcatoR as an R package to promote reproducibility and best practices in scientific computing, but we refer the users to properly benchmarked tools to annotate their datasets such as Azimuth, scArches, or Symphony.

Within the discovery cohort, we first defined 9 main compartments: (1) NBC and MBC, (2) GC B cells (GCBC), (3) PC, (4) CD4 T cells, (5) cytotoxic cells [CD8 T cells, NK, ILC, double negative T cells (DN)], (6) myeloid cells (DC, macrophages, monocytes, granulocytes, mast cells), (7) FDC, (8) epithelial cells and (9) plasmacytoid dendritic cells (PDC):

knitr::include_graphics("tonsil_atlas_umap_9_compartments.png")

Subsequently, we clustered and annotated cell subtypes within each of the main compartments, giving rise to the 121 cell types and states in the atlas. The following UMAPs inform about the main cell populations and number of cells per assay in the discovery cohort:

knitr::include_graphics("tonsil_atlas_umaps.png")

We then applied SLOcatoR to annotate cells from the validation cohort (1) to confirm the presence, annotation and markers of cell types; (2) to extend the atlas through an integrated validation dataset, and (3) to chart compositional changes in the tonsil during aging. The integrated tonsil atlas represented over 462,000 single-cell transcriptomes:

knitr::include_graphics("tonsil_atlas_umap_discovery_validation_cohort.png")

Finally, we focused again on each of the 9 compartments and retransferred the final annotation level. As an example, here we can see the transferred annotation for CD4 T cells, from discovery to validation cohort:

knitr::include_graphics("tonsil_atlas_umap_discovery_validation_cohort_CD4.png")

Overall, we could validate most of the cell types, as they preserved their marker genes and had a high annotation confidence in the validation cohort. However, (1) transient cell states, (2) underrepresented cell types, or (3) clusters that are at the intersection between two main compartments (and might represent doublets) had low annotation confidences. We provide this interpretation in the Glossary (see below).

Assay types

HCATonsilData provides access to 5 main types of assays: RNA, ATAC, Multiome, CITE-seq and Spatial.

scRNA-seq

We can obtain the SingleCellExperiment object with gene expression (RNA) data as follows:

(sce <- HCATonsilData(assayType = "RNA", cellType = "All"))
table(sce$assay)

This object consists of 377,988 profiled with scRNA-seq (3P) and 84,364 cells profiled with multiome, for a total of 462,352 cells (37,378 genes were quantified across all of these).

We can dowload a SingleCellExperiment object specific to each of the main subpopulations defined at level 1 as follows:

listCellTypes(assayType = "RNA")
(epithelial <- HCATonsilData(assayType = "RNA", cellType = "epithelial"))
data("colors_20230508")
scater::plotUMAP(epithelial, colour_by = "annotation_20230508") +
  ggplot2::scale_color_manual(values = colors_20230508$epithelial) +
  ggplot2::theme(legend.title = ggplot2::element_blank())

As we can see, we can use the same colors as in the publication using the colors_20230508 variable.

We can also download datasets from the preprint (discovery cohort, version 1.0) - we first load a list of matched annotations used throughout the course of the project, used internally by the main HCATonsilData() function:

data("annotations_dictionary")
annotations_dictionary[["dict_20220619_to_20230508"]]

# load also a predefined palette of colors, to match the ones used in the manuscript 
data("colors_20230508")

(epithelial_discovery <- HCATonsilData("RNA", "epithelial", version = "1.0"))
scater::plotUMAP(epithelial, colour_by = "annotation_20230508") +
  ggplot2::scale_color_manual(values = colors_20230508$epithelial) +
  ggplot2::theme(legend.title = ggplot2::element_blank())

Here's a brief explanation of all the variables in the colData slot of the SingleCellExperiment objects:

barcode: the cell barcode. Combination of GEM well and cellranger 10X barcode.
donor_id: the donor-specific identifier, which can be used to retrieve donor-level metadata form the table above.
gem_id: we gave a unique hashtag to each GEM well (10X Chip Channel) in our dataset. This allows to traceback all the metadata for a given cell.
library_name: each GEM well can give rise to multiple Illumina libraries. For example, one Multiome GEM well will give rise to 2 illumina libraries (ATAC and RNA).
assay: 3P (scRNA-seq) or multiome
sex
age
age_group: kid, young adult, old adult
hospital: hospital where the tonsils where obtained Hospital Clinic, CIMA, or Newcastle.
cohort_type: discovery or validation cohort
cause_for_tonsillectomy: condition that led to the tonsillectomy (e.g. tonsillitis, sleep apnea, etc.)
is_hashed: whether we used cell hashing or not.
preservation: whether we processed the sample fresh or frozen.
nCount_RNA, nFeature_RNA: number of counts and features (genes) detected per cell.
pct_mt, pct_ribosomal: percentage of counts that map to mitochondrial (^MT) or ribosomal (^RPS) genes.
pDNN_hashing, pDNN_scrublet, pDNN_union: proportion of doublet nearest neighbors (pDNN) using different doublet annotations.
S.Score, G2M.Score Phase, CC.Difference: outputs of the CellCycleScoring from Seurat.
scrublet_doublet_scores: doublet score obtained with Scrublet
scrublet_predicted_doublet: doublet annotation obtained with Scirpy (doublet or singlet)
doublet_score_scDblFinder: doublet score obtained with scDblFinder, which was run in the validation cohort following the most recent best practices for single-cell analysis.
annotation_level_1: annotation used at our level 1 clustering (9 main compartments).
annotation_level_1_probability: annotation confidence for the level 1 annotation (relevant for validation cohort, as it implied KNN annotation)
annotation_figure_1: annotation used in the figure 1 of our manuscript. This annotation consisted of grouping the final subtypes into main cell types that were distinguishable in the UMAP.
annotation_20220215, annotation_20220619, annotation_20230508: time-stamped annotation for cell types.
annotation_20230508_probability: annotation confidence for the final annotation (relevant for validation cohort, as it implied KNN annotation)
UMAP_1_level_1, UMAP_2_level_1: UMAP1 coordinates of the figure 1B of the article.
annotation_20220215: see above.
UMAP_1_20220215, UMAP_2_20220215: UMAP coordinates used in figures of the preprint for each cell type.

scATAC-seq and Multiome

Since there is not a popular Bioconductor package to analyze or store scATAC-seq data, we point users to the scATAC-seq and Multiome Seurat objects that we created using Signac [@stuart2021single].

Here are the instructions to download the scATAC-seq object (approximately ~9.3 Gb in size):

library("Seurat")
library("Signac")

download_dir = tempdir()

options(timeout = 10000000)
atac_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratATAC.tar.gz"
download.file(
  url = atac_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratATAC.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratATAC.tar.gz"), 
  exdir = download_dir
)
atac_seurat <- readRDS(
  file.path(download_dir, "scATAC-seq/20230911_tonsil_atlas_atac_seurat_obj.rds")
)
atac_seurat

The multiome object contains 68,749 cells that passed both RNA and ATAC QC filters. Here are the instructions to download the Multiome object (approximately ~5.7 Gb in size):

library("Seurat")
library("Signac")

download_dir = tempdir()

options(timeout = 10000000)
multiome_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratMultiome.tar.gz"
download.file(
  url = multiome_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratMultiome.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratMultiome.tar.gz"), 
  exdir = download_dir
)
multiome_seurat <- readRDS(
  file.path(download_dir, "/multiome/20230911_tonsil_atlas_multiome_seurat_obj.rds")
)
multiome_seurat

Note: we may version these two objects in Zenodo to polish the annotation. Therefore, we recommend users to check that this is the most updated version of the object in Zenodo.

To recall peaks or visualize chromatin tracks, it is essential to have access to the fragments file. We modified the original fragments files generated by cellranger-atac or cellranger-arc to include the gem_id as prefix. These files can be downloaded as follows (approximate size of ~26.0 Gb, feel free to grab a coffee in the meanwhile):

download_dir = tempdir()

fragments_url <- "https://zenodo.org/record/8373756/files/fragments_files.tar.gz"
download.file(
  url = fragments_url, 
  destfile = file.path(download_dir, "fragments_files.tar.gz")
)
untar(
  tarfile = file.path(download_dir, "fragments_files.tar.gz"),
  exdir = download_dir
)

Note: it is paramount to update the paths to the fragments files in each ChromatinAssay object using the UpdatePath function.

The chromatin accessibility side of these objects was processed using the general pipeline from Signac. Briefly:

We called peaks using MACS2 (as implemented in the CallPeaks function), generating pseudobulks for each of the main 9 compartments.
We quantified the accessibility for each peak and each cell using the FeatureMatrix function.
We normalized data using the term frequency inverse document frequency (TF-IDF) normalization, as implemented in the RunTFIDF function.
We reduced the dimensionality using latent semantic indexing (RunLSI function)

For Multiome, we additionally computed a weighted KNN graph with the FindMultiModalNeighbors function.

Many variables in the metadata slot are shared with scRNA-seq. Here, we explain the meaning of the variables that are specific to Multiome or scATAC-seq:

nCount_ATAC: total number of counts per cell
nFeature_ATAC: total number of peaks detected per cell
is_facs_sorted: whether that particular sample was FACS-sorted to eliminate dead cells
nucleosome_signal: as explained in Signac, Nucleosome banding pattern: The histogram of DNA fragment sizes (determined from the paired-end sequencing reads) should exhibit a strong nucleosome banding pattern corresponding to the length of DNA wrapped around a single nucleosome. We calculate this per single cell, and quantify the approximate ratio of mononucleosomal to nucleosome-free fragments
nucleosome_percentile: percentile rank of each ratio of mononucleosomal to nucleosome-free fragments per cell (see NucleosomeSignal function)
TSS.enrichment: as explained in Signac, The ENCODE project has defined an ATAC-seq targeting score based on the ratio of fragments centered at the TSS to fragments in TSS-flanking regions (see https://www.encodeproject.org/data-standards/terms/). Poor ATAC-seq experiments typically will have a low TSS enrichment score.
TSS.percentile
RNA.weight: the weight that RNA has in the weighted nearest neighbor graph per cell (multiome-specific)
ATAC.weight: the weight that RNA has in the weighted nearest neighbor graph per cell (multiome-specific)

CITE-seq

The CITE-seq object can be downloaded as follows (approximate size of ~0.4 Gb):

library("Seurat")

download_dir = tempdir()

options(timeout = 10000000)
cite_url <- "https://zenodo.org/record/8373756/files/TonsilAtlasSeuratCITE.tar.gz"
download.file(
  url = cite_url, 
  destfile = file.path(download_dir, "TonsilAtlasSeuratCITE.tar.gz")
)
# Advice: check that the md5sum is the same as the one in Zenodo
untar(
  tarfile = file.path(download_dir, "TonsilAtlasSeuratCITE.tar.gz"), 
  exdir = download_dir
)

cite_seurat <- readRDS(
  file.path(download_dir, "CITE-seq/20220215_tonsil_atlas_cite_seurat_obj.rds")
)
cite_seurat

Together with CITE-seq, we also applied scBCR-seq and scTCR-seq. This data was analyzed with scirpy, [@sturm2020scirpy] and we provide the main output of this analysis (for cytotoxic cells) as a dataframe, which can be imported as follows:

scirpy_df <- read.csv(
  file = file.path(download_dir, "CITE-seq/scirpy_tcr_output.tsv"),
  header = TRUE
)

head(scirpy_df)

Spatial transcriptomics

A r BiocStyle::Biocpkg("SpatialExperiment") of the spatial transcriptomics dataset may be retrieved via assayType="Spatial". The dataset contains 8 tissue slices with ~1,000-3,000 cells each that were profiled on two separated slides, as well as a low-resolution (H&E staining) image for each slice.

library("SpatialExperiment")
(spe <- HCATonsilData(assayType = "Spatial"))

To plot gene expression you can use the ggspavis package:

library("ggspavis")
sub <- spe[, spe$sample_id == "esvq52_nluss5"]
plt <- plotVisium(sub, fill="SELENOP") + 
  scale_fill_gradientn(colors=rev(hcl.colors(9, "Spectral")))
plt$layers[[2]]$aes_params$size <- 1.5
plt$layers[[2]]$aes_params$alpha <- 1
plt$layers[[2]]$aes_params$stroke <- NA
plt

Annotations of the Tonsil Data Atlas

To allow users to traceback the rationale behind each and every of our annotations, we provide a detailed glossary of 121 cell types and states and related functions to get the explanation, markers and references of every annotation. You can access the glossary as a dataframe as follows:

glossary_df <- TonsilData_glossary()
head(glossary_df)

To get the glossary for each cell type with nice printing formatting you can use the TonsilData_cellinfo() function:

TonsilData_cellinfo("Tfr")

Alternatively, you can get a static html with links to markers and articles with TonsilData_cellinfo_html

htmltools::includeHTML(
  TonsilData_cellinfo_html("Tfr", display_plot = TRUE)
)

Although we have put massive effort in annotating tonsillar cell types, cell type annotations are dynamic by nature. New literature or other interpretations of the data can challenge and refine our annotations. To accommodate this, we have developed the updateAnnotation function, which allows to periodically provide newer annotations as extra columns in the colData slot of the SingleCellExperiment objects. If you want to contribute in one of these versions of the upcoming annotations, please open an issue and describe your annotation.

Interoperability with other frameworks

While we provide data in the form of SingleCellExperiment objects, you may want to analyze your data using a different single-cell data container. In future releases, we will strive to make the SingleCellExperiment objects compatible with Seurat v5, which will come out after this release of BioConductor. Alternatively, you may want to obtain AnnData objects to analyze your data in scanpy ecosystem. You can convert and save the data as follows:

library("zellkonverter")
epithelial <- HCATonsilData(assayType = "RNA", cellType = "epithelial")

epi_anndatafile <- file.path(tempdir(), "epithelial.h5ad")

writeH5AD(sce = epithelial, file = epi_anndatafile)

The SingleCellExperiment objects obtained via HCATonsilData() can be explored in detail using e.g. additional Bioconductor packages, such as the iSEE package.

This can be as simple as executing this chunk:

if (require(iSEE)) {
  iSEE(epithelial)
}

Session information {-}

sessionInfo()

References

massonix/HCATonsilData documentation built on April 17, 2025, 3:10 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

massonix/HCATonsilData
Provide programmatic access to the tonsil cell atlas datasets

In massonix/HCATonsilData: Provide programmatic access to the tonsil cell atlas datasets

Introduction

Installation

Overview of the dataset

Assay types

scRNA-seq

scATAC-seq and Multiome

CITE-seq

Spatial transcriptomics

Annotations of the Tonsil Data Atlas

Interoperability with other frameworks

Session information {-}

References

R Package Documentation

Browse R Packages

We want your feedback!

massonix/HCATonsilData Provide programmatic access to the tonsil cell atlas datasets

In massonix/HCATonsilData: Provide programmatic access to the tonsil cell atlas datasets

Introduction

Installation

Overview of the dataset

Assay types

scRNA-seq

scATAC-seq and Multiome

CITE-seq

Spatial transcriptomics

Annotations of the Tonsil Data Atlas

Interoperability with other frameworks

Session information {-}

References

R Package Documentation

Browse R Packages

We want your feedback!

massonix/HCATonsilData
Provide programmatic access to the tonsil cell atlas datasets