Overview

The HDCytoData package is an extensible resource containing a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) benchmark datasets, formatted as SummarizedExperiment and flowSet Bioconductor objects. The data objects are hosted on Bioconductor's ExperimentHub platform.

Each object contains one or more tables of cell-level expression values, together with all required metadata. Row metadata includes sample IDs, group IDs, patient IDs, reference cell population or cluster labels (where available), and labels identifying 'spiked in' cells (where available). Column metadata includes channel names, protein marker names, and protein marker classes (cell type, cell state, as well as non-protein marker columns).

Note that raw expression values should be transformed prior to any downstream analyses (see below).

Currently, the package includes benchmark datasets used in our previous work to evaluate methods for clustering and differential analyses. The datasets are provided here in SummarizedExperiment and flowSet formats in order to make them easier to access and integrate into R/Bioconductor workflows.

For more details, see our paper describing the HDCytoData package:

Datasets

The package contains the following datasets, which can be grouped into datasets useful for benchmarking methods for (i) clustering, and (ii) differential analyses.

Extensive documentation is available in the help files for the objects. For each dataset, this includes a description of the dataset (e.g. biological context, number of samples and conditions, number of cells, number of reference cell populations, number and classes of protein markers, etc.), as well as an explanation of the object structures, details on accessor functions required to access the expression tables and metadata, and references to original data sources.

File sizes are listed in the help files for the datasets. The removeCache function from the ExperimentHub package can be used to clear the local download cache (see ExperimentHub documentation).

The help files can be accessed by the dataset names, e.g. ?Bodenmiller_BCR_XL or help(Bodenmiller_BCR_XL).

Programmatic access to list of datasets

An up-to-date list of all available datasets can also be obtained programmatically using the ExperimentHub accessor functions, as follows. This retrieves a table of metadata from the ExperimentHub database, which includes information such as the ExperimentHub ID, title, and description for each dataset.

suppressPackageStartupMessages(library(ExperimentHub))

# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
ehub <- query(ehub, "HDCytoData")
ehub

# Retrieve metadata table
md <- as.data.frame(mcols(ehub))

head(md, 2)
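
The metadata table is an ordinary data frame, so it can be filtered with base R. The following is a minimal sketch rather than part of the package: find_datasets is a hypothetical helper, and the toy table stands in for md so that the example runs offline. It assumes that the ExperimentHub IDs are stored as row names and that the titles follow the names of the loading functions; both should be checked against the real table.

```r
# Hypothetical helper (not part of HDCytoData or ExperimentHub):
# return the rows of a metadata table whose title contains a pattern.
find_datasets <- function(md, pattern) {
  md[grepl(pattern, md$title, fixed = TRUE), , drop = FALSE]
}

# Toy stand-in for the table returned by as.data.frame(mcols(ehub)).
# Assumption: IDs are row names and titles match the function names.
md_toy <- data.frame(
  title = c("Bodenmiller_BCR_XL_SE", "Bodenmiller_BCR_XL_flowSet"),
  row.names = c("EH2254", "EH2255"),
  stringsAsFactors = FALSE
)

# Look up the ExperimentHub ID of the 'flowSet' version
hits <- find_datasets(md_toy, "flowSet")
rownames(hits)
```

With the real table, the same helper would be called as find_datasets(md, "Bodenmiller_BCR_XL").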

How to load data

This section shows how to load the datasets, using one of the datasets (Bodenmiller_BCR_XL) as an example.

The datasets can be loaded by either (i) referring to named functions for each dataset, or (ii) creating an ExperimentHub instance and referring to the dataset IDs. Both methods are demonstrated below.

See the help files (e.g. ?Bodenmiller_BCR_XL) for details about the structure of the SummarizedExperiment or flowSet objects.

Load the datasets using named functions:

suppressPackageStartupMessages(library(HDCytoData))

# Load 'SummarizedExperiment' object using named function
Bodenmiller_BCR_XL_SE()

# Load 'flowSet' object using named function
Bodenmiller_BCR_XL_flowSet()

Alternatively, load the datasets by creating an ExperimentHub instance:

# Create ExperimentHub instance
ehub <- ExperimentHub()

# Find HDCytoData datasets
query(ehub, "HDCytoData")

# Load 'SummarizedExperiment' object using dataset ID
ehub[["EH2254"]]

# Load 'flowSet' object using dataset ID
ehub[["EH2255"]]

Using the data

Once the datasets have been loaded from ExperimentHub, they can be used as normal within an R session. For example, using the SummarizedExperiment form of the dataset loaded above:

# Load dataset in 'SummarizedExperiment' format
d_SE <- Bodenmiller_BCR_XL_SE()

# Inspect object
d_SE
length(assays(d_SE))
assay(d_SE)[1:6, 1:6]
rowData(d_SE)
colData(d_SE)
metadata(d_SE)
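
The row and column metadata can be used to summarize and subset the expression table. The sketch below builds a small toy object with made-up values so that it runs without downloading any data. The metadata column names sample_id, marker_name, and marker_class, and the class label "none" for non-protein marker columns, are assumptions made for illustration; see the help files for the names used in each dataset.

```r
suppressPackageStartupMessages(library(SummarizedExperiment))

# Toy object: 8 cells (rows) by 3 columns; all values are made up.
set.seed(1)
col_names <- c("Time", "CD3", "pS6")
exprs <- matrix(rpois(8 * 3, lambda = 20), nrow = 8,
                dimnames = list(NULL, col_names))
toy_SE <- SummarizedExperiment(
  assays = list(exprs = exprs),
  rowData = DataFrame(sample_id = rep(c("sample1", "sample2"), each = 4)),
  colData = DataFrame(marker_name = col_names,
                      marker_class = c("none", "type", "state"),
                      row.names = col_names)
)

# Number of cells per sample, from the row metadata
cells_per_sample <- table(rowData(toy_SE)$sample_id)
cells_per_sample

# Keep only protein marker columns, using the column metadata
is_marker <- colData(toy_SE)$marker_class != "none"
marker_exprs <- assay(toy_SE)[, is_marker, drop = FALSE]
colnames(marker_exprs)
```

Because cells are stored in rows and markers in columns, cell-level annotations come from rowData and marker-level annotations from colData.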

Transformation of raw data

Note that flow and mass cytometry data should be transformed prior to performing any downstream analyses, such as clustering. A standard choice is the inverse hyperbolic sine (asinh) transformation, applied as asinh(x / cofactor), with a cofactor of 5 for mass cytometry (CyTOF) data or 150 for flow cytometry data (see Bendall et al. 2011, Supplementary Figure S2).
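
As an illustration, the sketch below applies the common convention asinh(x / cofactor) to a small matrix of made-up raw values using base R only. The column names and the logical vector is_marker are assumptions for this example; with a real dataset, the protein marker columns would be identified from the column metadata.

```r
# Made-up raw values for 4 cells; 'Time' is not a protein marker.
raw <- matrix(
  c(100, 200, 300, 400,   # Time
    0, 5, 50, 500,        # CD3
    1, 10, 100, 1000),    # pS6
  nrow = 4,
  dimnames = list(NULL, c("Time", "CD3", "pS6"))
)
is_marker <- c(FALSE, TRUE, TRUE)

# Cofactor 5 for mass cytometry (CyTOF); use 150 for flow cytometry
cofactor <- 5

# Transform protein marker columns only; leave other columns unchanged
transformed <- raw
transformed[, is_marker] <- asinh(raw[, is_marker] / cofactor)
transformed
```

Restricting the transformation to protein marker columns keeps columns such as time or event length on their original scale.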

Exploring the data

Interactive visualizations to explore the datasets can be generated from the SummarizedExperiment objects using the iSEE ("Interactive SummarizedExperiment Explorer") package, available from Bioconductor (Soneson, Lun, Marini, and Rue-Albrecht, 2018). iSEE provides a Shiny-based graphical user interface for exploring single-cell datasets stored in the SummarizedExperiment format. For more details, see the iSEE package vignettes.

Contribution guidelines

We welcome contributions or suggestions for new datasets to include in the HDCytoData package. Contribution guidelines are provided in the Contribution guidelines vignette, available from Bioconductor.

Citation

If the HDCytoData package is useful in your work, please cite the following paper:



lmweber/HDCytoData documentation built on March 19, 2024, 4:41 a.m.