HDCytoData-package: Data package of high-dimensional cytometry datasets

HDCytoData    R Documentation

Description

Data package containing a collection of high-dimensional cytometry datasets saved in SummarizedExperiment and flowSet Bioconductor object formats, hosted on Bioconductor ExperimentHub.

Details

Overview

This package contains a set of publicly available high-dimensional flow cytometry and mass cytometry (CyTOF) datasets, each provided as both a SummarizedExperiment and a flowSet Bioconductor object.

The objects contain the cell-level expression values, as well as row and column metadata. The row metadata includes sample IDs, group IDs, and true cell population labels or cluster labels (where available). The column metadata includes channel names, protein marker names, and protein marker classes (cell type or cell state, plus a separate class for columns that are not protein markers).
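As a rough illustration of this layout, the following sketch builds a small toy SummarizedExperiment by hand, with cells as rows and channels as columns. All values, the metadata column names (sample_id, group_id, population_id, channel_name, marker_name, marker_class), the class labels ("type", "state", "none"), and the assay name "exprs" are assumptions made for this example; check the help file of each dataset for the names it actually uses.

```r
library(SummarizedExperiment)

# Toy expression values (hypothetical): 6 cells (rows) by 3 columns
expr <- matrix(
  c(1.2, 0.4, 3.1, 0.0, 2.2, 5.0,   # CD3
    0.3, 2.8, 0.1, 1.9, 0.7, 0.2,   # pSTAT5
    10, 11, 12, 13, 14, 15),        # Time (not a protein marker)
  nrow = 6,
  dimnames = list(NULL, c("CD3", "pSTAT5", "Time"))
)

# Row metadata: one entry per cell (assumed column names)
row_md <- DataFrame(
  sample_id = factor(rep(c("s1", "s2"), each = 3)),
  group_id = factor(rep(c("ctrl", "stim"), each = 3)),
  population_id = factor(c("T", "B", "T", "B", "T", "unassigned"))
)

# Column metadata: one entry per channel (assumed column names and classes)
col_md <- DataFrame(
  channel_name = c("Er170Di", "Nd150Di", "Time"),
  marker_name = c("CD3", "pSTAT5", "Time"),
  marker_class = factor(c("type", "state", "none"),
                        levels = c("none", "type", "state")),
  row.names = colnames(expr)
)

se <- SummarizedExperiment(
  assays = list(exprs = expr),
  rowData = row_md,
  colData = col_md
)

# Keep only the protein marker columns, e.g. before clustering
is_marker <- colData(se)$marker_class != "none"
marker_expr <- assay(se, "exprs")[, is_marker, drop = FALSE]
```

With this layout, per-cell labels are read from rowData(se) and per-channel annotations from colData(se), so subsetting by marker class does not require hard-coded column indices.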

These datasets have been used in our previous work and publications for benchmarking purposes, e.g. to benchmark clustering algorithms or methods for differential analysis. They are provided here in the SummarizedExperiment and flowSet formats to make them easier to access.

The package contains the following datasets, which can be grouped into datasets useful for benchmarking either (i) clustering algorithms or (ii) methods for differential analysis.

Clustering:

  • Levine_32dim

  • Levine_13dim

  • Samusik_01

  • Samusik_all

  • Nilsson_rare

  • Mosmann_rare

Differential analysis:

  • Krieg_Anti_PD_1

  • Bodenmiller_BCR_XL

Programmatic access to list of datasets

An up-to-date list of all available datasets can also be obtained programmatically using the ExperimentHub accessor functions, as follows. This retrieves a table of metadata from the ExperimentHub database, which includes the ExperimentHub ID, title, and description of each dataset.

library(ExperimentHub) # load ExperimentHub accessor functions
ehub <- ExperimentHub() # create ExperimentHub instance
ehub <- query(ehub, "HDCytoData") # find HDCytoData datasets
md <- as.data.frame(mcols(ehub)) # retrieve metadata table

Additional details

For additional details on each dataset, including references and raw data sources, see the help files for each dataset.

For a short tutorial showing how to load the data objects, see the "HDCytoData package" vignette.

Note that flow and mass cytometry datasets should be transformed prior to performing any downstream analyses, such as clustering. A standard choice is the inverse hyperbolic sine (asinh) transform with the cofactor parameter set to 5 (for mass cytometry data) or 150 (for flow cytometry data).
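A minimal sketch of this transform in base R is shown below. It assumes the common convention asinh(x / cofactor) and uses a small simulated matrix in place of real data; the helper name asinh_transform is hypothetical and is not part of this package. On a real dataset, the transform would normally be applied to the protein marker columns only.

```r
# Toy expression matrix (simulated values): rows are cells, columns are markers
set.seed(1)
expr <- matrix(
  rexp(12, rate = 0.05),
  nrow = 4,
  dimnames = list(NULL, c("CD3", "CD4", "CD8"))
)

# Hypothetical helper: arcsinh transform, assuming the form asinh(x / cofactor)
asinh_transform <- function(x, cofactor = 5) {
  asinh(x / cofactor)
}

expr_cytof <- asinh_transform(expr, cofactor = 5)   # mass cytometry (CyTOF)
expr_flow <- asinh_transform(expr, cofactor = 150)  # flow cytometry
```

The transform is approximately linear for values near zero and approximately logarithmic for large values, and the cofactor controls where the transition between the two regimes occurs.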

The steps to prepare each data object from the raw data files are included in the make-data scripts in the directory inst/scripts.

File sizes are listed in the help files for the datasets. The removeCache function from the ExperimentHub package can be used to clear the local download cache.


lmweber/HDCytoData documentation built on March 19, 2024, 4:41 a.m.