knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
library(BiocStyle)

Introduction

This vignette describes the procedure to contribute new datasets to the imcdatasets package and contains guidelines for dataset formatting.

Contribution guidelines

Contributions or suggestions for new imaging mass cytometry (IMC) datasets to add to the imcdatasets package are always welcome. New datasets can be suggested by opening an issue at the imcdatasets GitHub page. The only requirements are that the new dataset (i) is publicly available and (ii) has been described in a published scientific article.

Details about creating Bioconductor ExperimentHub packages are available in the Bioconductor documentation on creating hub packages.

Create a data generation script

The first step is to create a new branch at the imcdatasets GitHub page.

Then, create an R markdown (.Rmd) script in inst/scripts/ to generate the data objects:

The .Rmd script must be formatted in the same way as the pre-existing scripts, which can be found in the inst/scripts/ directory of the package. Each step should be clearly and comprehensively documented.

For usability of the package and consistency across datasets, the data objects must be formatted as described in the Dataset format section below.

Update the documentation

Other files in the imcdatasets package should be updated to include the new dataset:

Open a pull request

After these steps have been completed, open a pull request at the imcdatasets GitHub page.

The package maintainers will do the following:

Contributors will be recognized by having their names added to the DESCRIPTION file of the imcdatasets package.

Dataset format

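The imcdatasets package is meant to provide quick and easy access to published and curated IMC datasets. Each dataset consists of three data objects that can be retrieved individually: single cell data (a SingleCellExperiment object named sce), multichannel images (a CytoImageList object named images), and cell segmentation masks (a CytoImageList object named masks).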

The three data objects can be mapped using unique image_name values contained in the metadata of each object.

For consistency across datasets, the guidelines below must be followed when creating a new dataset.

Single cell data

Single cell data should be formatted into a SingleCellExperiment object named sce that contains the following slots:

colData

The colData entry of the SingleCellExperiment object is a DataFrame that contains observation metadata at several levels (e.g., cells, slides, tissues, patients). It is recommended that all column names carry a prefix that indicates the level of observation (e.g., cell_, slide_, tissue_, patient_, tumor_).

The following columns are required:

In addition, colnames(sce) should be set as colData(sce)$cell_id.
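As a minimal sketch of these conventions, the toy example below builds a small SingleCellExperiment with image_name, cell_number and cell_id columns and uses cell_id as column names. The count values, the image names, and the way cell_id is assembled here (image name and cell number joined by an underscore) are assumptions made for illustration, not a format prescribed by the package.

```r
library(SingleCellExperiment)
library(S4Vectors)

# Toy raw counts: 3 markers (rows) x 4 cells (columns); values are made up.
counts <- matrix(
    c(0, 2, 5, 1, 0, 3, 4, 4, 1, 0, 2, 6),
    nrow = 3,
    dimnames = list(c("CD3e", "CD8a", "H3"), NULL)
)

# Cell-level metadata; image names are hypothetical.
cell_metadata <- DataFrame(
    image_name = c("img_A", "img_A", "img_B", "img_B"),
    cell_number = c(1L, 2L, 1L, 2L)
)

# Assumption: a unique cell identifier built from image name and cell number.
cell_metadata$cell_id <- paste(
    cell_metadata$image_name,
    cell_metadata$cell_number,
    sep = "_"
)

sce <- SingleCellExperiment(
    assays = list(counts = counts),
    colData = cell_metadata
)

# Use the cell identifiers as column names.
colnames(sce) <- colData(sce)$cell_id
```

Whatever scheme is chosen for cell_id, the identifiers need to be unique across all images so that each column name refers to exactly one cell.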

rowData

The rowData entry of the SingleCellExperiment is a DataFrame that contains marker (protein, RNA, probe) information.

The following columns are required in the rowData entry:

For the full_name and short_name columns, the following guidelines apply:

marker_names <- data.frame(
    full_name = c(
        "Carbonic anhydrase IX",
        "CD3 epsilon",
        "CD8 alpha",
        "E-Cadherin",
        "cleaved-Caspase3 + cleaved-PARP",
        "Cytokeratin 5",
        "Forkhead box P3",
        "Glucose transporter 1",
        "Histone H3",
        "phospho-Histone H3 [S28]",
        "Ki-67",
        "Myeloperoxidase",
        "Programmed cell death protein 1",
        "Programmed death-ligand 1",
        "phospho-Rb [S807/S811]",
        "Smooth muscle actin",
        "Vimentin",
        "Iridium 191",
        "Iridium 193"
    ),
    short_name = c(
        "CA9",
        "CD3e",
        "CD8a",
        "CDH1",
        "cCASP3_cPARP",
        "KRT5",
        "FOXP3",
        "SLC2A1",
        "H3",
        "p_H3",
        "Ki67",
        "MPO",
        "PD_1",
        "PD_L1",
        "p_Rb",
        "SMA",
        "VIM",
        "DNA1",
        "DNA2"
    )
)
knitr::kable(
    marker_names,
    caption = paste(
        "'full_name' and 'short_name' examples for some commonly",
        "used markers"
    )
)

In addition, rownames(sce) should be set as rowData(sce)$short_name.
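The row naming step could look roughly as follows. This is a sketch on a toy object with three markers; the uniqueness check at the end is an added assumption (it seems sensible because short names are later matched against image channel names), not a documented requirement.

```r
library(SingleCellExperiment)
library(S4Vectors)

# Marker annotation for a toy panel of three markers.
marker_info <- DataFrame(
    full_name = c("CD3 epsilon", "CD8 alpha", "Histone H3"),
    short_name = c("CD3e", "CD8a", "H3")
)

# Placeholder counts: one row per marker, two cells.
counts <- matrix(0, nrow = nrow(marker_info), ncol = 2)

sce <- SingleCellExperiment(
    assays = list(counts = counts),
    rowData = marker_info
)

# Use the marker short names as row names.
rownames(sce) <- rowData(sce)$short_name

# Assumed sanity check: short names should identify markers unambiguously.
stopifnot(!anyDuplicated(rownames(sce)))
```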

assays

The assays slot of the SingleCellExperiment contains counts matrices representing marker expression levels per cell and channel.

It should at least contain a counts matrix with raw ion counts. The assays slot can also contain additional matrices with commonly used counts transformations, or counts transformations that were used in the publication that describes the dataset. All counts transformations must be documented in the .R function used to load the dataset. Common examples include:
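One transformation that is widely used for cytometry data is the inverse hyperbolic sine (asinh). The sketch below stores raw counts and an asinh-transformed copy side by side; the assay name exprs and the cofactor value are assumptions chosen for illustration and should be replaced by whatever the dataset publication used.

```r
library(SingleCellExperiment)

# Toy raw ion counts: 2 markers (rows) x 3 cells (columns); values are made up.
raw_counts <- matrix(
    c(0, 1, 10, 100, 5, 50),
    nrow = 2,
    dimnames = list(c("CD3e", "H3"), c("cell_1", "cell_2", "cell_3"))
)

sce <- SingleCellExperiment(assays = list(counts = raw_counts))

# Hypothetical cofactor; the appropriate value depends on the dataset.
cofactor <- 1

# Store the transformed values as an additional assay next to the raw counts.
assay(sce, "exprs") <- asinh(counts(sce) / cofactor)

assayNames(sce)
```

Keeping the raw counts untouched and adding each transformation under its own assay name lets users reproduce or replace the transformation later.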

colPairs

Neighborhood information, such as a list of cells that are located next to each other, can be stored as a SelfHits object in the colPairs slot of the SingleCellExperiment object.
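A minimal sketch of storing such pairs is shown below. The neighbor pairs and the entry name neighborhood are hypothetical; in practice the pairs would come from the segmentation or spatial analysis of the dataset.

```r
library(SingleCellExperiment)
library(S4Vectors)

# Toy object with four cells.
sce <- SingleCellExperiment(
    assays = list(counts = matrix(1, nrow = 2, ncol = 4))
)

# Hypothetical neighbor pairs, given as column indices of sce:
# cell 1 touches cell 2, and cell 2 touches cell 3 (stored in both directions).
neighbors <- SelfHits(
    from = c(1L, 2L, 2L, 3L),
    to = c(2L, 1L, 3L, 2L),
    nnode = ncol(sce)
)

# Store the pairs under an illustrative name.
colPair(sce, "neighborhood") <- neighbors

colPairNames(sce)
```

The nnode argument must equal the number of cells in the object, since the indices in from and to refer to columns of sce.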

Images and masks

Images

Multichannel images are stored in a CytoImageList object named images.

Channel names of the images object (channelNames(images)) must map to rownames(sce) (marker short names).

The metadata slot (mcols(images)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(masks). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.

Masks

Cell segmentation masks are stored in a CytoImageList object named masks.

The values of the masks should be integers mapping to the cell_number column of colData(sce). This information is used by cytomapper to associate single cell data and cell segmentation masks.

The metadata slot (mcols(masks)) must contain an image_name column that maps to the image_name column of colData(sce), and to the image_name column of mcols(images). This information is used by cytomapper to associate multichannel images, cell segmentation masks, and single cell data.
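The mapping rules for the three objects can be verified before submitting a dataset. The sketch below defines a hypothetical helper, check_imc_dataset(), which is not part of imcdatasets or cytomapper, and runs it on toy objects. It assumes that the value 0 in a mask denotes background, which is a common convention but is not stated in this vignette.

```r
library(cytomapper)
library(EBImage)
library(SingleCellExperiment)
library(S4Vectors)

# Hypothetical helper: stops with an error if the objects do not map.
check_imc_dataset <- function(sce, images, masks) {
    stopifnot(
        # Channel names must match the marker short names.
        identical(channelNames(images), rownames(sce)),
        # Images and masks must describe the same set of images.
        setequal(mcols(images)$image_name, mcols(masks)$image_name),
        # Every cell must belong to an image that has a mask.
        all(sce$image_name %in% mcols(masks)$image_name)
    )
    for (i in seq_along(masks)) {
        img_name <- mcols(masks)$image_name[i]
        # Assumption: 0 is background and is not a cell identifier.
        mask_ids <- setdiff(unique(as.vector(imageData(masks[[i]]))), 0)
        cell_ids <- sce$cell_number[sce$image_name == img_name]
        stopifnot(all(cell_ids %in% mask_ids))
    }
    invisible(TRUE)
}

# Toy data: two markers, two 4 x 4 pixel images, three cells.
markers <- c("CD3e", "H3")
sce <- SingleCellExperiment(
    assays = list(
        counts = matrix(1, nrow = 2, ncol = 3, dimnames = list(markers, NULL))
    ),
    colData = DataFrame(
        image_name = c("img_A", "img_A", "img_B"),
        cell_number = c(1L, 2L, 1L)
    )
)
images <- CytoImageList(
    img_A = Image(array(runif(32), dim = c(4, 4, 2))),
    img_B = Image(array(runif(32), dim = c(4, 4, 2)))
)
channelNames(images) <- markers
mcols(images) <- DataFrame(image_name = c("img_A", "img_B"))

masks <- CytoImageList(
    img_A = Image(matrix(c(rep(1L, 8), rep(2L, 8)), nrow = 4)),
    img_B = Image(matrix(c(rep(0L, 8), rep(1L, 8)), nrow = 4))
)
mcols(masks) <- DataFrame(image_name = c("img_A", "img_B"))

check_imc_dataset(sce, images, masks)
```

The helper only checks that every cell in sce has a matching mask value; it does not require that every mask value has a cell, because cells may have been filtered out of the single cell data.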

Session info

sessionInfo()


BodenmillerGroup/imcdatasets documentation built on March 20, 2024, 9:24 a.m.