knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  out.width = "100%"
)
options(tibble.print_min = 5, tibble.print_max = 5)

Overview

Thank you for your interest!

The SingleCellMultiModal package aims to provide single cell datasets from several different technologies / modalities for benchmarking and analysis. We currently provide data from scNMT, scM&T, seqFISH, CITEseq, and other technologies. Contributions are very much welcome.

List of Multi-modal Datasets

For a full list of available datasets, see here: Google Drive Sheet

Contributing

Contributed data should generally be in Rda or Rds format, though we also support HDF5 and MTX formats. Aside from the usual metadata.csv documentation required in the package, contributors must add a name to the DataType column in the metadata table that indicates the name of the contributed dataset. To illustrate, here are some DataType names already in the package:

library(SingleCellMultiModal)
meta <- system.file("extdata", "metadata.csv",
    package = "SingleCellMultiModal", mustWork = TRUE)
head(read.csv(meta))

Versioning and folder structure

We associate a version with all datasets. We start with version 1.0.0 using semantic versioning and include data in a corresponding version folder (v1.0.0). Thus, the recommended folder structure is as follows:

~/data
  └ scmm/
    └ mouse_gastrulation/
      └ v1.0.0/
        └ scnmt_acc_cgi.rda
        └ scnmt_met_genebody.rda
        └ scnmt_met_cgi.rda
        └ scnmt_rna.rda
        └ scnmt_colData.rda
        └ scnmt_sampleMap.rda
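
As a minimal sketch, the layout above can be created from R with base functions. The example builds the tree under a temporary directory rather than ~/data so that it can be run safely; the variable names (base_dir, version_dir) are illustrative and not part of the package.

```r
# Build the versioned folder layout under a temporary directory.
# In practice base_dir would be "~/data"; tempfile() is used here
# only so that the example does not touch real files.
base_dir <- tempfile("data")
version <- "1.0.0"
version_dir <- file.path(
    base_dir, "scmm", "mouse_gastrulation", paste0("v", version)
)
dir.create(version_dir, recursive = TRUE)

# List the folders that were created, relative to base_dir
list.dirs(base_dir, full.names = FALSE, recursive = TRUE)
```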

In the inst section, we will discuss how to annotate these data products.

Files

It is customary to include one Rda / Rds file per assay or per combination of assay and region of interest (as above). We also highly recommend including sampleMap and colData datasets for the MultiAssayExperiment that will be built on the fly. In this example, there are three modalities in the scNMT dataset: rna (transcriptome), acc (chromatin accessibility), and met (methylation).
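
The one-file-per-assay convention might look like the following sketch. The matrices are toy stand-ins, and naming each file after the object it contains is an assumption made for illustration; the output goes to a temporary folder rather than a real version folder.

```r
# Toy stand-ins for two assays; real contributions would hold the
# actual measurements (features in rows, cells in columns).
set.seed(1)
cells <- paste0("cell", 1:3)
scnmt_rna <- matrix(rpois(6, 5), nrow = 2,
    dimnames = list(c("geneA", "geneB"), cells))
scnmt_met_cgi <- matrix(runif(6), nrow = 2,
    dimnames = list(c("cgi1", "cgi2"), cells))

# Save each object to its own .rda file, named after the object
out_dir <- tempfile("v1.0.0")
dir.create(out_dir)
for (obj in c("scnmt_rna", "scnmt_met_cgi")) {
    save(list = obj, file = file.path(out_dir, paste0(obj, ".rda")))
}
list.files(out_dir)
```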

vignettes

Contributors are required to demonstrate user-level functionality via examples in a vignette for each contributed dataset.

R

Ideally, the interface for the contributed dataset should be similar to that of scNMT so that users have a sense of consistency in the usage of the package. This means having one main function that returns a MultiAssayExperiment object and having options that show the user what datasets are available for a particular technology. Contributors should use roxygen2 to document datasets and use the @inheritParams scNMT tag to avoid copying @param documentation.

See the current example for implementation details:

scNMT(
    DataType = "mouse_gastrulation",
    mode = "*",
    version = "1.0.0",
    dry.run = TRUE
)

Note. Contributors should ensure that the documentation is complete and the proper data sources have been attributed.

inst/*

extdata/

In the following section we will describe how to annotate and append to the metadata.csv file. First, we have to ensure that we are accounting for all of the fields required by ExperimentHub. They are listed here:

Note. DataType is a field we've added to help distinguish multimodal technologies and is required for SingleCellMultiModal. Some of the DataTypes already available are mouse_gastrulation, mouse_visual_cortex, cord_blood, peripheral_blood, etc.

To make contributing easier, we've provided a mechanism for documentation based on a file, created from a data.frame, that we call a doc_file.

Interested contributors should create a doc_file in the inst/extdata/docuData folder. Although we do not have a strict naming convention for the doc_file, we usually name the file singlecellmultimodalvX.csv where X is the nth dataset added to the package.

Here is an example of the file from version v1.0.0 of the scNMT dataset:

doc_file <- system.file("extdata", "docuData", "singlecellmultimodalv1.csv",
    package = "SingleCellMultiModal", mustWork = TRUE)
read.csv(doc_file, header = TRUE)

Contributors will then use their doc_file to append to the existing metadata.csv.

To create a doc_file data.frame with the file name singlecellmultimodalvX.csv, first we create a data.frame object. Each general annotation or row in this data.frame will be applied to all files uploaded to ExperimentHub. We take advantage of the data.frame function to repeat data and create a uniform data.frame with equal values across the columns.

scmeta <- data.frame(
    DataProvider =
        "Dept. of Bioinformatics, The Babraham Institute, United Kingdom",
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceUrl = "https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ",
    SourceType = "RDS",
    SourceVersion = "1.0.0",
    DataType = "mouse_gastrulation",
    Maintainer = "Ricard Argelaguet <ricard@ebi.ac.uk>",
    stringsAsFactors = FALSE
)
scmeta
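
The repetition mentioned above relies on how data.frame recycles length-one values. A small sketch with illustrative values (the "HDF5" entry is an assumption, added only to give one column two values):

```r
# Length-one values are recycled to match the longest column, so
# shared annotations only need to be written once.
recycled <- data.frame(
    Species = "Mus musculus",
    TaxonomyId = "10090",
    SourceType = c("RDS", "HDF5"),
    stringsAsFactors = FALSE
)
recycled
nrow(recycled)
```

Here Species and TaxonomyId are repeated on both rows while SourceType varies.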

Saving the data

After creating the documentation data.frame (doc_file), the contributor can save that dataset as a .csv file using write.csv.

write.csv(
    scmeta,
    file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
    row.names = FALSE
)
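
Before committing the file, it can help to read it back and compare it with the data.frame that was written. The sketch below uses a reduced doc_file and a temporary file; it assumes that identifiers such as TaxonomyId should remain text when read back.

```r
# A reduced doc_file with the same kind of values as above
doc <- data.frame(
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceVersion = "1.0.0",
    stringsAsFactors = FALSE
)

# Write to a temporary file instead of inst/extdata/docuData
csv_file <- tempfile(fileext = ".csv")
write.csv(doc, file = csv_file, row.names = FALSE)

# By default read.csv would convert TaxonomyId to an integer;
# colClasses = "character" keeps every column as text.
back <- read.csv(csv_file, colClasses = "character")
all.equal(doc, back)
```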

Documenting diverse data

If the contributed data are not uniform, meaning that there are multiple file types from potentially different specimens, the data.frame will have to account for all contributed data files.

For example, if the contributed data has a number of different source types, the contributor is required to create a data.frame with the number of rows equal to the number of files to be uploaded.

In this example, we have two data files from different source types and formats:

data.frame(
    DataProvider =
        c("Institute of Population Genetics", "Mouse Science Center"),
    TaxonomyId = c("9606", "10090"),
    Species = c("Homo sapiens", "Mus musculus"),
    SourceUrl = c("https://human.science/org", "https://mouse.science/gov"),
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    stringsAsFactors = FALSE
)
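
A quick check that the documentation has one row per file could look like this sketch. The file names are hypothetical stand-ins for the files to be uploaded, and only a subset of the columns is shown.

```r
# Hypothetical file names standing in for the files to be uploaded
files <- c("human_genetics.rds", "mouse_genetics.xml")

doc <- data.frame(
    Species = c("Homo sapiens", "Mus musculus"),
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    stringsAsFactors = FALSE
)

# One annotation row is expected per file
stopifnot(nrow(doc) == length(files))

# Pair each file with its annotation row for a visual check
cbind(File = files, doc)
```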

scripts/

make-data/

The individual data products that will eventually come together into a MultiAssayExperiment can be uploaded as serialized RDA / RDS files, HDF5 files, and even MTX files. For examples of how to save data in each of these file formats, see the make-data folder.

Generating the metadata.csv

make-metadata.R

Based on the folder structure described previously, the directory argument in make_metadata corresponds to the ~/data/scmm folder. The dataDirs argument corresponds to the DataType / technology subfolder (e.g., "mouse_gastrulation"). These are used as inputs to the make_metadata function.

Once the data are ready, the contributor can use the make_metadata function defined in make-metadata.R in the scripts folder. A typical call to make_metadata will either add to the metadata or replace it entirely. The easiest option for current contributors is to append rows to the metadata file.

make_metadata(
    directory = "~/data/scmm",
    dataDirs = "mouse_gastrulation", # also the name of the DataType
    ext_pattern = "\\.[Rr][Dd][Aa]$",
    doc_file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
    pkg_name = "SingleCellMultiModal",
    append = TRUE,
    dry.run = TRUE
)

Note that the extraction pattern (ext_pattern) will allow contributors to match a specific file extension in that folder and ignore any intermediate files.
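
To see what this pattern matches, it can be tried against a few file names with base R. The file names below are made up for illustration.

```r
# Candidate file names, including ones that should be ignored
files <- c("scnmt_rna.rda", "scnmt_rna.RDA", "notes.txt",
    "scnmt_rna.rda.tmp")
ext_pattern <- "\\.[Rr][Dd][Aa]$"

# Keep only names that end in .rda, in any letter case
matched <- files[grepl(ext_pattern, files)]
matched
```

The trailing $ anchors the match to the end of the name, which is why the .rda.tmp intermediate file is skipped.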

The contributor may also wish to run the function with dry.run = TRUE to preview the data.frame that would be added to the metadata.csv file.
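
The idea of appending with a dry run can be sketched with toy data.frames. The append_rows helper and the "2.0.0" row are hypothetical and exist only for illustration; this is not the code behind make_metadata.

```r
# Hypothetical helper illustrating the append / dry-run idea; this is
# not the code behind make_metadata.
append_rows <- function(existing, new_rows, dry.run = TRUE) {
    if (dry.run) {
        # Return only what would be added, leaving `existing` untouched
        return(new_rows)
    }
    rbind(existing, new_rows)
}

existing <- data.frame(DataType = "mouse_gastrulation",
    SourceVersion = "1.0.0", stringsAsFactors = FALSE)
new_rows <- data.frame(DataType = "mouse_gastrulation",
    SourceVersion = "2.0.0", stringsAsFactors = FALSE)

# Preview the rows to be added, then perform the append
preview <- append_rows(existing, new_rows, dry.run = TRUE)
combined <- append_rows(existing, new_rows, dry.run = FALSE)
nrow(preview)
nrow(combined)
```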

Note. The make_metadata function should be run from the base directory of a GitHub / git checkout of the package (git clone ...).

Validation

It is recommended to run the metadata validation function from AnnotationHubData:

AnnotationHubData::makeAnnotationHubMetadata("SingleCellMultiModal")

to ensure that some of the metadata fields are properly annotated.

NEWS.md

Contributors should update the NEWS.md file with a mention of the function and data that are being provided. See the NEWS.md for examples.

Next steps

The contributor should then create a Pull Request on GitHub.

If you are interested in contributing, I can help you go over the contribution and submission. Please contact me either on the Bioc-community Slack (mramos148) or at marcel {dot} ramos [at] sph (dot) cuny (dot) edu. If you need to sign up to the community Slack channel, follow this link: https://bioc-community.herokuapp.com/

sessionInfo

sessionInfo()



waldronlab/SingleCellMultiModal documentation built on May 1, 2024, 5:29 a.m.