knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    cache = TRUE,
    out.width = "100%"
)
options(tibble.print_min = 5, tibble.print_max = 5)
Thank you for your interest!
The SingleCellMultiModal package aims to provide single-cell datasets from several different technologies and modalities for benchmarking and analysis. We currently provide datasets from scNMT, scM&T, seqFISH, CITEseq, and other technologies. Contributions are very much welcome.
For a full list of available datasets, see here: Google Drive Sheet
To contribute, we generally require data in Rda or Rds format, though we also support HDF5 and MTX formats. Aside from the usual required metadata.csv documentation in the package, contributors are required to add a name to the DataType column in the metadata table that indicates the name of the contributed dataset. To illustrate, here are some DataType names already in the package:
library(SingleCellMultiModal)
meta <- system.file(
    "extdata", "metadata.csv",
    package = "SingleCellMultiModal", mustWork = TRUE
)
head(read.csv(meta))
We associate a version with all datasets. We start with version 1.0.0 using semantic versioning and include data in a corresponding version folder (v1.0.0). Thus, the recommended folder structure is as follows:
~/data
└ scmm/
    └ mouse_gastrulation/
        └ v1.0.0/
            └ scnmt_acc_cgi.rda
            └ scnmt_met_genebody.rda
            └ scnmt_met_cgi.rda
            └ scnmt_rna.rda
            └ scnmt_colData.rda
            └ scnmt_sampleMap.rda
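As a minimal sketch, the layout above could be created from R as follows. The temporary base directory stands in for ~/data, and the saved matrix is a toy placeholder rather than real scNMT data; both are assumptions made for illustration.

```r
# Sketch only: build the recommended layout under a temporary directory.
# `base_dir` stands in for ~/data; the object saved is a toy placeholder.
base_dir <- file.path(tempdir(), "data")
version_dir <- file.path(base_dir, "scmm", "mouse_gastrulation", "v1.0.0")
dir.create(version_dir, recursive = TRUE, showWarnings = FALSE)

# One object per file; here the object name matches the file name.
scnmt_rna <- matrix(
    rpois(6, lambda = 5), nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2))
)
save(scnmt_rna, file = file.path(version_dir, "scnmt_rna.rda"))

# List what is now in the version folder.
saved_files <- list.files(version_dir)
saved_files
```

The remaining assay, colData, and sampleMap objects would be saved into the same version folder in the same way.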
In the inst section, we will discuss how to annotate these data products.
It is customary to include one Rda / Rds file per assay, or per combination of assay and region of interest (as above). We also highly recommend including sampleMap and colData datasets for the MultiAssayExperiment that will be built on the fly. In this example, there are three modalities in the scNMT dataset: rna (transcriptome), acc (chromatin accessibility), and met (methylation).
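To make the role of the colData and sampleMap datasets concrete, here is a small base R sketch. It assumes the MultiAssayExperiment convention that a sampleMap has the columns assay, primary, and colname; the cell names, stage labels, and the naming scheme linking columns to cells are invented for illustration.

```r
# Toy assays: features in rows, cells in columns (column names are assay-specific).
rna <- matrix(1:6, nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("cellA_rna", "cellB_rna")))
acc <- matrix(1:4, nrow = 2,
    dimnames = list(paste0("cgi", 1:2), c("cellA_acc", "cellB_acc")))
assays <- list(rna = rna, acc_cgi = acc)

# colData: one row per cell; row names are the primary identifiers.
scnmt_colData <- data.frame(
    stage = c("toy_stage_1", "toy_stage_2"),
    row.names = c("cellA", "cellB")
)

# sampleMap: one row per assay column, linking it to its primary identifier.
# Here the primary identifier is recovered by dropping the "_<assay>" suffix.
scnmt_sampleMap <- do.call(rbind, lapply(names(assays), function(nm) {
    cols <- colnames(assays[[nm]])
    data.frame(
        assay = nm,
        primary = sub("_[^_]+$", "", cols),
        colname = cols,
        stringsAsFactors = FALSE
    )
}))
scnmt_sampleMap$assay <- factor(scnmt_sampleMap$assay)
scnmt_sampleMap
```

Every value in the primary column should appear in the row names of the colData, which is what allows the assays to be linked when the MultiAssayExperiment is assembled.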
Contributors are required to demonstrate user-level functionality via examples in a vignette for each contributed dataset.
Ideally, the interface for the contributed dataset should be similar to that of scNMT so that users have a sense of consistency in the usage of the package. This means having one main function that returns a MultiAssayExperiment object and having options that show the user what datasets are available for a particular technology. Contributors should use roxygen2 to document datasets and use the @inheritParams scNMT tag to avoid copying @param documentation.
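The following skeleton sketches what such a function and its roxygen2 block might look like. The function name myTech and its default DataType are hypothetical, and the body is a placeholder; only the argument names are taken from the scNMT call shown in this document.

```r
#' Hypothetical accessor for a contributed technology
#'
#' `myTech` is an illustrative name, not a function in the package.
#'
#' @inheritParams scNMT
#'
#' @return With `dry.run = TRUE`, a description of the requested data;
#'   otherwise a MultiAssayExperiment (not implemented in this sketch).
#'
#' @export
myTech <- function(
    DataType = "my_datatype", mode = "*", version = "1.0.0", dry.run = TRUE
) {
    if (dry.run) {
        # Placeholder: echo the request instead of querying ExperimentHub.
        return(data.frame(DataType = DataType, mode = mode, version = version))
    }
    stop("download and assembly are not implemented in this sketch")
}

# Preview what would be requested without downloading anything.
preview <- myTech(dry.run = TRUE)
```

Because the argument names match those of scNMT, the @inheritParams tag can reuse the existing parameter descriptions when the documentation is built inside the package.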
See the current example for implementation details:
scNMT(
    DataType = "mouse_gastrulation",
    mode = "*",
    version = "1.0.0",
    dry.run = TRUE
)
Note. Contributors should ensure that the documentation is complete and the proper data sources have been attributed.
In the following section we will describe how to annotate and append to the metadata.csv file. First, we have to ensure that we are accounting for all of the fields required by ExperimentHub. They are listed here:
Note. DataType is a field we've added to help distinguish multimodal technologies and is required for SingleCellMultiModal. Some of the DataTypes already available are mouse_gastrulation, mouse_visual_cortex, cord_blood, peripheral_blood, etc.
To make contributing easier, we've provided a mechanism for documentation that uses a file, created from a data.frame, that we call a doc_file.
Interested contributors should create a doc_file in the inst/extdata/docuData folder. Although we do not have a strict naming convention for the doc_file, we usually name the file singlecellmultimodalvX.csv, where X indicates that this is the nth dataset added to the package.
Here is an example of the file from version v1.0.0 of the scNMT dataset:
doc_file <- system.file(
    "extdata", "docuData", "singlecellmultimodalv1.csv",
    package = "SingleCellMultiModal", mustWork = TRUE
)
read.csv(doc_file, header = TRUE)
Contributors will then use their doc_file to append to the existing metadata.csv.
To create a doc_file with the file name singlecellmultimodalvX.csv, we first create a data.frame object.
Each general annotation (row) in this data.frame will be applied to all files uploaded to ExperimentHub. We take advantage of the way the data.frame function recycles values to create a uniform data.frame with equal values across the columns.
scmeta <- data.frame(
    DataProvider =
        "Dept. of Bioinformatics, The Babraham Institute, United Kingdom",
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceUrl = "https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ",
    SourceType = "RDS",
    SourceVersion = "1.0.0",
    DataType = "mouse_gastrulation",
    Maintainer = "Ricard Argelaguet <ricard@ebi.ac.uk>",
    stringsAsFactors = FALSE
)
scmeta
After creating the documentation data.frame (doc_file), the contributor can save that dataset as a .csv file using write.csv.
write.csv(
    scmeta,
    file = "inst/extdata/docuData/singlecellmultimodal.csv",
    row.names = FALSE
)
If the contributed data is not uniform, meaning that there are multiple file types from potentially different specimens, the data.frame will have to account for all contributed data files.
For example, if the contributed data has a number of different source types, the contributor is required to create a data.frame with the number of rows equal to the number of files to be uploaded.
In this example, we have two data files from different source types and formats:
data.frame(
    DataProvider = c("Institute of Population Genetics", "Mouse Science Center"),
    TaxonomyId = c("9606", "10090"),
    Species = c("Homo sapiens", "Mus musculus"),
    SourceUrl = c("https://human.science/org", "https://mouse.science/gov"),
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    stringsAsFactors = FALSE
)
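One way to guard against a mismatch is to compare the number of rows in the documentation data.frame with the number of files that will be uploaded. The sketch below assumes a hypothetical staging directory and made-up file names; it is a sanity check a contributor might run, not something the package performs.

```r
# Sketch: verify that a doc_file data.frame has one row per file to upload.
# The directory and file names are hypothetical placeholders.
data_dir <- file.path(tempdir(), "nonuniform_example")
dir.create(data_dir, showWarnings = FALSE)
created <- file.create(
    file.path(data_dir, c("human_assay.rds", "mouse_assay.xml"))
)

docs <- data.frame(
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    stringsAsFactors = FALSE
)

# Compare the number of annotation rows with the number of staged files.
upload_files <- list.files(data_dir)
rows_match <- nrow(docs) == length(upload_files)
if (!rows_match) {
    stop("doc_file rows (", nrow(docs), ") do not match files (",
        length(upload_files), ")")
}
```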
The individual data products that will eventually come together into a MultiAssayExperiment can be uploaded as serialized RDA / RDS files, HDF5, and even MTX files. For examples on how to save data into their respective file formats, see the make-data folder.
Based on the folder structure described previously, the directory argument in make_metadata corresponds to the ~/data/scmm folder. The dataDirs argument corresponds to the DataType / technology subfolder (e.g., "mouse_gastrulation"). Both are passed as inputs to the make_metadata function.
Once the data is ready, the contributor can use the function in make-metadata.R in the scripts folder. A typical call to make_metadata will either add to the metadata or replace it entirely. The easiest option for current contributors is to append rows to the metadata file.
make_metadata(
    directory = "~/data/scmm",
    dataDirs = "mouse_gastrulation", # also the name of the DataType
    ext_pattern = "\\.[Rr][Dd][Aa]$",
    doc_file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
    pkg_name = "SingleCellMultiModal",
    append = TRUE,
    dry.run = TRUE
)
Note that the extraction pattern (ext_pattern) will allow contributors to match a specific file extension in that folder and ignore any intermediate files.
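To see what the pattern used above selects, it can be tried against a few file names with base R's grepl. The file names here are invented for illustration, and how make_metadata applies the pattern internally is not shown in this document.

```r
# The pattern from the make_metadata call: ".rda" in any letter case,
# anchored at the end of the file name.
ext_pattern <- "\\.[Rr][Dd][Aa]$"

# Hypothetical folder contents, including intermediate files to be ignored.
candidates <- c(
    "scnmt_rna.rda", "scnmt_rna.RDA", "scnmt_rna.rds",
    "notes.rda.bak", "scnmt_colData.Rda"
)

matched <- candidates[grepl(ext_pattern, candidates)]
matched
#> [1] "scnmt_rna.rda"     "scnmt_rna.RDA"     "scnmt_colData.Rda"
```

The trailing `$` is what excludes names such as notes.rda.bak, where the extension appears in the middle of the file name.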
The contributor may also wish to set dry.run = TRUE to preview the data.frame that would be added to the metadata.csv file.
Note. The make_metadata function should be run from the base package directory from a GitHub / git checkout (git clone ...).
It is recommended to run the metadata validation function from AnnotationHubData:
AnnotationHubData::makeAnnotationHubMetadata("SingleCellMultiModal")
to ensure that some of the metadata fields are properly annotated.
Contributors should update the NEWS.md file with a mention of the function and data that are being provided. See the NEWS.md for examples.
The contributor should then create a Pull Request on GitHub.
If you are interested in contributing, I can help you go over the contribution and submission. Please contact me either on the Bioc-community Slack (mramos148) or at marcel {dot} ramos [at] sph (dot) cuny (dot) edu. If you need to sign up to the community Slack channel, follow this link: https://bioc-community.herokuapp.com/
sessionInfo
sessionInfo()