```r
knitr::opts_chunk$set(error=FALSE, warning=FALSE, message=FALSE)
```
The `r Biocpkg("scRNAseq")` package provides convenient access to several publicly available single-cell datasets in the form of `SingleCellExperiment` objects.
We do all of the necessary data munging for each dataset beforehand, so that users can obtain a `SingleCellExperiment` for immediate use in further analyses.
To enable discovery, each dataset is decorated with metadata such as the study title/abstract, the species involved, the number of cells, etc.
Users can also contribute their own published datasets to enable re-use by the wider Bioconductor community.
The `surveyDatasets()` function will show all available datasets along with their metadata.
This can be used to discover interesting datasets for further analysis.
```r
library(scRNAseq)
all.ds <- surveyDatasets()
all.ds
```
Users can also search on the metadata text using the `searchDatasets()` function.
This accepts both simple text queries and more complicated expressions involving boolean operations.
```r
# Find all datasets involving pancreas.
searchDatasets("pancreas")[,c("name", "title")]

# Find all mm10 datasets involving pancreas or neurons.
searchDatasets(
    defineTextQuery("GRCm38", field="genome") &
    (defineTextQuery("neuro%", partial=TRUE) |
     defineTextQuery("pancrea%", partial=TRUE))
)[,c("name", "title")]
```
Keep in mind that the search results are not guaranteed to be reproducible - more datasets may be added over time, and existing datasets may be updated with new versions. Once a dataset of interest is identified, users should explicitly list the name and version of the dataset in their scripts to ensure reproducibility.
The `fetchDataset()` function will download a particular dataset, returning it as a `SingleCellExperiment`:
```r
sce <- fetchDataset("zeisel-brain-2015", "2023-12-14")
sce
```
For studies that generate multiple datasets, the dataset of interest must be explicitly requested via the `path=` argument:
```r
sce <- fetchDataset("baron-pancreas-2016", "2023-12-14", path="human")
sce
```
By default, array data is loaded as a file-backed `DelayedArray` from the `r Biocpkg("HDF5Array")` package.
Setting `realize.assays=TRUE` and/or `realize.reduced.dims=TRUE` will coerce these to more conventional in-memory representations like ordinary arrays or `dgCMatrix` objects.
```r
assay(sce)
sce <- fetchDataset("baron-pancreas-2016", "2023-12-14", path="human", realize.assays=TRUE)
class(assay(sce))
```
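The difference between a delayed and a realized array can be illustrated without downloading anything. The following is a minimal sketch, assuming the DelayedArray package is installed; it wraps a small in-memory matrix rather than an HDF5 file, so it only mimics the file-backed case, and the `counts` matrix is made-up data.

```r
library(DelayedArray)

# Small in-memory matrix standing in for an assay (illustrative data only).
counts <- matrix(rpois(20, lambda=1), nrow=4, ncol=5)

# Wrap it as a DelayedArray; operations on it are recorded and deferred.
delayed <- DelayedArray(counts)
class(delayed)

# The transformation is still delayed at this point.
logged <- log1p(delayed)
class(logged)

# Realize the result into an ordinary in-memory matrix.
realized <- as.matrix(logged)
class(realized)
```

Realization trades memory for speed, so it is most useful when the dataset comfortably fits in RAM.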
Users can also fetch the metadata associated with each dataset:
```r
str(fetchMetadata("zeisel-brain-2015", "2023-12-14"))
```
Want to contribute your own dataset to this package? It's easy! Just follow these simple steps for instant fame and prestige.
Format your dataset as a `SummarizedExperiment` or `SingleCellExperiment`.
Let's just make up something here.
```r
library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=matrix(rpois(1000, lambda=1), 100, 10)))
rownames(sce) <- sprintf("GENE_%i", seq_len(nrow(sce)))
colnames(sce) <- head(LETTERS, 10)
```
Assemble the metadata for your dataset.
This should be a list structured as specified in the Bioconductor metadata schema.
Check out some examples from `fetchMetadata()`; note that the `application.takane` property will be automatically added later, and so can be omitted from the list that you create.
```r
meta <- list(
    title="My dataset",
    description="This is my dataset",
    taxonomy_id="10090", # NCBI taxonomy ID for Mus musculus
    genome="GRCm38",     # mouse assembly, to match the taxonomy ID
    sources=list(
        list(provider="GEO", id="GSE12345"),
        list(provider="PubMed", id="1234567")
    ),
    maintainer_name="Chihaya Kisaragi",
    maintainer_email="kisaragi.chihaya@765pro.com"
)
```
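Before saving, it can help to catch obvious omissions in the metadata list. The following is a minimal sketch in base R; `checkMeta()` is a hypothetical helper that is not part of scRNAseq, and the list of fields is simply copied from the example above rather than from the schema, which remains the authoritative reference.

```r
# Hypothetical sanity check; not part of scRNAseq and not a substitute
# for validation against the Bioconductor metadata schema.
checkMeta <- function(meta) {
    # Fields taken from the example list above (an assumption, not the schema).
    fields <- c("title", "description", "taxonomy_id", "genome",
                "sources", "maintainer_name", "maintainer_email")
    absent <- setdiff(fields, names(meta))
    if (length(absent)) {
        stop("missing metadata fields: ", paste(absent, collapse=", "))
    }
    # Each source is expected to name a provider and an identifier.
    for (s in meta$sources) {
        if (!all(c("provider", "id") %in% names(s))) {
            stop("each source needs 'provider' and 'id'")
        }
    }
    invisible(TRUE)
}

example <- list(
    title="My dataset",
    description="This is my dataset",
    taxonomy_id="10090",
    genome="GRCm38",
    sources=list(list(provider="GEO", id="GSE12345")),
    maintainer_name="Chihaya Kisaragi",
    maintainer_email="kisaragi.chihaya@765pro.com"
)
checkMeta(example)

# Dropping a field produces an informative error message.
incomplete <- example[setdiff(names(example), "genome")]
tryCatch(checkMeta(incomplete), error=function(e) conditionMessage(e))
```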
Save your `SummarizedExperiment` (or whatever object contains your dataset) to disk with `saveDataset()`.
This saves the dataset into a "staging directory" using language-agnostic file formats; check out the alabaster framework for more details.
```r
staging <- tempfile()
saveDataset(sce, staging, meta)
list.files(staging, recursive=TRUE)
```

In more complex cases involving multiple datasets, users may save each dataset into a subdirectory of the staging directory.

```r
staging <- tempfile()
dir.create(staging)
saveDataset(sce, file.path(staging, "foo"), meta)
saveDataset(sce, file.path(staging, "bar"), meta) # etc.
```
You can check that everything was correctly saved by reloading the on-disk data into the R session for inspection:
```r
alabaster.base::readObject(file.path(staging, "foo"))
```
Choose a name for your dataset.
This typically follows the format `{NAME}-{SYSTEM}-{YEAR}`, where `NAME` is the last name of the first author of the study, `SYSTEM` is the biological system (e.g., tissue, cell types) being studied, and `YEAR` is the year of publication for the dataset.

Provide an Rmarkdown file containing the code used to assemble the dataset.
This should be added to the `scripts/` directory of this package, in order to provide some record of how the dataset was created.
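The naming convention above can be checked mechanically before uploading. The following is a minimal sketch in base R; `makeDatasetName()` and `isValidDatasetName()` are hypothetical helpers that are not part of scRNAseq, and the lowercase-with-hyphens pattern is an assumption inferred from existing names such as `zeisel-brain-2015`.

```r
# Hypothetical helpers (not part of scRNAseq) for the {NAME}-{SYSTEM}-{YEAR} format.
makeDatasetName <- function(author, system, year) {
    # Lowercase each component, replace runs of other characters with a
    # single hyphen, and trim any leading or trailing hyphens.
    clean <- function(x) gsub("(^-+|-+$)", "", gsub("[^a-z0-9]+", "-", tolower(x)))
    paste(clean(author), clean(system), as.integer(year), sep="-")
}

isValidDatasetName <- function(name) {
    # Assumed pattern: hyphen-separated lowercase words ending in a 4-digit year,
    # with at least one word for the author and one for the system.
    grepl("^[a-z0-9]+(-[a-z0-9]+)+-[0-9]{4}$", name)
}

name <- makeDatasetName("Zeisel", "brain", 2015)
name                              # "zeisel-brain-2015"
isValidDatasetName(name)          # TRUE
isValidDatasetName("My Dataset")  # FALSE
```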
Wait for us to grant temporary upload permissions to your GitHub account.
Upload your staging directory to the gypsum backend with `gypsum::uploadDirectory()`.
On the first call to this function, it will automatically prompt you to log into GitHub so that the backend can authenticate you.
If you are on a system without browser access (e.g., most computing clusters), a token can be manually supplied via `gypsum::setAccessToken()`.
```r
gypsum::uploadDirectory(staging, "scRNAseq", "my_dataset_name", "my_version")
```
You can check that everything was successfully uploaded by calling `fetchDataset()` with the same name and version:
```r
fetchDataset("my_dataset_name", "my_version")
```
If you realize that you made a mistake, no worries. Use the following call to clear the erroneous dataset, and try again:
```r
gypsum::rejectProbation("scRNAseq", "my_dataset_name", "my_version")
```
```r
sessionInfo()
```