In LTLA/scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets

library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the count data

We obtain a single-cell RNA sequencing dataset of human pancreas from Grun et al. (2016). A count matrix for endogenous genes and spike-ins is provided from the Gene Expression Omnibus using accession code GSE81076. We download it using r Biocpkg("BiocFileCache") to cache the results:

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
grun.fname <- bfcrpath(bfc, "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE81nnn/GSE81076/suppl/GSE81076%5FD2%5F3%5F7%5F10%5F17%2Etxt%2Egz")

We read the table into memory as a sparse matrix.

library(scuttle)
counts <- readSparseCounts(grun.fname, quote="\"")
dim(counts)

Forming a `SingleCellExperiment`

Wrapping the counts into a SCE:

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts))

Fleshing out the column metadata based on the GEO sample annotation:

lib.names <- sub("_.*", "", colnames(sce))
sce$donor <- sub("(D10|D17|D2|D3|D7).*", "\\1", lib.names)
sce$sample <- c(
    D10631="CD63+ sorted cells",
    D101="live sorted cells, library 1",
    D102="live sorted cells, library 2",
    D1713="CD13+ sorted cells",
    D172444="CD24+ CD44+ live sorted cells",
    D17All1="live sorted cells, library 1",
    D17All2="live sorted cells, library 2",
    D17TGFB="TGFBR3+ sorted cells",
    D2ex="exocrine fraction, live sorted cells",
    D3en1="live sorted cells, library 1",
    D3en2="live sorted cells, library 2",
    D3en3="live sorted cells, library 3",
    D3en4="live sorted cells, library 4",
    D3ex="exocrine fraction, live sorted cells",
    D71="live sorted cells, library 1",
    D72="live sorted cells, library 2",
    D73="live sorted cells, library 3",
    D74="live sorted cells, library 4"
)[lib.names]

Splitting spike-ins into an alternative experiment:

status <- ifelse(grepl("^ERCC-[0-9]+$", rownames(sce)), "ERCC", "endogenous")
sce <- splitAltExps(sce, status, ref="endogenous")

library(scRNAseq)
spike.exp <- altExp(sce, "ERCC")
spikedata <- countErccMolecules(volume = 20, dilution = 50000)
rowData(spike.exp) <- spikedata[rownames(spike.exp), ]
altExp(sce, "ERCC") <- spike.exp

Splitting up the gene names into some useful information for downstream users:

rowData(sce)$symbol <- sub("__.*", "", rownames(sce))
rowData(sce)$chr <- sub(".*__", "", rownames(sce))

And adding some polish:

library(scRNAseq)
sce <- polishDataset(sce)
sce

Saving to file

We save all of the components to file in preparation for upload:

meta <- list(
title="De Novo Prediction of Stem Cell Identity using Single-Cell Transcriptome Data [pancreas only]",
description="Adult mitotic tissues like the intestine, skin, and blood undergo constant turnover throughout the life of an organism. Knowing the identity of the stem cell is crucial to understanding tissue homeostasis and its aberrations upon disease. Here we present a computational method for the derivation of a lineage tree from single-cell transcriptome data. By exploiting the tree topology and the transcriptome composition, we establish StemID, an algorithm for identifying stem cells among all detectable cell types within a population. We demonstrate that StemID recovers two known adult stem cell populations, Lgr5+ cells in the small intestine and hematopoietic stem cells in the bone marrow. We apply StemID to predict candidate multipotent cell populations in the human pancreas, a tissue with largely uncharacterized turnover dynamics. We hope that StemID will accelerate the search for novel stem cells by providing concrete markers for biological follow-up and validation.

Maintainer note: this dataset contains only the pancreas cells from the study. The sorting protocol and donor identity for each cell were obtained from the sample annotations in the GEO entry. ERCC spike-in molecule counts were computed with a volume of 20 nL per cell and a dilution of 1:50000.",
taxonomy_id="9606",
genome="GRCh37",
sources=list(
list(provider="GEO", id='GSE81076'),
list(provider="PubMed", id='27345837')
),
maintainer_name="Aaron Lun",
maintainer_email="infinite.monkeys.with.keyboards@gmail.com"
)

saveDataset(sce, "2023-12-14_output", meta)