In LTLA/scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets

library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the count data

We obtain a single-cell RNA sequencing dataset of the human pancreas from Muraro et al. (2016). A count matrix is available from the Gene Expression Omnibus under the accession code GSE85241. We download it with the BiocFileCache package, which caches the file locally so that it is not fetched again on later runs:

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
muraro.fname <- bfcrpath(bfc, paste0("ftp://ftp.ncbi.nlm.nih.gov/geo/series/",
    "GSE85nnn/GSE85241/suppl/",
    "GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz"))

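The file name in the download URL is percent-encoded, which makes it hard to read. As a quick sanity check, the sketch below decodes it with base R's URLdecode() to show which file is being requested. The variable names encoded and decoded are only for illustration and are not used elsewhere in this workflow.

```r
# Percent-encoded file name, copied from the download URL above.
encoded <- "GSE85241%5Fcellsystems%5Fdataset%5F4donors%5Fupdated%2Ecsv%2Egz"

# %5F decodes to an underscore and %2E to a period.
decoded <- utils::URLdecode(encoded)
decoded
```
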
We first read the table into memory as a sparse matrix.

library(scuttle)
counts <- readSparseCounts(muraro.fname, quote="\"")
dim(counts)
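
To illustrate what this step does, the sketch below writes a small table with quoted names to a temporary file and reads it back with readSparseCounts(). The toy matrix, its row and column names, and the toy.* variables are made up for illustration. Like the call above, it relies on the default tab separator.

```r
library(scuttle)

# Hypothetical toy counts; the names only mimic the patterns used later on.
toy <- matrix(c(0, 3, 0, 0, 1, 0), nrow=2,
    dimnames=list(c("GENE1__chr1", "GENE2__chr2"), c("D1-1_1", "D1-1_2", "D2-1_1")))

# Write a tab-separated file with quoted names and a blank corner field in the header.
toy.file <- tempfile(fileext=".tsv")
write.table(toy, file=toy.file, sep="\t", quote=TRUE, col.names=NA)

# Read it back as a sparse matrix; quote="\"" strips the quotes from the names.
toy.counts <- readSparseCounts(toy.file, quote="\"")
class(toy.counts)
dim(toy.counts)
```

Only the non-zero entries are stored in the result, which is what keeps the memory usage manageable for the full count table.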

Loading the column metadata

We extract the donor and plate identifiers from the column names.

donor.names <- sub("^(D[0-9]+).*", "\\1", colnames(counts))
plate.id <- sub("^D[0-9]+-([0-9]+)_.*", "\\1", colnames(counts))
table(donor.names, plate.id)
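
To see how the two regular expressions behave, here is a small sketch on made-up column names. The names in toy.names are hypothetical and only follow the donor-plate_well pattern that the expressions assume.

```r
# Hypothetical column names of the form <donor>-<plate>_<well>.
toy.names <- c("D28-1_1", "D28-2_15", "D30-8_96")

# Keep only the leading donor identifier.
toy.donor <- sub("^(D[0-9]+).*", "\\1", toy.names)

# Keep only the plate number between the hyphen and the underscore.
toy.plate <- sub("^D[0-9]+-([0-9]+)_.*", "\\1", toy.names)

toy.donor
toy.plate

# sub() returns a non-matching name unchanged, so we check that every name fits the pattern.
stopifnot(all(grepl("^D[0-9]+-[0-9]+_", toy.names)))
```

The final check is worth having because a column name that does not follow the pattern would otherwise end up as its own "donor" or "plate" without any warning.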

We also load in some cell annotations provided by the authors via Vladimir Kiselev and Martin Hemberg. Annoyingly, our original source of this file is no longer available, so we'll have to load the copy from ExperimentHub.

library(ExperimentHub)
ehub <- ExperimentHub()
old.coldata <- ehub[["EH2693"]]
stopifnot(identical(rownames(old.coldata), colnames(counts)))
table(old.coldata$label, useNA="always")
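
The stopifnot() call above requires the annotation rows to be in the same order as the columns of the count matrix. The sketch below is not part of this workflow; it shows, on made-up data, one way to reorder an annotation table with match() if that were not the case. The cell names, labels and toy.anno table are all hypothetical.

```r
# Hypothetical cell names and an annotation table in a different order.
cell.names <- c("D28-1_1", "D28-1_2", "D28-1_3")
toy.anno <- data.frame(label=c("beta", "alpha", NA),
    row.names=c("D28-1_2", "D28-1_1", "D28-1_3"))

# Find the annotation row for each cell and fail if any cell is missing.
m <- match(cell.names, rownames(toy.anno))
stopifnot(!anyNA(m))

# Reorder the annotation so that its rows follow the cell names.
toy.anno <- toy.anno[m, , drop=FALSE]
stopifnot(identical(rownames(toy.anno), cell.names))

# Unlabelled cells show up in the NA column of the table.
table(toy.anno$label, useNA="always")
```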

Now we assemble the new column data:

coldata <- DataFrame(label=old.coldata$label, donor=donor.names, plate=factor(plate.id))
coldata

Saving to file

We assemble everything into a SingleCellExperiment:

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts), colData=coldata)

# Splitting off the ERCCs.
status <- ifelse(grepl("^ERCC-[0-9]+", rownames(sce)), "ERCC", "endogenous")
sce <- splitAltExps(sce, status, ref="endogenous")

# Hacking out some gene details from the row names.
symbol <- sub("__.*", "", rownames(sce))
loc <- sub(".*__", "", rownames(sce))
rowData(sce) <- DataFrame(symbol=symbol, chr=loc)
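
The row name handling above can be checked on a few made-up names. In this sketch, toy.rows is hypothetical and assumes that genes are named as symbol__chromosome and that spike-ins start with "ERCC-".

```r
# Hypothetical row names: two genes and one spike-in transcript.
toy.rows <- c("INS__chr11", "GCG__chr2", "ERCC-00002")

# Flag spike-ins by their prefix; everything else is treated as endogenous.
toy.status <- ifelse(grepl("^ERCC-[0-9]+", toy.rows), "ERCC", "endogenous")
toy.status

# For the endogenous genes, split the name at the double underscore.
endo <- toy.rows[toy.status == "endogenous"]
toy.symbol <- sub("__.*", "", endo)  # text before the double underscore
toy.chr <- sub(".*__", "", endo)     # text after the double underscore
toy.symbol
toy.chr
```

As in the workflow, the split is applied only after the spike-ins have been set aside, so the symbol and chromosome columns describe endogenous genes only.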

We do some polishing to save disk space later.

library(scRNAseq)
sce <- polishDataset(sce)
sce

And we then save it all to disk.

meta <- list(
    title="A Single-Cell Transcriptome Atlas of the Human Pancreas",
    description="To understand organ function, it is important to have an inventory of its cell types and of their corresponding marker genes. This is a particularly challenging task for human tissues like the pancreas, because reliable markers are limited. Hence, transcriptome-wide studies are typically done on pooled islets of Langerhans, obscuring contributions from rare cell types and of potential subpopulations. To overcome this challenge, we developed an automated platform that uses FACS, robotics, and the CEL-Seq2 protocol to obtain the transcriptomes of thousands of single pancreatic cells from deceased organ donors, allowing in silico purification of all main pancreatic cell types. We identify cell type-specific transcription factors and a subpopulation of REG3A-positive acinar cells. We also show that CD24 and TM4SF4 expression can be used to sort live alpha and beta cells with high purity. This resource will be useful for developing a deeper understanding of pancreatic biology and pathophysiology of diabetes mellitus.",
    taxonomy_id="9606",
    genome="GRCh37",
    sources=list(
        list(provider="GEO", id="GSE85241"),
        list(provider="PubMed", id="27693023"),
        list(provider="ExperimentHub", id="EH2693")
    ),
    maintainer_name="Aaron Lun",
    maintainer_email="infinite.monkeys.with.keyboards@gmail.com"
)

saveDataset(sce, "2023-12-19_output", meta)