In LTLA/scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets

library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the count data

We obtain a single-cell RNA sequencing dataset of the mouse retina from Macosko et al. (2015). Counts for endogenous genes and spike-in transcripts are available from the Gene Expression Omnibus under the accession number GSE63472. We download and cache them using the `r Biocpkg("BiocFileCache")` package.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
base.url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63472/suppl/"
count.file <- bfcrpath(bfc, paste0(base.url,
    "GSE63472_P14Retina_merged_digital_expression.txt.gz"))

We read the counts into memory as a sparse matrix.

library(scuttle)
counts <- readSparseCounts(count.file)
dim(counts)
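To illustrate why a sparse representation is useful for count data of this kind, the following sketch builds a small toy matrix (not the real dataset) and compares its dense and sparse memory footprints using the Matrix package. The dimensions, the proportion of non-zero entries and all object names are assumptions chosen purely for illustration.

```r
library(Matrix)

# Toy data: 1000 genes x 200 cells where about 5% of entries are non-zero.
# These numbers are illustrative assumptions, not properties of the real data.
set.seed(100)
ngenes <- 1000
ncells <- 200
dense <- matrix(0, nrow=ngenes, ncol=ncells,
    dimnames=list(paste0("gene", seq_len(ngenes)), paste0("cell", seq_len(ncells))))
nonzero <- sample(length(dense), size=round(0.05 * length(dense)))
dense[nonzero] <- rpois(length(nonzero), lambda=2) + 1

# Convert to a column-compressed sparse matrix.
sparse <- as(dense, "CsparseMatrix")

# The sparse form only stores the non-zero entries and their positions,
# so it occupies less memory than the equivalent dense matrix.
dense.size <- as.numeric(object.size(dense))
sparse.size <- as.numeric(object.size(sparse))
c(dense=dense.size, sparse=sparse.size)
```

The saving grows with the proportion of zeros, which tends to be high in droplet-based data.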

Downloading the per-cell metadata

We also download a file containing the metadata for each cell.

meta.url <- "http://mccarrolllab.com/wp-content/uploads/2015/05/retina_clusteridentities.txt"
meta.file <- bfcrpath(bfc, meta.url)
coldata <- read.delim(meta.file, stringsAsFactors=FALSE, header=FALSE)
colnames(coldata) <- c("cell.id", "cluster")

library(S4Vectors)
coldata <- as(coldata, "DataFrame")
coldata

We match the metadata rows to the columns of the count matrix, and report how many cells have no metadata entry.

m <- match(colnames(counts), coldata$cell.id)
coldata <- coldata[m,]
rownames(coldata) <- colnames(counts)
coldata$cell.id <- NULL # redundant with the row names
summary(is.na(m))
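The alignment step can be sketched on a toy example in base R. The cell names, cluster values and object names below are hypothetical stand-ins, and an ordinary data.frame is used in place of a DataFrame to keep the sketch free of dependencies.

```r
# Toy stand-ins for the real objects; all names and values are hypothetical.
toy.cells <- c("cellA", "cellB", "cellC", "cellD")
toy.meta <- data.frame(
    cell.id=c("cellC", "cellA", "cellD"),
    cluster=c(3L, 1L, 2L),
    stringsAsFactors=FALSE
)

# For each column name, find the metadata row with the same identifier.
# Cells that are absent from the metadata yield NA.
toy.m <- match(toy.cells, toy.meta$cell.id)
toy.m

# Subsetting by the match indices reorders the metadata to follow the columns,
# inserting a row of NAs for any unmatched cell.
toy.aligned <- toy.meta[toy.m,]
rownames(toy.aligned) <- toy.cells
toy.aligned$cell.id <- NULL
toy.aligned

# Count the cells without metadata.
summary(is.na(toy.m))
```

Here "cellB" has no metadata entry, so its cluster is NA after alignment while the remaining rows follow the column order.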

Saving to file

We now assemble the components into a SingleCellExperiment, with some polishing to save disk space.

library(scRNAseq)
sce <- SingleCellExperiment(list(counts=counts), colData=coldata)
sce <- polishDataset(sce)
sce

We define the dataset-level metadata and save the object to disk in preparation for upload:

meta <- list(
    title="Highly Parallel Genome-wide Expression Profiling of Individual Cells Using Nanoliter Droplets",
    description="Cells, the basic units of biological structure and function, vary broadly in type and state. Single-cell genomics can characterize cell identity and function, but limitations of ease and scale have prevented its broad application. Here we describe Drop-seq, a strategy for quickly profiling thousands of individual cells by separating them into nanoliter-sized aqueous droplets, associating a different barcode with each cell's RNAs, and sequencing them all together. Drop-seq analyzes mRNA transcripts from thousands of individual cells simultaneously while remembering transcripts' cell of origin. We analyzed transcriptomes from 44,808 mouse retinal cells and identified 39 transcriptionally distinct cell populations, creating a molecular atlas of gene expression for known retinal cell classes and novel candidate cell subtypes. Drop-seq will accelerate biological discovery by enabling routine transcriptional profiling at single-cell resolution.",
    taxonomy_id="10090",
    genome="GRCm38",
    sources=list(
        list(provider="GEO", id="GSE63472"),
        list(provider="PubMed", id="26000488"),
        list(provider="URL", id=meta.url, version=as.character(Sys.Date()))
    ),
    maintainer_name="Aaron Lun",
    maintainer_email="infinite.monkeys.with.keyboards@gmail.com"
)

saveDataset(sce, "2023-12-19_output", meta)