In drisso/scRNAseq: Collection of Public Single-Cell RNA-Seq Datasets

library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the count data

We obtain a single-cell RNA sequencing dataset of human pancreas from @segerstolpe2016single. A count matrix is provided in the ArrayExpress entry for this project. We download it using r Biocpkg("BiocFileCache") to cache the results:

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)    
emat <- bfcrpath(bfc, file.path("https://www.ebi.ac.uk/arrayexpress",
    "experiments/E-MTAB-5061/files/E-MTAB-5061.processed.1.zip"))
unzip(emat, list=TRUE)

The file itself is quite complex:

The first 2 columns contain the gene symbol and NCBI GenBank transcript identifiers for each row.
The next X columns are the RPKMs, for X cells.
The remaining X columns are the counts.

This requires some additional work to extract the useful data. The first line contains the names of the cells, so we can use this to determine the number and indices of the columns with per-cell counts.

count.file <- "pancreas_refseq_rpkms_counts_3514sc.txt"
col.names <- read.table(unz(emat, count.file), header=FALSE, sep="\t", 
    stringsAsFactors=FALSE, comment.char="", nrows = 1)[,-1]
ncells <- length(col.names)

what <- vector("list", ncells*2 + 2)
what[[1]] <- "character"
what[[2]] <- "character"
what[seq_len(ncells) + ncells + 2] <- "integer"

We then read in the gene symbols and the counts.

emtab.df <- read.table(unz(emat, count.file), header=FALSE, sep="\t", 
    stringsAsFactors=FALSE, colClasses=what, skip=1)
gene.info <- emtab.df[,1:2]
emtab.df <- emtab.df[,-(1:2)]
colnames(emtab.df) <- col.names
dim(emtab.df)

Some coercion is performed to yield a count matrix and a row-level DataFrame.

library(S4Vectors)
counts <- as.matrix(emtab.df)
rowdata <- DataFrame(symbol=gene.info[,1], refseq=gene.info[,2])

Preparing the column metadata

We retrieve the column metadata fields from ArrayExpress using the same accession number.

meta.fname <- bfcrpath(bfc, file.path("https://www.ebi.ac.uk/arrayexpress",
    "files/E-MTAB-5061/E-MTAB-5061.sdrf.txt"))
emtab.sdrf <- read.delim(meta.fname, stringsAsFactors=FALSE, check.names=FALSE)
colnames(emtab.sdrf)

We make sure that the sample IDs match with the column names of the count matrix.

stopifnot(identical(sort(emtab.sdrf[["Source Name"]]), sort(colnames(emtab.df))))
emtab.sdrf <- emtab.sdrf[match(colnames(counts), emtab.sdrf[["Source Name"]]),]

We only keep the Source Name and Characteristics fields. The other fields describe relationships to other files/identifiers within ArrayExpress and are not of (primary) interest.

keep <- grep("Characteristics", colnames(emtab.sdrf))
emtab.sdrf <- emtab.sdrf[,c(1, keep)]
colnames(emtab.sdrf) <- sub(".*\\[(.*)\\]", "\\1", colnames(emtab.sdrf))

We also remove fields that only have one level and thus are not really useful.

has.multi.levels <- vapply(emtab.sdrf, function(x) length(unique(x))>1L, TRUE)
emtab.sdrf <- emtab.sdrf[,has.multi.levels]

The cell type field has some "not applicable" entries that should be more formally represented as NAs.

lost <- emtab.sdrf$`cell type`=="not applicable"
emtab.sdrf$`cell type`[lost] <- NA_character_

We coerce this into a column-level DataFrame.

coldata <- as(emtab.sdrf, "DataFrame")
coldata

Saving to file

We now save all of the components to file for upload to r Biocpkg("ExperimentHub"). These will be used to construct a SingleCellExperiment on the client side when the dataset is requested.

path <- file.path("scRNAseq", "segerstolpe-pancreas", "2.0.0")
dir.create(path, showWarnings=FALSE, recursive=TRUE)
saveRDS(counts, file=file.path(path, "counts.rds"))
saveRDS(rowdata, file=file.path(path, "rowdata.rds"))
saveRDS(coldata, file=file.path(path, "coldata.rds"))