library(BiocStyle) knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
We obtain a single-cell RNA sequencing dataset of human pancreas from Segerstolpe et al. (2016).
A count matrix is provided in the ArrayExpress entry for this project.
We download it using r Biocpkg("BiocFileCache")
to cache the results:
library(BiocFileCache) bfc <- BiocFileCache("raw_data", ask=FALSE) base.url <- "ftp://ftp.ebi.ac.uk/biostudies/fire/E-MTAB-/061/E-MTAB-5061/Files/" count.file <- bfcrpath(bfc, paste0(base.url, "pancreas_refseq_rpkms_counts_3514sc.txt"))
The file itself is quite complex:
X
columns are the RPKMs, for X
cells.X
columns are the counts.This requires some additional work to extract the useful data. The first line contains the names of the cells, so we can use this to determine the number and indices of the columns with per-cell counts.
col.names <- read.table(count.file, header=FALSE, sep="\t", stringsAsFactors=FALSE, comment.char="", nrows = 1)[,-1] ncells <- length(col.names) what <- vector("list", ncells*2 + 2) what[[1]] <- "character" what[[2]] <- "character" what[seq_len(ncells) + ncells + 2] <- "integer"
We then read in the gene symbols and the counts.
emtab.df <- read.table(count.file, header=FALSE, sep="\t", stringsAsFactors=FALSE, colClasses=what, skip=1) gene.info <- emtab.df[,1:2] emtab.df <- emtab.df[,-(1:2)] colnames(emtab.df) <- col.names dim(emtab.df)
Some coercion is performed to yield a count matrix and a row-level DataFrame
.
library(S4Vectors) counts <- as.matrix(emtab.df) rowdata <- DataFrame(symbol=gene.info[,1], refseq=gene.info[,2])
We retrieve the column metadata fields from ArrayExpress using the same accession number.
meta.fname <- bfcrpath(bfc, paste0(base.url, "E-MTAB-5061.sdrf.txt")) emtab.sdrf <- read.delim(meta.fname, stringsAsFactors=FALSE, check.names=FALSE, row.names=1) colnames(emtab.sdrf)
We make sure that the sample IDs match with the column names of the count matrix.
stopifnot(identical(sort(rownames(emtab.sdrf)), sort(colnames(counts)))) emtab.sdrf <- emtab.sdrf[match(colnames(counts), rownames(emtab.sdrf)),]
We only keep the Characteristics
fields.
The other fields describe relationships to other files/identifiers within ArrayExpress and are not of (primary) interest.
keep <- grep("Characteristics", colnames(emtab.sdrf)) emtab.sdrf <- emtab.sdrf[,keep] colnames(emtab.sdrf) <- sub(".*\\[(.*)\\]", "\\1", colnames(emtab.sdrf))
We also remove fields that only have one level and thus are not really useful.
has.multi.levels <- vapply(emtab.sdrf, function(x) length(unique(x))>1L, TRUE) emtab.sdrf <- emtab.sdrf[,has.multi.levels]
The cell type
field has some "not applicable"
entries that should be more formally represented as NA
s.
lost <- emtab.sdrf$`inferred cell type`=="not applicable" emtab.sdrf$`inferred cell type`[lost] <- NA_character_
We coerce this into a column-level DataFrame
.
coldata <- as(emtab.sdrf, "DataFrame") coldata
Slapping everything together into a SingleCellExperiment
:
library(SingleCellExperiment) sce <- SingleCellExperiment(list(counts=counts), colData=coldata, rowData=rowdata)
We split off the spike-ins.
status <- ifelse(grepl("^ERCC-[0-9]+", rowData(sce)$refseq), "ERCC", "endogenous") sce <- splitAltExps(sce, status, ref="endogenous") spike.exp <- altExp(sce, "ERCC") rownames(spike.exp) <- rowData(spike.exp)$refseq rowData(spike.exp) <- rowData(spike.exp)[,0] # not really useful for spike-ins. # This is wrong for one donor - donor "H1" has 100ul rather than 25ul - # but there's not much that we can do about that in the rowData. library(scRNAseq) spikedata <- countErccMolecules(volume = 25, dilution = 40000) spikedata <- spikedata[rownames(spike.exp), ] rowData(spike.exp) <- cbind(rowData(spike.exp), spikedata) altExp(sce, "ERCC") <- spike.exp
We use symbols as the row names, because the RefSeq identifiers are a mess, what with all the +
symbols everywhere.
rownames(sce) <- rowData(sce)$symbol rowData(sce)$symbol <- NULL
Finally we do some polishing to optimize for space.
library(scRNAseq) sce <- polishDataset(sce) sce
And saving it:
meta <- list( title="Single-Cell Transcriptome Profiling of Human Pancreatic Islets in Health and Type 2 Diabetes", description="Hormone-secreting cells within pancreatic islets of Langerhans play important roles in metabolic homeostasis and disease. However, their transcriptional characterization is still incomplete. Here, we sequenced the transcriptomes of thousands of human islet cells from healthy and type 2 diabetic donors. We could define specific genetic programs for each individual endocrine and exocrine cell type, even for rare δ, γ, ε, and stellate cells, and revealed subpopulations of α, β, and acinar cells. Intriguingly, δ cells expressed several important receptors, indicating an unrecognized importance of these cells in integrating paracrine and systemic metabolic signals. Genes previously associated with obesity or diabetes were found to correlate with BMI. Finally, comparing healthy and T2D transcriptomes in a cell-type resolved manner uncovered candidates for future functional studies. Altogether, our analyses demonstrate the utility of the generated single-cell gene expression resource. Maintainer note: ERCC molecule counts were computed based on the addition of 25 uL of a 1:40000 dilution of ERCC mix 1.", taxonomy_id="9606", genome="GRCh37", sources=list( list(provider="ArrayExpress", id="E-MTAB-5061"), list(provider="PubMed", id="27667667") ), maintainer_name="Aaron Lun", maintainer_email="infinite.monkeys.with.keyboards@gmail.com" ) saveDataset(sce, "2023-12-19_output", meta)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.