library(BiocStyle) knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
We obtain a single-cell RNA sequencing dataset of human pancreas from @segerstolpe2016single.
A count matrix is provided in the ArrayExpress entry for this project.
We download it using r Biocpkg("BiocFileCache")
to cache the results:
library(BiocFileCache) bfc <- BiocFileCache("raw_data", ask=FALSE) emat <- bfcrpath(bfc, file.path("https://www.ebi.ac.uk/arrayexpress", "experiments/E-MTAB-5061/files/E-MTAB-5061.processed.1.zip")) unzip(emat, list=TRUE)
The file itself is quite complex:
X
columns are the RPKMs, for X
cells.X
columns are the counts.This requires some additional work to extract the useful data. The first line contains the names of the cells, so we can use this to determine the number and indices of the columns with per-cell counts.
count.file <- "pancreas_refseq_rpkms_counts_3514sc.txt" col.names <- read.table(unz(emat, count.file), header=FALSE, sep="\t", stringsAsFactors=FALSE, comment.char="", nrows = 1)[,-1] ncells <- length(col.names) what <- vector("list", ncells*2 + 2) what[[1]] <- "character" what[[2]] <- "character" what[seq_len(ncells) + ncells + 2] <- "integer"
We then read in the gene symbols and the counts.
emtab.df <- read.table(unz(emat, count.file), header=FALSE, sep="\t", stringsAsFactors=FALSE, colClasses=what, skip=1) gene.info <- emtab.df[,1:2] emtab.df <- emtab.df[,-(1:2)] colnames(emtab.df) <- col.names dim(emtab.df)
Some coercion is performed to yield a count matrix and a row-level DataFrame
.
library(S4Vectors) counts <- as.matrix(emtab.df) rowdata <- DataFrame(symbol=gene.info[,1], refseq=gene.info[,2])
We retrieve the column metadata fields from ArrayExpress using the same accession number.
meta.fname <- bfcrpath(bfc, file.path("https://www.ebi.ac.uk/arrayexpress", "files/E-MTAB-5061/E-MTAB-5061.sdrf.txt")) emtab.sdrf <- read.delim(meta.fname, stringsAsFactors=FALSE, check.names=FALSE) colnames(emtab.sdrf)
We make sure that the sample IDs match with the column names of the count matrix.
stopifnot(identical(sort(emtab.sdrf[["Source Name"]]), sort(colnames(emtab.df)))) emtab.sdrf <- emtab.sdrf[match(colnames(counts), emtab.sdrf[["Source Name"]]),]
We only keep the Source Name
and Characteristics
fields.
The other fields describe relationships to other files/identifiers within ArrayExpress and are not of (primary) interest.
keep <- grep("Characteristics", colnames(emtab.sdrf)) emtab.sdrf <- emtab.sdrf[,c(1, keep)] colnames(emtab.sdrf) <- sub(".*\\[(.*)\\]", "\\1", colnames(emtab.sdrf))
We also remove fields that only have one level and thus are not really useful.
has.multi.levels <- vapply(emtab.sdrf, function(x) length(unique(x))>1L, TRUE) emtab.sdrf <- emtab.sdrf[,has.multi.levels]
The cell type
field has some "not applicable"
entries that should be more formally represented as NA
s.
lost <- emtab.sdrf$`cell type`=="not applicable" emtab.sdrf$`cell type`[lost] <- NA_character_
We coerce this into a column-level DataFrame
.
coldata <- as(emtab.sdrf, "DataFrame") coldata
We now save all of the components to file for upload to r Biocpkg("ExperimentHub")
.
These will be used to construct a SingleCellExperiment
on the client side when the dataset is requested.
path <- file.path("scRNAseq", "segerstolpe-pancreas", "2.0.0") dir.create(path, showWarnings=FALSE, recursive=TRUE) saveRDS(counts, file=file.path(path, "counts.rds")) saveRDS(rowdata, file=file.path(path, "rowdata.rds")) saveRDS(coldata, file=file.path(path, "coldata.rds"))
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.