library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Downloading the count data

We obtain a single-cell RNA sequencing dataset of the mouse hypothalamus from Romanov et al. (2017). Counts for endogenous genes are available from the Gene Expression Omnibus using the accession number GSE74672. We download and cache it using the r Biocpkg("BiocFileCache") package.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask = FALSE)
base.url <- "ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE74nnn/GSE74672/suppl/"
count.file <- bfcrpath(bfc, paste0(base.url, "GSE74672_expressed_mols_with_classes.xlsx.gz"))

We load them into memory.

library(R.utils)
tmp.file <- tempfile(fileext=".xlsx")
gunzip(count.file, destname=tmp.file, remove=FALSE)

# WARNING: this requires >20 GB of memory!
library(readxl)
all.data <- read_xlsx(tmp.file, col_names=FALSE)
dim(all.data)

Extracting the metadata

The first 9 rows correspond to metadata and need to be extracted.

coldata <- all.data[1:9,]
colfields <- coldata[,1,drop=TRUE]
coldata <- coldata[,-1]

coldata <- t(coldata)
coldata <- data.frame(coldata, stringsAsFactors=FALSE)
rownames(coldata) <- NULL

library(S4Vectors)
coldata <- DataFrame(coldata)
colnames(coldata) <- colfields

A great deal of sanitization is required to obtain the desired types for each column:

for (i in c("level2 cluster number (neurons only)", "total molecules")) {
    coldata[,i] <- as.integer(coldata[,i])
}
for (i in c("age (days postnatal)", "cell diameter")) {
    coldata[,i] <- as.numeric(coldata[,i])
}

We fix the sex so that it's more informative than +/-1.

is.sex <- colnames(coldata) == "sex (female=1,male=-1)"
sex.status <- coldata[,is.sex]
replacement.sex <- rep('unknown', nrow(coldata))
replacement.sex[sex.status == "1"] <- "female"
replacement.sex[sex.status == "-1"] <- "male"
coldata[,is.sex] <- replacement.sex
colnames(coldata)[is.sex] <- "sex"
table(sex.status, useNA="always")

The stress indicator could also be improved to a logical vector:

is.stress <- colnames(coldata) == "acute stress (true=1)"
coldata[,is.stress] <- coldata[,is.stress] == "1"
colnames(coldata)[is.stress] <- "acute stress"

This gives us something a bit cleaner:

coldata

Extracting the counts

The next three rows are empty, so we just skip them.

raw.counts <- all.data[-(1:12),]

We convert the remaining counts into a matrix.

counts <- as.matrix(raw.counts[,-1])
storage.mode(counts) <- "double"
rownames(counts) <- raw.counts[,1,drop=TRUE]
dim(counts)

We also rename the column names:

stopifnot(!anyDuplicated(coldata$cellID))
colnames(counts) <- coldata$cellID
rownames(coldata) <- coldata$cellID
coldata$cellID <- NULL

Saving to file

Putting everything together into a SingleCellExperiment:

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts), colData=coldata)

We split off the ERCCs into an alternative experiment:

status <- ifelse(grepl("^ERCC-[0-9]+", rownames(sce)), "ERCC", "endogenous")
sce <- splitAltExps(sce, status, ref="endogenous")

We apply some polish to optimize disk space:

library(scRNAseq)
sce <- polishDataset(sce)
sce

We now save the object to disk in preparation for upload.

meta <- list(
    title="Molecular interrogation of hypothalamic organization reveals distinct dopamine neuronal subtypes",
    description="The hypothalamus contains the highest diversity of neurons in the brain. Many of these neurons can co-release neurotransmitters and neuropeptides in a use-dependent manner. Investigators have hitherto relied on candidate protein-based tools to correlate behavioral, endocrine and gender traits with hypothalamic neuron identity. Here we map neuronal identities in the hypothalamus by single-cell RNA sequencing. We distinguished 62 neuronal subtypes producing glutamatergic, dopaminergic or GABAergic markers for synaptic neurotransmission and harboring the ability to engage in task-dependent neurotransmitter switching. We identified dopamine neurons that uniquely coexpress the Onecut3 and Nmur2 genes, and placed these in the periventricular nucleus with many synaptic afferents arising from neuromedin S+ neurons of the suprachiasmatic nucleus. These neuroendocrine dopamine cells may contribute to the dopaminergic inhibition of prolactin secretion diurnally, as their neuromedin S+ inputs originate from neurons expressing Per2 and Per3 and their tyrosine hydroxylase phosphorylation is regulated in a circadian fashion. Overall, our catalog of neuronal subclasses provides new understanding of hypothalamic organization and function.",
    taxonomy_id="10090",
    genome="GRCm38",
    sources=list(
        list(provider="GEO", id="GSE74672"),
        list(provider="PubMed", id="27991900")
    ),
    maintainer_name="Aaron Lun",
    maintainer_email="infinite.monkeys.with.keyboards@gmail.com"
)

saveDataset(sce, "2023-12-19_output", meta)

Session information {-}

sessionInfo()


LTLA/scRNAseq documentation built on June 28, 2024, 7:31 p.m.