library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Overview

Here, we prepare a single-cell RNA sequencing timecourse dataset of mouse gastrulation and early organogenesis. Whole mouse embryos were harvested and dissociated at 6-hour timepoints between embryonic day (E) 6.5 and 8.5. Most timepoints contain several replicate samples. We will set up both the unfiltered count matrices from CellRanger (having removed swapped molecules with DropletUtils::swappedDrops) as well as a highly processed form of the data, as was used in the analyses of Pijuan-Sala et al. (2019).

Preparing the processed data

We obtain the processed count data through the r Biocpkg("BiocFileCache") framework. This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/atlas_data/raw_counts.mtx.gz"))

We load in the count data from the MatrixMarket format, using methods from the r CRANpkg("Matrix") package:

library(Matrix)
counts <- readMM(count.path)
dim(counts)

We download the cell- and gene-level metadata using r Biocpkg("BiocFileCache"), and read them into R. The UMAP coordinates are separated for later inclusion as a reducedDim item. The extensive metadata associated with specific analyses from the publication of this dataset have already been removed.

meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/atlas_data/meta.tab.gz"))
meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE)
umap <- as.matrix(meta.tab[, c("umapX", "umapY")])
colnames(umap) <- c("x", "y")
meta.tab <- meta.tab[, -which(names(meta.tab) %in% c("umapX", "umapY"))]
head(meta.tab)

gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/atlas_data/genes.tsv.gz"))
gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE)
colnames(gene.tab) <- c("ENSEMBL", "SYMBOL")
head(gene.tab)

We store the count matrix in a SingleCellExperiment object with the metadata associated with each cell and gene.

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts), 
    colData=meta.tab, rowData=gene.tab)
sce
rm(counts)
gc()

We also obtain the size factors and store them in sce.

sf <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/atlas_data/sizefactors.tab.gz"))
sf <- read.delim(sf, header=FALSE, stringsAsFactors=FALSE)[,1]
sizeFactors(sce) <- sf
head(sf)

A 50-dimensional batch-corrected principal component representation of the data is also available. We store this in the SingleCellExperiment object. Doublets and stripped nuclei are excluded from these representations - they are represented as NAs in the reducedDim slot, so that the representation is the correct dimension to fit in the SingleCellExperiment object.

pc.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/atlas_data/corrected_pcas.rds"))
pc.list <- readRDS(pc.path)
#following match induces NA values deliberately
pc.all <- pc.list$all[match(colData(sce)$cell, rownames(pc.list$all)),]
rownames(pc.all) <- colData(sce)$cell
reducedDim(sce, "pca.corrected") <- pc.all
head(pc.all[,1:5])

The UMAP coordinates from the colData are also saved; similarly, this excludes doublets and stripped nuclei.

reducedDim(sce, "umap") <- umap

We now save the data, splitting the large SingleCellExperiment object into smaller, sample-wise objects. We then upload these smaller files to r Biocpkg("ExperimentHub"). Splitting up the data allows easier access of specific subsets of the data, and also allows use of the data on low-memory machines.

base <- file.path("MouseGastrulationData", "atlas", "1.0.0")
dir.create(base, recursive=TRUE, showWarnings=FALSE)
saveRDS(rowData(sce), file=paste0(base, "/rowdata.rds"))
for(samp in unique(sce$sample)){
    sub <- sce[, sce$sample == samp]
    saveRDS(counts(sub), 
        file=paste0(base, "/counts-processed-sample", samp, ".rds"))
    saveRDS(colData(sub), 
        file=paste0(base, "/coldata-sample", samp, ".rds"))
    saveRDS(sizeFactors(sub), 
        file=paste0(base, "/sizefac-sample", samp, ".rds"))
    saveRDS(reducedDims(sub), 
        file=paste0(base, "/reduced-dims-sample", samp, ".rds"))
}

Setting up the raw data

We obtain the raw count data through the r Biocpkg("BiocFileCache") framework. Each file contains the raw (unfiltered) count matrix from CellRanger for each sample, with swapped molecules removed via DropletUtils::swappedDrops.

sample.paths <- character(36)
sample.indices <- c(1:10, 12:37)
for (i in seq_along(sample.paths)) {
    fname <- sprintf("sample_%s_unswapped.mtx.gz", sample.indices[i])
    sample.paths[i] <- bfcrpath(bfc, 
        file.path("https://content.cruk.cam.ac.uk/",
       "jmlab/atlas_data/unfiltered", fname))
}
barcode.paths <- character(36)
for (i in seq_along(barcode.paths)) {
    fname <- sprintf("barcodes_%s_unswapped.tsv.gz", sample.indices[i])
    barcode.paths[i] <- bfcrpath(bfc, 
        file.path("https://content.cruk.cam.ac.uk/",
       "jmlab/atlas_data/unfiltered", fname))
}

We read in each sparse matrix and serialize it into a sample-specific RDS file.

collected <- vector("list", length(sample.paths))
for (i in seq_along(collected)) {
    curmat <- readMM(sample.paths[i])
    colnames(curmat) <- read.table(barcode.paths[i], stringsAsFactors=FALSE)[,1]
    saveRDS(curmat, file=file.path(base, sprintf("counts-raw-sample%i.rds", sample.indices[i])))
}

Note that the row-level metadata is the same in both the raw and processed data, and does not need to be re-acquired. Column names of the matrices are the 10x cell barcodes.

Session information

sessionInfo()


MarioniLab/MouseGastrulationData documentation built on Jan. 31, 2024, 11:01 a.m.