library(BiocStyle) knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
Here, we prepare a single-cell RNA sequencing timecourse dataset of mouse gastrulation and early organogenesis.
Whole mouse embryos were harvested and dissociated at 6-hour timepoints between embryonic day (E) 6.5 and 8.5.
Most timepoints contain several replicate samples.
We will set up both the unfiltered count matrices from CellRanger (having removed swapped molecules with DropletUtils::swappedDrops
) as well as a highly processed form of the data, as was used in the analyses of Pijuan-Sala et al. (2019).
We obtain the processed count data through the r Biocpkg("BiocFileCache")
framework.
This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.
library(BiocFileCache) bfc <- BiocFileCache("raw_data", ask=FALSE) count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/raw_counts.mtx.gz"))
We load in the count data from the MatrixMarket format, using methods from the r CRANpkg("Matrix")
package:
library(Matrix) counts <- readMM(count.path) dim(counts)
We download the cell- and gene-level metadata using r Biocpkg("BiocFileCache")
, and read them into R.
The UMAP coordinates are separated for later inclusion as a reducedDim
item.
The extensive metadata associated with specific analyses from the publication of this dataset have already been removed.
meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/meta.tab.gz")) meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE) umap <- as.matrix(meta.tab[, c("umapX", "umapY")]) colnames(umap) <- c("x", "y") meta.tab <- meta.tab[, -which(names(meta.tab) %in% c("umapX", "umapY"))] head(meta.tab) gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/genes.tsv.gz")) gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE) colnames(gene.tab) <- c("ENSEMBL", "SYMBOL") head(gene.tab)
We store the count matrix in a SingleCellExperiment
object with the metadata associated with each cell and gene.
library(SingleCellExperiment) sce <- SingleCellExperiment(list(counts=counts), colData=meta.tab, rowData=gene.tab) sce
rm(counts) gc()
We also obtain the size factors and store them in sce
.
sf <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/sizefactors.tab.gz")) sf <- read.delim(sf, header=FALSE, stringsAsFactors=FALSE)[,1] sizeFactors(sce) <- sf head(sf)
A 50-dimensional batch-corrected principal component representation of the data is also available.
We store this in the SingleCellExperiment
object.
Doublets and stripped nuclei are excluded from these representations - they are represented as NA
s in the reducedDim
slot, so that the representation is the correct dimension to fit in the SingleCellExperiment
object.
pc.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/corrected_pcas.rds")) pc.list <- readRDS(pc.path) #following match induces NA values deliberately pc.all <- pc.list$all[match(colData(sce)$cell, rownames(pc.list$all)),] rownames(pc.all) <- colData(sce)$cell reducedDim(sce, "pca.corrected") <- pc.all head(pc.all[,1:5])
The UMAP coordinates from the colData
are also saved; similarly, this excludes doublets and stripped nuclei.
reducedDim(sce, "umap") <- umap
We now save the data, splitting the large SingleCellExperiment
object into smaller, sample-wise objects.
We then upload these smaller files to r Biocpkg("ExperimentHub")
.
Splitting up the data allows easier access of specific subsets of the data, and also allows use of the data on low-memory machines.
base <- file.path("MouseGastrulationData", "atlas", "1.0.0") dir.create(base, recursive=TRUE, showWarnings=FALSE) saveRDS(rowData(sce), file=paste0(base, "/rowdata.rds")) for(samp in unique(sce$sample)){ sub <- sce[, sce$sample == samp] saveRDS(counts(sub), file=paste0(base, "/counts-processed-sample", samp, ".rds")) saveRDS(colData(sub), file=paste0(base, "/coldata-sample", samp, ".rds")) saveRDS(sizeFactors(sub), file=paste0(base, "/sizefac-sample", samp, ".rds")) saveRDS(reducedDims(sub), file=paste0(base, "/reduced-dims-sample", samp, ".rds")) }
We obtain the raw count data through the r Biocpkg("BiocFileCache")
framework.
Each file contains the raw (unfiltered) count matrix from CellRanger for each sample, with swapped molecules removed via DropletUtils::swappedDrops
.
sample.paths <- character(36) sample.indices <- c(1:10, 12:37) for (i in seq_along(sample.paths)) { fname <- sprintf("sample_%s_unswapped.mtx.gz", sample.indices[i]) sample.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/unfiltered", fname)) } barcode.paths <- character(36) for (i in seq_along(barcode.paths)) { fname <- sprintf("barcodes_%s_unswapped.tsv.gz", sample.indices[i]) barcode.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/atlas_data/unfiltered", fname)) }
We read in each sparse matrix and serialize it into a sample-specific RDS file.
collected <- vector("list", length(sample.paths)) for (i in seq_along(collected)) { curmat <- readMM(sample.paths[i]) colnames(curmat) <- read.table(barcode.paths[i], stringsAsFactors=FALSE)[,1] saveRDS(curmat, file=file.path(base, sprintf("counts-raw-sample%i.rds", sample.indices[i]))) }
Note that the row-level metadata is the same in both the raw and processed data, and does not need to be re-acquired. Column names of the matrices are the 10x cell barcodes.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.