library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)

Overview

Here, we prepare single-cell RNA sequencing datasets of chimeric embryos. These chimeras study the effect of Tal1 knock-out in early mouse embryogenesis.

This dataset contains four 10X Genomics samples, with two technical replicates for each of the wild-type (host) and Tal1 (injected) chimeric conditions. We will set up both the unfiltered count matrices from CellRanger (having removed swapped molecules with DropletUtils::swappedDrops) as well as a highly processed form of the data. For the latter, full details of the processing can be found at https://github.com/MarioniLab/EmbryoTimecourse2018.

Preparing the processed data

We obtain the processed count data through the r Biocpkg("BiocFileCache") framework. This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.

library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/raw_counts.mtx.gz"))

We load in the count data from the MatrixMarket format, using methods from the r CRANpkg("Matrix") package:

library(Matrix)
counts <- readMM(count.path)
dim(counts)

We download the cell- and gene-level metadata using r Biocpkg("BiocFileCache"), and read them into R. Several columns of the cell-level metadata contain information highly specific to analyses performed in Pijuan-Sala et al.; these are removed. We also explicitly indicate in the metadata that the samples derived from a single embryo pool.

meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/meta.tab.gz"))
meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE)
meta.tab <- meta.tab[, !grepl("haem", colnames(meta.tab))]
meta.tab$pool <- 1
head(meta.tab)

gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/genes.tsv.gz"))
gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE)
colnames(gene.tab) <- c("ENSEMBL", "SYMBOL")
head(gene.tab)

We store the count matrix in a SingleCellExperiment object with the metadata associated with each cell and gene.

library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts), 
    colData=meta.tab, rowData=gene.tab)
sce

We also obtain the size factors and store them in sce.

sf <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/sizefactors.tab.gz"))
sf <- read.delim(sf, header=FALSE, stringsAsFactors=FALSE)[,1]
sizeFactors(sce) <- sf
head(sf)

A 50-dimensional batch-corrected principal component representation of the data is also available. We store this in the SingleCellExperiment object. Doublets and stripped nuclei are excluded from these representations - they are represented as NAs in the reducedDim slot, so that the representation is the correct dimension to fit in the SingleCellExperiment object.

pc.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/corrected_pcas_nodoubstripped.rds"))
pc.list <- readRDS(pc.path)
#following match induces NA values deliberately
pc <- pc.list$all[match(colData(sce)$cell, rownames(pc.list$all)),]
rownames(pc) <- colData(sce)$cell
reducedDim(sce, "pca.corrected") <- pc
head(pc[,1:5])

We now save the data, splitting the large SingleCellExperiment object into smaller, sample-wise objects. We then upload these smaller files to r Biocpkg("ExperimentHub"). Splitting up the data allows easier access of specific subsets of the data, and also allows use of the data on low-memory machines.

base <- file.path("MouseGastrulationData", "tal1-chimera", "1.0.0")
dir.create(base, recursive=TRUE, showWarnings=FALSE)
saveRDS(rowData(sce), file=paste0(base, "/rowdata.rds"))
for(samp in unique(sce$sample)){
    sub <- sce[, sce$sample == samp]
    saveRDS(counts(sub), 
        file=paste0(base, "/counts-processed-sample", samp, ".rds"))
    saveRDS(colData(sub), 
        file=paste0(base, "/coldata-sample", samp, ".rds"))
    saveRDS(sizeFactors(sub), 
        file=paste0(base, "/sizefac-sample", samp, ".rds"))
    saveRDS(reducedDims(sub), 
        file=paste0(base, "/reduced-dims-sample", samp, ".rds"))
}

Setting up the raw data

We obtain the raw count data through the r Biocpkg("BiocFileCache") framework. Each file contains the raw (unfiltered) count matrix from CellRanger for each sample, with swapped molecules removed via DropletUtils::swappedDrops.

sample.paths <- character(4)
for (i in seq_along(sample.paths)) {
    fname <- sprintf("sample_%s_unswapped.mtx.gz", i)
    sample.paths[i] <- bfcrpath(bfc, 
        file.path("https://content.cruk.cam.ac.uk/",
       "jmlab/chimera_tal1_data/unfiltered", fname))
}
barcode.paths <- character(4)
for (i in seq_along(barcode.paths)) {
    fname <- sprintf("barcodes_%s_unswapped.tsv.gz", i)
    barcode.paths[i] <- bfcrpath(bfc, 
        file.path("https://content.cruk.cam.ac.uk/",
       "jmlab/chimera_tal1_data/unfiltered", fname))
}

We read in each sparse matrix and serialize it into a sample-specific RDS file.

collected <- vector("list", length(sample.paths))
for (i in seq_along(collected)) {
    curmat <- readMM(sample.paths[i])
    colnames(curmat) <- read.table(barcode.paths[i], stringsAsFactors=FALSE)[,1]
    saveRDS(curmat, file=file.path(base, sprintf("counts-raw-sample%i.rds", i)))
}

Note that the row-level metadata is the same in both the raw and processed data, and does not need to be re-acquired. Column names of the matrices are the 10x cell barcodes.

Session information

sessionInfo()


MarioniLab/MouseGastrulationData documentation built on Jan. 31, 2024, 11:01 a.m.