library(BiocStyle)
knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
Here, we prepare single-cell RNA sequencing datasets of chimeric embryos. These chimeras are used to study the effect of Tal1 knock-out in early mouse embryogenesis.
This dataset contains four 10X Genomics samples, with two technical replicates for each of the wild-type (host) and Tal1 (injected) chimeric conditions.
We will set up both the unfiltered count matrices from CellRanger (having removed swapped molecules with DropletUtils::swappedDrops) as well as a highly processed form of the data.
For the latter, full details of the processing can be found at https://github.com/MarioniLab/EmbryoTimecourse2018.
We obtain the processed count data through the BiocFileCache framework.
This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.
library(BiocFileCache)
bfc <- BiocFileCache("raw_data", ask=FALSE)
count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/raw_counts.mtx.gz"))
We load in the count data from the MatrixMarket format, using methods from the Matrix package:
library(Matrix)
counts <- readMM(count.path)
dim(counts)
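To illustrate what `readMM` hands back, the sketch below round-trips a toy matrix through the MatrixMarket format. The toy values and the temporary file are invented for this example and are not part of the dataset; only `sparseMatrix`, `writeMM`, `readMM` and the coercion to `"CsparseMatrix"` come from the Matrix package.

```r
library(Matrix)

# Toy 3 x 4 sparse matrix (hypothetical values, not from the dataset).
toy <- sparseMatrix(i=c(1, 3, 2, 3), j=c(1, 1, 3, 4),
    x=c(5, 2, 7, 1), dims=c(3, 4))

# Write to a temporary MatrixMarket file and read it back.
mm.file <- tempfile(fileext=".mtx")
writeMM(toy, mm.file)
back <- readMM(mm.file)

# readMM() returns a triplet-format (TsparseMatrix) object without dimnames.
class(back)

# Column-compressed storage is usually more convenient for per-cell subsetting.
back.csc <- as(back, "CsparseMatrix")
```

The coercion step is optional here; it is shown only because many downstream operations on count matrices work column by column.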
We download the cell- and gene-level metadata using BiocFileCache, and read them into R.
Several columns of the cell-level metadata contain information highly specific to analyses performed in Pijuan-Sala et al.; these are removed.
We also explicitly indicate in the metadata that the samples were derived from a single embryo pool.
meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/meta.tab.gz"))
meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE)
meta.tab <- meta.tab[, !grepl("haem", colnames(meta.tab))]
meta.tab$pool <- 1
head(meta.tab)

gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/genes.tsv.gz"))
gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE)
colnames(gene.tab) <- c("ENSEMBL", "SYMBOL")
head(gene.tab)
We store the count matrix in a SingleCellExperiment object with the metadata associated with each cell and gene.
library(SingleCellExperiment)
sce <- SingleCellExperiment(list(counts=counts),
    colData=meta.tab, rowData=gene.tab)
sce
We also obtain the size factors and store them in sce.
sf <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/sizefactors.tab.gz"))
sf <- read.delim(sf, header=FALSE, stringsAsFactors=FALSE)[,1]
sizeFactors(sce) <- sf
head(sf)
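For readers unfamiliar with size factors, the following base R sketch shows how they are conventionally applied: each cell's counts are divided by that cell's size factor before log-transformation. The toy counts, the toy size factors and the pseudo-count of 1 are assumptions made for illustration; this is not a description of how the stored size factors were computed.

```r
# Toy counts: 3 genes x 3 cells (hypothetical values). The three cells have
# the same composition but were sequenced to different depths.
toy.counts <- matrix(c(1, 0, 2,
                       2, 0, 4,
                       4, 0, 8), nrow=3,
    dimnames=list(paste0("gene", 1:3), paste0("cell", 1:3)))

# One size factor per cell (hypothetical values).
toy.sf <- c(cell1=0.5, cell2=1, cell3=2)

# Divide each column (cell) by its size factor.
normalized <- sweep(toy.counts, 2, toy.sf, "/")

# Log-transform with a pseudo-count of 1 (an assumed, commonly used choice).
logcounts <- log2(normalized + 1)
```

After scaling, the three toy cells have identical profiles, which is the intended effect of removing depth differences.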
A 50-dimensional batch-corrected principal component representation of the data is also available.
We store this in the SingleCellExperiment object.
Doublets and stripped nuclei are excluded from these representations; they are represented as NAs in the reducedDim slot, so that the representation has the correct dimensions to fit in the SingleCellExperiment object.
pc.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
    "jmlab/chimera_tal1_data/corrected_pcas_nodoubstripped.rds"))
pc.list <- readRDS(pc.path)
# The following match() deliberately induces NA values for excluded cells.
pc <- pc.list$all[match(colData(sce)$cell, rownames(pc.list$all)),]
rownames(pc) <- colData(sce)$cell
reducedDim(sce, "pca.corrected") <- pc
head(pc[,1:5])
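The padding with NAs relies on how `match` and matrix subsetting interact in base R. The sketch below reproduces the idea on made-up data; the cell names and coordinates are hypothetical stand-ins, not values from the dataset.

```r
set.seed(100)

# Five cells in total, of which only three have coordinates (hypothetical).
all.cells <- paste0("cell_", 1:5)
kept.cells <- c("cell_4", "cell_1", "cell_3")
toy.pcs <- matrix(rnorm(length(kept.cells) * 2), ncol=2,
    dimnames=list(kept.cells, c("PC1", "PC2")))

# match() returns NA for cells absent from the PC matrix, and subsetting
# rows by NA yields all-NA rows, so the result has one row per cell.
idx <- match(all.cells, rownames(toy.pcs))
expanded <- toy.pcs[idx, , drop=FALSE]
rownames(expanded) <- all.cells

# Cells without coordinates can be identified and dropped before use.
has.pcs <- !is.na(expanded[, 1])
usable <- expanded[has.pcs, , drop=FALSE]
```

In practice, this means that any code consuming the corrected PCs should filter on the NA rows first, as in the last two lines.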
We now save the data, splitting the large SingleCellExperiment object into smaller, sample-wise objects.
We then upload these smaller files to ExperimentHub.
Splitting up the data allows easier access to specific subsets of the data, and also allows use of the data on low-memory machines.
base <- file.path("MouseGastrulationData", "tal1-chimera", "1.0.0")
dir.create(base, recursive=TRUE, showWarnings=FALSE)
saveRDS(rowData(sce), file=paste0(base, "/rowdata.rds"))
for(samp in unique(sce$sample)){
    sub <- sce[, sce$sample == samp]
    saveRDS(counts(sub), file=paste0(base, "/counts-processed-sample", samp, ".rds"))
    saveRDS(colData(sub), file=paste0(base, "/coldata-sample", samp, ".rds"))
    saveRDS(sizeFactors(sub), file=paste0(base, "/sizefac-sample", samp, ".rds"))
    saveRDS(reducedDims(sub), file=paste0(base, "/reduced-dims-sample", samp, ".rds"))
}
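The benefit of the sample-wise layout can be seen in a small base R sketch: a single sample can be loaded on its own, or all pieces can be recombined. The toy matrix, sample labels and temporary directory are hypothetical; only the file naming pattern mirrors the one used above.

```r
# Toy data: 2 genes x 6 cells from two samples (hypothetical values).
toy.counts <- matrix(1:12, nrow=2,
    dimnames=list(c("geneA", "geneB"), paste0("cell", 1:6)))
toy.sample <- c(1, 1, 2, 2, 2, 1)

out.dir <- tempfile("split-demo-")
dir.create(out.dir)

# Save one file per sample.
for (samp in unique(toy.sample)) {
    saveRDS(toy.counts[, toy.sample == samp, drop=FALSE],
        file=file.path(out.dir, paste0("counts-processed-sample", samp, ".rds")))
}

# Load a single sample without touching the others.
only.two <- readRDS(file.path(out.dir, "counts-processed-sample2.rds"))

# Or reassemble everything; the columns come back grouped by sample.
pieces <- lapply(sort(unique(toy.sample)), function(samp) {
    readRDS(file.path(out.dir, paste0("counts-processed-sample", samp, ".rds")))
})
combined <- do.call(cbind, pieces)
```

Note that recombining changes the column order relative to the original object, so any per-cell metadata should be reordered in the same way.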
We obtain the raw count data through the BiocFileCache framework.
Each file contains the raw (unfiltered) count matrix from CellRanger for each sample, with swapped molecules removed via DropletUtils::swappedDrops.
sample.paths <- character(4)
for (i in seq_along(sample.paths)) {
    fname <- sprintf("sample_%s_unswapped.mtx.gz", i)
    sample.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
        "jmlab/chimera_tal1_data/unfiltered", fname))
}

barcode.paths <- character(4)
for (i in seq_along(barcode.paths)) {
    fname <- sprintf("barcodes_%s_unswapped.tsv.gz", i)
    barcode.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/",
        "jmlab/chimera_tal1_data/unfiltered", fname))
}
We read in each sparse matrix and serialize it into a sample-specific RDS file.
# Read each sample's sparse matrix, label its columns with the cell barcodes,
# and save it as a sample-specific RDS file.
for (i in seq_along(sample.paths)) {
    curmat <- readMM(sample.paths[i])
    colnames(curmat) <- read.table(barcode.paths[i], stringsAsFactors=FALSE)[,1]
    saveRDS(curmat, file=file.path(base, sprintf("counts-raw-sample%i.rds", i)))
}
Note that the row-level metadata is the same in both the raw and processed data, and does not need to be re-acquired. Column names of the matrices are the 10x cell barcodes.
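Because the barcodes are read from a separate file, it may be worth checking that they line up with the matrix columns before assigning them. The sketch below shows one way to do this on toy files; the matrix values, barcode strings and temporary files are all invented for illustration.

```r
library(Matrix)

# Toy stand-ins for one sample's matrix and barcode file (hypothetical values).
toy <- sparseMatrix(i=c(1, 2, 2), j=c(1, 2, 3), x=c(3, 1, 4), dims=c(2, 3))
mat.file <- tempfile(fileext=".mtx")
writeMM(toy, mat.file)

bc.file <- tempfile(fileext=".tsv.gz")
con <- gzfile(bc.file, "w")
writeLines(c("AAAC-1", "AAAG-1", "AAAT-1"), con)
close(con)

curmat <- readMM(mat.file)
barcodes <- read.table(bc.file, stringsAsFactors=FALSE)[, 1]

# Guard against a mismatch between matrix columns and barcodes before naming.
stopifnot(length(barcodes) == ncol(curmat), !anyDuplicated(barcodes))
colnames(curmat) <- barcodes
```

The `stopifnot` call fails early if a barcode file is truncated or paired with the wrong matrix, rather than silently mislabelling cells.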
sessionInfo()