library(BiocStyle) knitr::opts_chunk$set(error=FALSE, message=FALSE, warning=FALSE)
Here, we prepare single-cell RNA sequencing datasets of chimeric embryos. Wild-type cells (with for the addition of a fluorescent construct) were injected into wild-type embryos, to identify baseline effects from the generation of chimeras.
This dataset contains ten 10X Genomics samples, with embryo pools split between the E7.5 and E8.5 timepoints.
Two pools are present for E7.5, and three for E8.5.
We will set up both the unfiltered count matrices from CellRanger (having removed swapped molecules with DropletUtils::swappedDrops
) as well as a highly processed form of the data.
We obtain the processed count data through the r Biocpkg("BiocFileCache")
framework.
This caches the data locally upon the first download, avoiding the need to repeat the download on subsequent analyses.
library(BiocFileCache) bfc <- BiocFileCache("raw_data", ask=FALSE) count.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/raw_counts.mtx.gz"))
We load in the count data from the MatrixMarket format, using methods from the r CRANpkg("Matrix")
package:
library(Matrix) counts <- readMM(count.path) dim(counts)
We download the cell- and gene-level metadata using r Biocpkg("BiocFileCache")
, and read them into R.
meta.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/meta.tab.gz")) meta.tab <- read.delim(meta.path, stringsAsFactors=FALSE) head(meta.tab) gene.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/genes.tsv.gz")) gene.tab <- read.delim(gene.path, header=FALSE, stringsAsFactors=FALSE) colnames(gene.tab) <- c("ENSEMBL", "SYMBOL") head(gene.tab)
We store the count matrix in a SingleCellExperiment
object with the metadata associated with each cell and gene.
library(SingleCellExperiment) sce <- SingleCellExperiment(list(counts=counts), colData=meta.tab, rowData=gene.tab) sce
We also obtain the size factors and store them in sce
.
sf <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/sizefactors.tab.gz")) sf <- read.delim(sf, header=FALSE, stringsAsFactors=FALSE)[,1] sizeFactors(sce) <- sf head(sf)
A 50-dimensional batch-corrected principal component representation of the data for each timepoint is also available.
We store this in the SingleCellExperiment
object.
Doublets and stripped nuclei are excluded from these representations - they are represented as NA
s in the reducedDim
slot, so that the representation is the correct dimension to fit in the SingleCellExperiment
object.
pc.path <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/corrected_pcas.rds")) pc.list <- readRDS(pc.path) #following match induces NA values deliberately pc.75 <- pc.list[["E7.5"]][match(colData(sce)$cell, rownames(pc.list[["E7.5"]])),] pc.85 <- pc.list[["E8.5"]][match(colData(sce)$cell, rownames(pc.list[["E8.5"]])),] rownames(pc.75) <- rownames(pc.85) <- colData(sce)$cell reducedDim(sce, "pca.corrected.E7.5") <- pc.75 reducedDim(sce, "pca.corrected.E8.5") <- pc.85 head(pc.75[,1:5])
We now save the data, splitting the large SingleCellExperiment
object into smaller, sample-wise objects.
We then upload these smaller files to r Biocpkg("ExperimentHub")
.
Splitting up the data allows easier access of specific subsets of the data, and also allows use of the data on low-memory machines.
base <- file.path("MouseGastrulationData", "wt-chimera", "1.0.0") dir.create(base, recursive=TRUE, showWarnings=FALSE) #rowdata the same for all data saveRDS(rowData(sce), file=paste0(base, "/rowdata.rds")) for(samp in unique(sce$sample)){ sub <- sce[, sce$sample == samp] saveRDS(counts(sub), file=paste0(base, "/counts-processed-sample", samp, ".rds")) saveRDS(colData(sub), file=paste0(base, "/coldata-sample", samp, ".rds")) saveRDS(sizeFactors(sub), file=paste0(base, "/sizefac-sample", samp, ".rds")) saveRDS(reducedDims(sub), file=paste0(base, "/reduced-dims-sample", samp, ".rds")) }
We obtain the raw count data through the r Biocpkg("BiocFileCache")
framework.
Each file contains the raw (unfiltered) count matrix from CellRanger for each sample, with swapped molecules removed via DropletUtils::swappedDrops
.
sample.paths <- character(10) for (i in seq_along(sample.paths)) { fname <- sprintf("sample_%s_unswapped.mtx.gz", i) sample.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/unfiltered", fname)) } barcode.paths <- character(10) for (i in seq_along(barcode.paths)) { fname <- sprintf("barcodes_%s_unswapped.tsv.gz", i) barcode.paths[i] <- bfcrpath(bfc, file.path("https://content.cruk.cam.ac.uk/", "jmlab/chimera_wt_data/unfiltered", fname)) }
We read in each sparse matrix and serialize it into a sample-specific RDS file.
collected <- vector("list", length(sample.paths)) for (i in seq_along(collected)) { curmat <- readMM(sample.paths[i]) colnames(curmat) <- read.table(barcode.paths[i], stringsAsFactors=FALSE)[,1] saveRDS(curmat, file=file.path(base, sprintf("counts-raw-sample%i.rds", i))) }
Note that the row-level metadata is the same in both the raw and processed data, and does not need to be re-acquired. Column names of the matrices are the 10x cell barcodes.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.