knitr::opts_chunk$set(echo = TRUE) knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(devtools) load_all("./")
Install the ExpDesign2021 package from github, if you haven't already. If you're knitting this on your own machine, you will want to load knitr too.
#install.packages("remotes") #install.packages("BiocManager") #BiocManager::install("VanAndelInstitute/ExpDesign2021") library(knitr)
To extract just the R code, you can use knitr::knit(input, tangle=TRUE):
# knitr::knit("project2.Rmd", tangle = TRUE) # [1] "project2.R"
Extra credit assignment: figure out whose lab the above came from
Click for hint
They're at the Gladstone Institute now, if memory serves.
For project 2 ("design your own damned experiment"), the class boldly rallied behind Richard Cassidy's suggestion of using single-cell RNA-seq data for an example dataset. This seemed like a great idea: it's big, it's messy, and it accompanies a glam paper for a blue-collar experiment, which is kind of cool. (The paper also covers single-cell vs. single-nucleus RNAseq comparisons, but we won't go there yet.) The cell mixture model (HEK293 human cells and NIH3T3 mouse cells) is straight- forward (mix two types of cells in two test tubes and then split).
Question: why is it called a 'barnyard' experiment?
Click for Answer
A traditional test of single-cell protocols is to mix cells from two species,
and then see if the results can be easily classified into one or the other.
A protocol which does a good job of encapsulating cells will make it easier
to classify each cell by species. Some 'barnyard' experiments have tossed in
canine or chicken cells for variety, but since it's such a common benchmark,
people took to calling any species-mixing single-cell experiment a 'barnyard'
experiment, and the name stuck.
The library preparation methods involved in the paper may bear explanation. Short-read RNA sequencing experiments require cDNA as their input, since most sequencers expect to call bases from DNA; as a consequence, one must first make libraries of cDNA molecules from fragmented mRNA. Numerous approaches exist, six of which are employed here.
The data is available from GEO although it's a bit unwieldy, weighing in at 300MB for the cell mixture model counts alone. Happily, we can load up the metadata without that.
The barcode, UMI, set or BUS format neatly encapsulates the process of transforming reads
resulting from a single-cell RNAseq experiment into counts of molecular barcodes
describing a cell, a molecule, and the likely genomic origin of that molecule.
It's one of the most compact representations possible for raw scRNAseq outputs,
and the figure shows how it facilitates direct comparisons. The Broad used their
scumi
tool instead, but the basic notion ("make everything comparable") and
the resulting sparse matrix of counts is similar in spirit.
# read in the cell names for the count matrix directly from GEO cells <- readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_cell.tsv.gz"))) # in English: "create a gzip connection from the URL at GEO, then read from it" # tidy things up library(stringr) library(tidyverse) # now decode the metadata from it tibble(name = cells) %>% # stringr::str_split takes strings and a split pattern mutate(experiment = str_split(name, "\\.", simplify=TRUE)[,1]) %>% # column 1 mutate(method = str_split(name, "\\.", simplify=TRUE)[,2]) %>% # column 2 mutate(cell = str_split(name, "\\.", simplify=TRUE)[,3]) -> # column 3 celltibble # the result is assigned to the object `celltibble` # how many cells per mixture were run with each method? with(celltibble, table(method, experiment))
Arguably, we don't need more than (say) 300 cells per method for this exercise, and it would be nice not to demolish your computers' RAM (even though we're not using Seurat or some piggish monstrosity like that). So we'll probably chop down the number of cells from which you resample, at least for this project. If you are familiar with block resampling, you already know where this is going.
We can do the same type of thing for the genes involved, albeit without the intention of chopping them down by much if at all:
# read in the gene names for the count matrix directly from GEO genes <- readLines(gzcon(url("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE132nnn/GSE132044/suppl/GSE132044_mixture_hg19_mm10_gene.tsv.gz"))) # in English: "create a gzip connection from the URL at GEO, then read from it" # tidy things up library(stringr) library(tidyverse) # now decode the metadata from it tibble(name = genes) %>% # stringr::str_split(string, pattern_to_split_on) mutate(ensembl = str_split(name, "_", simplify=TRUE)[,2]) %>% # ENS[MUS]G name mutate(genome = str_split(name, "_", simplify=TRUE)[,3]) -> # genome assembly genetibble # assign the result to `genetibble` # how many genes per assembly? with(genetibble, table(genome))
Question: why didn't we use the gene symbols in columns 4:6?
Click for Answer
Because some ENSEMBL gene names map to multiple RefGene/HUGO symbols.
With the release of Bioconductor 3.14, the project includes a tidy single cell experiment (data structure) package, which is great since all other single cell data structures kind of suck. (No, seriously, you'll find out why eventually.) The package is, not coincidentally, called tidySingleCellExperiment:
if (!require("tidySingleCellExperiment")) { BiocManager::install("SingleCellExperiment") library(SingleCellExperiment) BiocManager::install("tidySingleCellExperiment") library(tidySingleCellExperiment) }
Above, I stated that maybe we don't need thousands of cells per method. (This is arguable; in exchange for crappy per-cell results, 10X and similar offer more cells at the price of more dropouts and less sensitivity per cell.) No worries, we'll just go ahead and downsample "enough" cells per method. It turns out this happens to me often enough that I wrote some code to do it:
Click for sample_umis() function code
# adapted from a SingleCellExperiment-centric method for CITEseq sample_umis <- function(umis, meta, block, ideal=300) { stopifnot(nrow(meta) == ncol(umis)) stopifnot(length(block) == nrow(meta)) pops <- sort(table(block)) samplesets <- split(seq_len(nrow(meta)), block) keep <- integer() for (set in names(samplesets)) { sset <- samplesets[[set]] cells <- length(sset) if (cells <= ideal) { pct <- 100 keep <- c(keep, sset) message("Kept ", cells, " cells (", pct, "%) of type ", set, ".") } else { kept <- sample(sset, size=ideal) pct <- round((ideal / cells) * 100) keep <- c(keep, kept) message("Kept ", ideal, " cells (", pct, "%) of type ", set, ".") } } pct <- round((length(keep) / ncol(umis)) * 100, 1) message("Kept ", length(keep), " (", pct, "%) of ", ncol(umis), " cells in ", length(samplesets), " blocks.") umis[, keep] }
Question: can you generate block-random samples from the metadata you have?
Click for answer
Easily, although you'd need to rewrite the above function to omit umis
.
It's possible your machine or instance would crash if you did the following (apparently nobody told IT that a laptop is a device to take RAM on a plane). I'm going to stick to the above "300 cells ought to be enough" and do this:
Click for gory details
# pre-downloaded MatrixMarket file from GEO library(Matrix) if (FALSE) { # put it into a column-sparse Matrix object to avoid wasting heaps of RAM umi_counts <- as(readMM("GSE132044_mixture_hg19_mm10_count_matrix.mtx.gz"), "dgCMatrix") # "read this as if it were already a column-sparse Matrix object" # never trust anyone, including yourself stopifnot(nrow(umi_counts) == nrow(genetibble)) stopifnot(ncol(umi_counts) == nrow(celltibble)) # bolt the dimensions back onto the data rownames(umi_counts) <- genetibble$name colnames(umi_counts) <- celltibble$name # create a basis for blocked downsampling celltibble %>% mutate(block = paste(method, experiment, sep="_")) -> blocktibble # block downsample downsampled <- sample_umis(umi_counts, blocktibble, blocktibble$block) # Kept 300 cells (9%) of type 10x-Chromium-v2_Mixture1. # Kept 300 cells (9%) of type 10x-Chromium-v2_Mixture2. # Kept 300 cells (84%) of type CEL-Seq2_Mixture1. # Kept 300 cells (86%) of type CEL-Seq2_Mixture2. # Kept 300 cells (12%) of type Drop-seq_Mixture1. # Kept 300 cells (8%) of type Drop-seq_Mixture2. # Kept 300 cells (10%) of type inDrops_Mixture1. # Kept 300 cells (12%) of type inDrops_Mixture2. # Kept 299 cells (100%) of type sci-RNA-seq_Mixture1. # Kept 300 cells (6%) of type sci-RNA-seq_Mixture2. # Kept 300 cells (18%) of type Seq-Well_Mixture1. # Kept 300 cells (30%) of type Seq-Well_Mixture2. # Kept 300 cells (88%) of type Smart-seq2_Mixture1. # Kept 300 cells (87%) of type Smart-seq2_Mixture2. # Kept 4199 (15.2%) of 27714 cells in 14 blocks. saveRDS(downsampled, file="barnyard_umi_sample.rds") # online at https://ttriche.github.io/RDS/barnyard_umi_sample.rds }
Right then, let's get to work. You might consider running through some of the steps in the tidySingleCellExperiment vignette to start with. If your laptop or Rstudio Cloud instance still blows up, let me know and I'll create some smaller-er downsamples (also a familiar practice).
Load the necessary packages
# need to install from Bioconductor, # unless we're going to dork around with individual genes right from the start library(SingleCellExperiment) library(tidySingleCellExperiment)
How the experiment object is created, or (faster) loaded from a URL:
if (FALSE) { # if you want to recreate the SingleCellExperiment: downsampled <- readRDS(url("https://ttriche.github.io/RDS/barnyard_umi_sample.rds")) # create the column (cell) annotation data frame column_data <- as(celltibble, "DataFrame") rownames(column_data) <- column_data$name column_data <- column_data[colnames(downsampled), ] # create the row (gene) annotation data frame row_data <- as(genetibble, "DataFrame") rownames(row_data) <- row_data$name row_data <- row_data[rownames(downsampled), ] # create a SingleCellExperiment barnyard <- SingleCellExperiment(SimpleList(counts=downsampled), rowData=row_data, colData=column_data) saveRDS(barnyard, file="barnyard.rds") } # if you just want to load it tidybarnyard <- tidy(readRDS(url("https://ttriche.github.io/RDS/barnyard.rds"))) show(tidybarnyard)
I'd suggest, as a first pass, trying to sort out which are the mouse 3T3 cells, and which are the human HEK293 cells. We can discuss this in class on Monday.
Question: What's the easiest way to distinguish the mouse and human cells?
Hint:
library(tidySingleCellExperiment) rowGenome <- rowData(tidybarnyard)$genome UMIs <- counts(tidybarnyard)
Question: Can you think of a way to distinguish male from female cells?
Hint:
# this requires a bit of domain knowledge, to be added
Question: Are the two answers above roughly equivalent? Why?
Bonus points if you can say where the two cell lines originally came from, and why the HEK cells don't have the usual PHI-non-compliant names from that time.
Question: what is a UMI and why does it matter in single-cell RNAseq data?
Load up either the UMI matrix or the SingleCellExperiment (tidy or untidy) and
look for the counts
assay. (You may need to just use counts(tidysce)
.) If
your computer survives neither of these, you can try using summary statistics,
but plan on getting an even smaller subsample to use for the actual Project.
Or, just go back and look at the BUStools figure.
Question: what shall we investigate for project 2?
I'm thinking the proportions of the mixtures, but I want to fit those first. Note that a t-test on proportions is not necessarily "proper"... !
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.