In nukappa/dropseq: dropbead

library(dropbead)
library(data.table)
library(xlsx)
library(rJava)
library(xlsxjars)
library(Rsamtools)
reads.by.cell <- data.frame(fread('zcat < /mydaten/projects/hek3t3/data/ds_013_50fix/out_readcounts.txt.gz'))
dge.matrix=data.frame(fread("zcat < /mydaten/projects/hek3t3/data/ds_013_50fix/dge.txt.gz"), row.names = 1)

Assess basic read and downstream statistics

Extract read statistics

Dropbead provides two functions to readily extract useful statistics from the data. The first one, extractReadStatistics, takes the BAM file as an argument and computes the total number of (uniquely mapped) reads, the percentages of intronic/intergenic etc. The output is a data.frame as follows:

extractReadStatistics('/mydaten/projects/drosophila_brain/data/ds_032_WT_rep1/star_gene_exon_tagged_corrected.bam', yieldSize=10^6, sample.name='Drop-seq')

The read numbers are in million. The intronic, intergenic, coding and UTR reads do not get a gene annotation tag (GE:) and are discarded from the Drop-seq pipeline. The last column counts the total number of barcodes seen in the data, most of them being artifacts from sequencing errors.

Knee plot and the number of cells

Dropbead provides a function to computationally estimate the number of cells present in the sample, by calculating the inflection point of the cumulative fraction of reads against the cell barcodes. Assuming that reads.by.cell is the data.frame with the reads per cell (the output of BAMTagHistogram from the Drop-seq toolkit),

plotCumulativeFractionOfReads(reads.by.cell, 
                              cutoff = 10000, 
                              draw.infl.point = TRUE)

Extract downstream statistics

The following function creates a table with the basic downstream statistics, such as the cell number and the median number of reads, genes and UMIs per cell.

extractDownstreamStatistics('~/dropseq-data/', min.umi=250, read.stats=NULL)

The total number of UMIs in the DGE is given in millions. The PCR column provides the reads to UMIs ratio, so that the smaller, the better. Passing the data.frame output of the extractReadStatistics function into the read.stats argument provides a more precise calculation of the PCR over-amplification bias.

Mixed species experiments

Dropbead offers two main classes to store samples, the SingleSpeciesSample class and the MixedSpeciesSample class, the latter containing the former.

Mixed species experiments are important to estimate the number of doublets in the samples. Dropbead assumes that a digital gene expression matrix (DGE) has already been generated and the genes for the two species are separated by a prefix. For mixed human/mouse samples the prefixes are hg_ and mm_ respectively. Assuming that the DGE has been loaded in dge.matrix,

# The object containing the sample
mms <- new("MixedSpeciesSample", 
           species1="human", 
           species2="mouse",
           dge=dge.matrix)

The number of cells and genes are stored and automatically updated while using dropbead's functions

length(mms@cells) # number of cells in the sample
length(mms@genes) # number of genes detected

Some of the barcodes might correspond to the same cells (essentially after using the DetectBeadSynthesisErrors Drop-seq tool) and have to be collapsed. Such cells are identified by

listCellsToCollapse(mms)

In the example above there are 5 pairs of cells which need to be collapsed. This is straightforward with using

mms <- collapseCellsByBarcode(mms)

We can verify that the number of cells is now the correct one

length(mms@cells)

Handling mixed species samples

Calling classifyCellsAndDoublets separates the species and returns a data.frame with number of transcripts per cell per species. The threshold controls the purity of the resulting cells, while min.trans is the minimum number of UMIs required to keep a cell.

head(classifyCellsAndDoublets(mms, 
                              threshold = 0.9, 
                              min.trans = 1000))

The output of classifyCellsAndDoublets can be directly send for plotting

plotCellTypes(classifyCellsAndDoublets(mms, threshold = 0.9, min.trans = 5000))

The number of genes per cell is computed via

head(computeGenesPerCell(mms))

and similarly for transcripts (UMIs) with the computeTranscriptsPerCell function. Their output can be send directly for plotting and visualization

plotViolin(computeTranscriptsPerCell(mms), 
           attribute = "UMIs")
plotHistogram(computeTranscriptsPerCell(mms), 
              attribute = 'UMIs')

The above functions are polymorphic and can be also used for SingleSpeciesSample objects. Splitting the samples is always performed by internal functions automatically. If the user wants to restrict to only one species, this is done by the splitMixedSpeciesSampleToSingleSpecies function. which returns a list of the two SingleSpeciesSample objects.

# Extracting the human cells as a separate sample for further analysis
h <- splitMixedSpeciesSampleToSingleSpecies(mms,
                                            threshold = 0.9)[[1]]
class(h)

Single species

Filtering functions

There are a couple of functions to remove low quality cells and genes, such as

h.f1 <- keepBestCells(h, num.cells = 100) # keep only the top 100 cells
h.f2 <- keepBestCells(h, min.num.trans = 1000) # keep cells with at least 1000 UMIs
h.f3 <- removeLowQualityCells(h, min.genes = 2000) # remove cells which don't express at least 2000 genes
h.f4 <- removeLowQualityGenes(h, min.cells = 3) # remove genes which are not expressed in at least 3 cells

with obvious usage.

Comparing gene expression measurements

Reproducibility and correlations of different samples are easily assessed via the compareGeneExpressionLevels function, for instance,

compareGeneExpressionLevels(h.f2, h.f1, 
                            name1 = 'Drop-seq with >= 1000 UMIs per cell',
                            name2 = 'Drop-seq only 100 cells',
                            method = 'pearson')

If bulk data is available then its correlation pwith the Drop-seq sample is directly assessed with the function compareSingleCellsAgainstBulk. Note that bulk has to be in a 1-column data.frame format with genes as the rownames and RPKM values (as default, raw counts are also accepted)

compareSingleCellsAgainstBulk(h.f1, 
                              log2(h.f2@dge[, 1, drop=F]+1))

Further exploration

It is instructive to assess the mitochondrial content of the single cell samples. This is done as below and the percentages can be sent for plotting

head(computeMitochondrialPercentage(h), 5)
plotMitochondrialContent(list(100-computeMitochondrialPercentage(h.f1),
                              100-computeMitochondrialPercentage(h.f3)),
                         log_scale = FALSE)

Dropbead offers also an implementation of the algorithm described in Macosko et. al. 2015 for classification of the cell cycle phases. The following R-packages xlsx, rJava and xlsxjars are required for this