library(dropbead) library(data.table) library(xlsx) library(rJava) library(xlsxjars) library(Rsamtools) reads.by.cell <- data.frame(fread('zcat < /mydaten/projects/hek3t3/data/ds_013_50fix/out_readcounts.txt.gz')) dge.matrix=data.frame(fread("zcat < /mydaten/projects/hek3t3/data/ds_013_50fix/dge.txt.gz"), row.names = 1)
Dropbead
provides two functions to readily extract useful statistics from the
data. The first one, extractReadStatistics
, takes the BAM file as an argument
and computes the total number of (uniquely mapped) reads, the percentages
of intronic/intergenic etc. The output is a data.frame
as follows:
extractReadStatistics('/mydaten/projects/drosophila_brain/data/ds_032_WT_rep1/star_gene_exon_tagged_corrected.bam', yieldSize=10^6, sample.name='Drop-seq')
The read numbers are in million. The intronic, intergenic, coding and UTR reads
do not get a gene annotation tag (GE:
) and are discarded from the Drop-seq
pipeline. The last column counts the total number of barcodes seen in the data,
most of them being artifacts from sequencing errors.
Dropbead
provides a function to computationally estimate the number of cells
present in the sample, by calculating the inflection point of the cumulative
fraction of reads against the cell barcodes. Assuming that reads.by.cell
is the data.frame
with the reads per cell (the output of BAMTagHistogram
from the Drop-seq toolkit),
plotCumulativeFractionOfReads(reads.by.cell, cutoff = 10000, draw.infl.point = TRUE)
The following function creates a table with the basic downstream statistics, such as the cell number and the median number of reads, genes and UMIs per cell.
extractDownstreamStatistics('~/dropseq-data/', min.umi=250, read.stats=NULL)
The total number of UMIs in the DGE is given in millions.
The PCR column provides the reads to UMIs ratio, so that the smaller, the better.
Passing the data.frame
output of the extractReadStatistics
function into
the read.stats
argument provides a more precise calculation of the PCR
over-amplification bias.
Dropbead offers two main classes to store samples, the SingleSpeciesSample
class and the MixedSpeciesSample
class, the latter containing the former.
Mixed species experiments are important to estimate the number of doublets in
the samples. Dropbead assumes that a digital gene expression matrix (DGE)
has already been generated and the genes for the two species are separated
by a prefix. For mixed human/mouse samples the prefixes are hg_
and mm_
respectively. Assuming that the DGE has been loaded in dge.matrix
,
# The object containing the sample mms <- new("MixedSpeciesSample", species1="human", species2="mouse", dge=dge.matrix)
The number of cells and genes are stored and automatically updated while
using dropbead
's functions
length(mms@cells) # number of cells in the sample length(mms@genes) # number of genes detected
Some of the barcodes might correspond to the same cells (essentially after
using the DetectBeadSynthesisErrors
Drop-seq tool) and have to be collapsed.
Such cells are identified by
listCellsToCollapse(mms)
In the example above there are 5 pairs of cells which need to be collapsed. This is straightforward with using
mms <- collapseCellsByBarcode(mms)
We can verify that the number of cells is now the correct one
length(mms@cells)
Calling classifyCellsAndDoublets
separates the species and returns a data.frame
with
number of transcripts per cell per species. The threshold
controls the purity of the
resulting cells, while min.trans
is the minimum number of UMIs required to keep a cell.
head(classifyCellsAndDoublets(mms, threshold = 0.9, min.trans = 1000))
The output of classifyCellsAndDoublets
can be directly send for plotting
plotCellTypes(classifyCellsAndDoublets(mms, threshold = 0.9, min.trans = 5000))
The number of genes per cell is computed via
head(computeGenesPerCell(mms))
and similarly for transcripts (UMIs) with the computeTranscriptsPerCell
function.
Their output can be send directly for plotting and visualization
plotViolin(computeTranscriptsPerCell(mms), attribute = "UMIs") plotHistogram(computeTranscriptsPerCell(mms), attribute = 'UMIs')
The above functions are polymorphic and can be also used for SingleSpeciesSample
objects. Splitting the samples is always performed by internal functions automatically.
If the user wants to restrict to only one species, this is done by the
splitMixedSpeciesSampleToSingleSpecies
function. which returns a list of the two
SingleSpeciesSample
objects.
# Extracting the human cells as a separate sample for further analysis h <- splitMixedSpeciesSampleToSingleSpecies(mms, threshold = 0.9)[[1]] class(h)
There are a couple of functions to remove low quality cells and genes, such as
h.f1 <- keepBestCells(h, num.cells = 100) # keep only the top 100 cells h.f2 <- keepBestCells(h, min.num.trans = 1000) # keep cells with at least 1000 UMIs h.f3 <- removeLowQualityCells(h, min.genes = 2000) # remove cells which don't express at least 2000 genes h.f4 <- removeLowQualityGenes(h, min.cells = 3) # remove genes which are not expressed in at least 3 cells
with obvious usage.
Reproducibility and correlations of different samples are easily assessed
via the compareGeneExpressionLevels
function, for instance,
compareGeneExpressionLevels(h.f2, h.f1, name1 = 'Drop-seq with >= 1000 UMIs per cell', name2 = 'Drop-seq only 100 cells', method = 'pearson')
If bulk data is available then its correlation pwith the Drop-seq sample is
directly assessed with the function compareSingleCellsAgainstBulk
. Note that
bulk has to be in a 1-column data.frame
format with genes as the rownames
and RPKM values (as default, raw counts are also accepted)
compareSingleCellsAgainstBulk(h.f1, log2(h.f2@dge[, 1, drop=F]+1))
It is instructive to assess the mitochondrial content of the single cell samples. This is done as below and the percentages can be sent for plotting
head(computeMitochondrialPercentage(h), 5) plotMitochondrialContent(list(100-computeMitochondrialPercentage(h.f1), 100-computeMitochondrialPercentage(h.f3)), log_scale = FALSE)
Dropbead
offers also an implementation of the algorithm described in
Macosko et. al. 2015 for classification of the cell cycle phases. The following
R-packages xlsx
, rJava
and xlsxjars
are required for this
phases <- assignCellCyclePhases(h)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.