```r
knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>"
)
```
\newpage
library("data.table") library("SingleCellExperiment")
Single-cell transcriptome sequencing data are subject to substantial technical variation and batch effects that can confound the classification of cellular sub-types. Unfortunately, current clustering algorithms do not account for this uncertainty. To address this shortcoming, we have developed a noise perturbation algorithm called BEARscc that is designed to determine the extent to which classifications by existing clustering algorithms are robust to observed technical variation.
BEARscc makes use of spike-in measurements to model technical variance as a function of gene expression and technical dropout effects on lowly expressed genes. In our benchmarks, we found that BEARscc accurately models read count fluctuations and drop-out effects across transcripts with diverse expression levels. Applying our approach to publicly available single-cell transcriptome data of mouse brain and intestine, we have demonstrated that BEARscc identified cells that cluster consistently, irrespective of technical variation. For more details, see the manuscript on bioRxiv.
Importantly, BEARscc should not be considered another clustering algorithm. Rather, this package is designed to supply users with a tool to evaluate and explore the impact of noise on their single cell cluster interpretations. The package provides users with a way to clarify the precision of a single cell experiment with respect to grouping cells into clusters that are biologically meaningful. In this way, BEARscc allows users to gain confidence in clusters and in cells that consistently cluster together regardless of the noise inherent to a single cell experiment. Conversely, the algorithm provides a mechanism to identify cells and clusters that cannot be resolved given the precision of the experiment in conjunction with the clustering algorithm of choice. It is our hope that BEARscc will enable users to proceed with clusters, or biological groups, that they are confident are robust to noise, and with an intimate understanding of which cells and clusters may be less precisely assigned to a putative biological role.
BEARscc is now available on Bioconductor and can be installed using the syntax below.
```r
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("BEARscc")
```
BEARscc and its associated manuscript are currently under review for publication at a peer-reviewed journal. For now, please cite the bioRxiv pre-print:
Severson, DT., Owen, RP., White, MJ., Lu, X., Schuster-Boeckler, B. BEARscc determines robustness of single-cell clusters using simulated technical replicates. doi: https://doi.org/10.1101/118919
\newpage
BEARscc relies upon spike-in count measurements in single-cell transcriptome experiments to estimate experimental noise and produce simulated technical replicates, providing a quantitative understanding of how robust proposed single cell cluster labels are to experimental noise. In principle, the algorithm is compatible with any clustering algorithm. The following should provide users with a comprehensive tutorial of the use and utility of BEARscc as a tool for vetting single cell clusters with respect to experimental noise.
Before getting started, we need to load some example single cell data. BEARscc is equipped with a set of sample data for the purpose of testing functions, examples in help files, and this nifty tutorial. The data may be loaded as follows:
library("BEARscc") data("BEARscc_examples")
The loaded file `BEARscc_examples` is equipped with separate `data.frame` objects including ERCC spike-in observations (`ERCC.counts.df`), endogenous count observations (`data.counts.df`), and the expected or actual spike-in concentrations (`ERCC.meta.df`), as well as a `SingleCellExperiment` object that contains all of the above separate components (`BEAR_examples.sce`), as shown below:
```r
head(ERCC.counts.df[,1:2])
head(data.counts.df[,1:2])
head(ERCC.meta.df)
BEAR_examples.sce
```
In the event we were working with a new set of data, the spike-in concentrations `data.frame` can be computed from industry-reported concentrations and the relevant dilution protocol utilized in the experiment. Count tables would need to be mapped and counted with preferred software, and the spike-in control counts (ERCC or otherwise) would need to be identified from the endogenous counts. A sketch of this computation is shown below.
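As a hedged illustration (the transcript identifiers, stock concentrations, dilution factor, and per-sample volume below are hypothetical placeholders, not values from this package), the expected number of spike-in molecules per sample can be derived from vendor-reported stock concentrations:

```r
## Hypothetical example: expected spike-in molecules per sample from
## vendor-reported stock concentrations and the dilution protocol.
avogadro  <- 6.02214076e23                      # molecules per mole
stock.df  <- data.frame(
    TranscriptID     = c("ERCC-00002", "ERCC-00003"),
    attomoles_per_ul = c(15000, 1875))          # placeholder stock values
dilution  <- 1/40000                            # placeholder dilution factor
volume_ul <- 0.009                              # placeholder volume per sample
stock.df$expected_molecules <- stock.df$attomoles_per_ul *
    1e-18 * avogadro * dilution * volume_ul
stock.df
```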
Below is how one would create a `SingleCellExperiment` object from spike-in count, endogenous count, and spike-in concentration `data.frame` objects. Note how we create the `observed_expression` assay object in the following code. This object is essential in that all estimation and simulation of replicates occurs assuming these are the observed counts (normalized or otherwise). Without them, BEARscc will throw an error indicating that `"observed_expression"` is not in `names(assays(<SingleCellExperiment>))`. Also, `data.counts.df` in this example includes both spike-in genes and endogenous genes. In general, we recommend spike-in genes be simulated with endogenous genes as a control, and this will be done by default when the `SingleCellExperiment` object is used.
```r
BEAR.se <- SummarizedExperiment(list(counts=as.matrix(data.counts.df)))
BEAR_examples.sce <- as(BEAR.se, "SingleCellExperiment")
metadata(BEAR_examples.sce) <- list(spikeConcentrations=ERCC.meta.df)
assay(BEAR_examples.sce, "observed_expression") <- counts(BEAR_examples.sce)
altExp(BEAR_examples.sce, "ERCC_spikes") <- BEAR_examples.sce[
    grepl("^ERCC-", rownames(BEAR_examples.sce)),]
```
Running the above code should yield a `SingleCellExperiment` object identical to the one that loads with `data("BEARscc_examples")`.
We will now estimate the single-cell noise present in the experiment using spike-in controls. In this tutorial, we rely upon a subsample of artificial control data found in `BEARscc_examples`; however, users are encouraged to work through the tutorial with their own single cell data, provided some form of spike-ins was included in the experiment. Building the noise models with BEARscc is relatively straightforward with `estimate_noiseparameters()`. We simply provide the function with the now adequately annotated `SingleCellExperiment` object, `BEAR_examples.sce`. Here, the parameter `alpha_resolution` is set to 0.1 to speed things along, but we suggest values between 0.001 and 0.01 be used in real applications of BEARscc.
```r
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    max_cumprob=0.9999, alpha_resolution=0.1, bins=10,
    write.noise.model=FALSE, file="BEAR_examples")
```
Several options exist for `estimate_noiseparameters()`. These are fully documented in the help page, `?estimate_noiseparameters`.
Following estimation of noise, the computed parameters are used to generate simulated technical replicates. Here we will simulate replicates on our local computer, but frequently users will want to utilize the methods described in Section 2.4. Notably, the necessary parameters are conveniently stored in the `metadata` of our `SingleCellExperiment` object, `BEAR_examples.sce`, following noise estimation, and so we simply run:
```r
BEAR_examples.sce <- simulate_replicates(BEAR_examples.sce,
    max_cumprob=0.9999, n=10)
```
Recall that `BEAR_examples.sce` is our `SingleCellExperiment` object that we recently annotated with model parameters describing experimental noise using the function `estimate_noiseparameters()`, and note that the variable `n` is the desired number of simulated technical replicates (e.g. 10). Finally, `max_cumprob` is identical to its use in the noise estimation; if the user deviated from the default parameter, it is highly recommended that this value be identical to the value utilized in `estimate_noiseparameters()`. The resulting object is a list, where each element is a simulated technical replicate, and one element is the original counts matrix.
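A minimal sketch of inspecting the result; the `simulated_replicates` metadata slot is the same one used by the tutorial code further below:

```r
## The simulated technical replicates are stored in the metadata of the
## SingleCellExperiment object, alongside the original counts matrix.
replicate.list <- metadata(BEAR_examples.sce)$simulated_replicates
length(replicate.list)   # simulated replicates plus the original counts
head(names(replicate.list))
```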
For larger datasets, we set `write.noise.model=TRUE` when running `estimate_noiseparameters()` and copy the written Bayesian drop-out and noise estimate files with the observed counts table to a high performance computing environment. The following code provides an example:
```r
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    write.noise.model=TRUE, file="tutorial_example",
    model_view=c("Observed", "Optimized"))
```
After running the above code, within the current working directory (if unsure, use `getwd()`) we should find the two tab-delimited files that together completely describe the BEARscc noise model. These are the parameters describing the mixed model of technical variation (`tutorial_example_parameters4randomize.xls`, see Section 3.1.1) and the parameters describing the drop-out model (`tutorial_example_bayesianestimates.xls`, see Section 3.1.2). In addition, we should find our `"observed_expression"` matrix in the form of `tutorial_example_counts4clusterperturbation.xls`. Note that the `xls` extension just allows users to quickly open these tab-delimited files in Microsoft Excel if desired, but they can be readily viewed on the terminal or in simple text editors as well.
With the original counts file and noise model prepared, we then copy these files to our high performance compute cluster. The following code provides a sense of how to proceed once these files have been copied to a high performance cluster; however, the job submission structure of each user's environment will dictate the precise syntax for the following procedure.
The script `HPC_generate_noise_matrices` contains an analogous `simulate_replicates()` for a high performance computational node. To utilize these functions for simulating technical replicates on a cluster, please install BEARscc on the relevant cluster. The user should write an R script to load the BEARscc library and run the clustering. The following code provides a suggested format for both calling the R script with a bash job script and the relevant R invocation of BEARscc; it may also be found as a stand-alone script in `inst/example/HPC_run_example.R`.
Our cluster utilizes a job submission format that interacts seamlessly with bash code; therefore, `$SGE_TASK_ID` represents an array id for jobs, conveniently generating 100 simulated technical replicates in a single job array. In any case, this variable should be treated as the index for the simulated technical replicate, as we recommend from experience that users generate 50 to 100 such simulated technical replicates to reach a stable noise consensus matrix solution.
The following bash code could be included in one such job script:
```{bash run_sim_rep_HPC, eval=FALSE, include=TRUE}
Rscript --vanilla HPC_run_example.R $SGE_TASK_ID
```
Note that the file `HPC_run_example.R` contains the following suggested code to run BEARscc (the drop-out and spike-in parameter tables are read into the variable names that `HPC_simulate_replicates()` expects):

```r
library("BEARscc")
library("data.table")   # for fread()

#### Load data ####
ITERATION <- commandArgs(trailingOnly=TRUE)[1]
counts.df <- read.delim("tutorial_example_counts4clusterperturbation.xls")
# filter out zero counts to speed up the algorithm
counts.df <- counts.df[rowSums(counts.df)>0,]
dropout_parameters <- fread("tutorial_example_bayesianestimates.xls")
spikein_parameters <- fread("tutorial_example_parameters4randomize.xls")

#### Simulate replicates ####
counts.error <- HPC_simulate_replicates(counts_matrix=counts.df,
    dropout_parameters=dropout_parameters,
    spikein_parameters=spikein_parameters)
write.table(counts.error, file=paste("simulated_replicates/",
    paste(ITERATION, "sim_replicate_counts.txt", sep="_"), sep=""),
    quote=FALSE, row.names=TRUE)
#########
```
The script generates separate simulated technical replicate files, which can be loaded into R as a list for clustering or, in the case of more computationally intense clustering algorithms, re-clustered individually in a high performance compute environment.
After generating simulated technical replicates, these should be re-clustered using the clustering method applied to the original dataset. For simplicity, here we use hierarchical clustering on a Euclidean distance metric to identify two clusters. In our experience, some published clustering algorithms are sensitive to cell order, so we suggest scrambling the order of cells for each noise iteration, as we do below in the function `recluster()`. The function below will serve our purposes for this tutorial, but BEARscc may be used with any clustering algorithm (e.g. SC3, RaceID2, or BackSPIN).
To quickly recluster a list of simulated technical replicates, we define a reclustering function:
```r
recluster <- function(x) {
    x <- data.frame(x)
    scramble <- sample(colnames(x), size=length(colnames(x)), replace=FALSE)
    x <- x[,scramble]
    clust <- hclust(dist(t(x), method="euclidean"), method="complete")
    clust <- cutree(clust, 2)
    data.frame(clust)
}
```
First, we need to determine how the observed data cluster under our algorithm of choice (e.g. `recluster()`, which is a simple hierarchical clustering for illustrative purposes). These are the clusters that can be evaluated by BEARscc and compared to BEARscc meta-clustering:
```r
OG_clusters <- recluster(data.frame(assay(BEAR_examples.sce,
    "observed_expression")))
colnames(OG_clusters) <- "Original"
```
We then apply the function `recluster()` to all simulated technical replicates and the original counts matrix, and manipulate the list into a `data.frame`.
```r
cluster.list <- lapply(metadata(BEAR_examples.sce)$simulated_replicates,
    `recluster`)
clusters.df <- do.call("cbind", cluster.list)
colnames(clusters.df) <- names(cluster.list)
```
Note: if running clustering algorithms on a separate high performance cluster, the user should retrieve the labels and format them as a `data.frame` of cluster labels, where the last column must be the original cluster labels derived from the observed count data. As an example, examine the file `inst/example/example_clusters.tsv`; a sketch of this bookkeeping follows.
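As a hedged sketch (the directory name and file pattern below are hypothetical and depend on how the user's HPC jobs wrote their output), the per-replicate label files could be assembled as follows:

```r
## Hypothetical assembly of cluster labels written by separate HPC jobs:
## one label file per simulated replicate, original labels appended last.
label.files <- list.files("simulated_replicates",
    pattern="clusters.txt$", full.names=TRUE)    # placeholder pattern
label.list  <- lapply(label.files, function(f) read.delim(f, row.names=1))
clusters.df <- do.call("cbind", label.list)
clusters.df$Original <- OG_clusters$Original     # original labels last
```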
Using the cluster labels file generated by clustering simulated technical replicates on a high performance compute environment, or with `recluster()` as described above, we can generate a noise consensus matrix as shown below. Three rows and columns of an example noise consensus matrix are displayed for illustration:
```r
noise_consensus <- compute_consensus(clusters.df)
head(noise_consensus[,1:3], n=3)
```
Using the `aheatmap()` function in the `NMF` library, we can visualize the consensus matrix resulting from 10 simulated replicates by BEARscc.
To reproduce the plot, run:
library("NMF") aheatmap(noise_consensus, breaks=0.5)
Although 10 simulated replicates is sparse (we recommend 50 to 100), we can already see that these samples likely belong to a single cluster. Indeed, these samples were prepared from a single ground truth of dilute whole-tissue brain RNA-seq data from two batches of experimental data. If desired, we could annotate the above heatmap with relevant metadata concerning sample batch, origin, etc., as sketched below.
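A minimal sketch of such an annotation (the batch labels here are hypothetical placeholders; `annCol` is the column-annotation argument of `aheatmap()`):

```r
## Hypothetical batch annotation for the consensus heatmap columns.
batch.annotation <- data.frame(Batch=factor(
    rep(c("batch_1", "batch_2"), length.out=ncol(noise_consensus))))
aheatmap(noise_consensus, breaks=0.5, annCol=batch.annotation)
```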
In order to further interpret the noise consensus, we have defined three cluster (and analogous cell) metrics. Stability indicates the propensity for a putative cluster to contain the same cells across noise-injected counts matrices. Promiscuity indicates a tendency for cells in a putative cluster to associate with other clusters across noise-injected counts matrices. Score represents the promiscuity subtracted from the stability.
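To make these definitions concrete, here is a hedged sketch of computing stability for a single cluster directly from the consensus matrix; the cluster membership is a hypothetical placeholder, and `report_cluster_metrics()` below performs the full calculation for all clusters:

```r
## Stability of one hypothetical cluster: the mean association frequency
## among its member cells, excluding self-association on the diagonal.
cluster_cells <- colnames(noise_consensus)[1:3]   # placeholder membership
sub <- noise_consensus[cluster_cells, cluster_cells]
stability <- mean(sub[upper.tri(sub)])
stability
```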
We have found it useful to inform the optimal number of clusters, in terms of resilience to noise, by examining these metrics upon cutting hierarchical clustering dendrograms of the noise consensus and comparing the results to the original clustering labels. To do this, create a vector containing each number of clusters one wishes to examine (the function automatically determines the results for the dataset as a single cluster) and then cluster the consensus with various cluster numbers using `cluster_consensus()`:
```r
vector <- seq(from=2, to=5, by=1)
BEARscc_clusts.df <- cluster_consensus(noise_consensus, vector)
```
We add the original clustering to the `data.frame` to evaluate its robustness as well as the suggested BEARscc clusters:
```r
BEARscc_clusts.df <- cbind(BEARscc_clusts.df, Original=OG_clusters)
```
Now we can compute cluster metrics for each of the BEARscc cluster number scenarios and the original clustering; indeed, any cluster labels of the user's choosing could be supplied to vet against the information provided by the noise consensus. We accomplish this by running the command and displaying the first 5 rows of the resulting melted `data.frame`:
```r
cluster_scores.df <- report_cluster_metrics(BEARscc_clusts.df,
    noise_consensus, plot=FALSE)
head(cluster_scores.df, n=5)
```
Above is a melted `data.frame` that displays the name of each cluster, the size of each cluster, the metric (Score, Promiscuity, Stability), the value of each metric for the respective cluster and clustering, the clustering in question (1, 2, ..., Original), whether the cluster consists of only one cell, and finally the mean of each metric across all clusters in a clustering.
In the previous example, we display all metrics for generating 1 cluster from the data, given the previously computed noise consensus from 10 simulated technical replicates, and the same for the score metric when generating 2 clusters from the data. Importantly, by definition, 1 cluster scenarios have a promiscuity value of 0. Observe that the score for the cluster in the 1 cluster scenario is much larger than that of either cluster in the 2 cluster scenario, and that this is reflected in the average clustering column.
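As a quick sanity check, the mean score per clustering can also be tabulated directly from the melted `data.frame` (column names as in the output above); this is a sketch in base R, not a package function:

```r
## Mean cluster score for each clustering scenario.
score.df <- cluster_scores.df[cluster_scores.df$Metric=="Score",]
aggregate(Value ~ Clustering, data=score.df, FUN=mean)
```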
We can examine the BEARscc metrics for the original clustering using `ggplot2`, by subsetting the `data.frame` to the original clustering and the score metric:
library("ggplot2") library("cowplot") original_scores.df<-cluster_scores.df[ cluster_scores.df$Clustering=="Original",] ggplot(original_scores.df[original_scores.df$Metric=="Score",], aes(x=`Cluster.identity`, y=Value) )+ geom_bar(aes(fill=`Cluster.identity`), stat="identity")+ xlab("Cluster identity")+ylab("Cluster score")+ ggtitle("Original cluster scores")+guides(fill=FALSE)
We can see that this initial clustering is terrible, which makes sense given the ground truth consists of a single biological entity. The stability and promiscuity metrics bear this out:
ggplot(original_scores.df[original_scores.df$Metric=="Stability",], aes(x=`Cluster.identity`, y=Value) )+ geom_bar(aes(fill=`Cluster.identity`), stat="identity")+ xlab("Cluster identity")+ylab("Cluster stability")+ ggtitle("Original cluster stability")+guides(fill=FALSE)
The high stability exhibited in the example above is not surprising, as samples in this example should have strong association with one another. However, the promiscuity below reveals the reason for the low score:
ggplot(original_scores.df[original_scores.df$Metric=="Promiscuity",], aes(x=`Cluster.identity`, y=Value) )+ geom_bar(aes(fill=`Cluster.identity`), stat="identity")+ xlab("Cluster identity")+ylab("Cluster promiscuity")+ ggtitle("Original cluster promiscuity")+guides(fill=FALSE)
Despite the high stability, the samples within each cluster have high association with cells in the other cluster, which results in a high promiscuity reported from the noise consensus. As a net result, the scores in the original 2 cluster case for each cluster are subpar. Again, this is consistent with the ground truth of the example data.
Completely analogous to cluster metrics, the extent to which cells belong within a given cluster may be evaluated with respect to the noise consensus. Below we demonstrate how to compute the cell metrics and display 4 cells for illustrative purposes.
```r
cell_scores.df <- report_cell_metrics(BEARscc_clusts.df, noise_consensus)
head(cell_scores.df, n=4)
```
The output is a melted `data.frame` that displays the name of each cluster to which the cell belongs, the cell label, the size of each cluster, the metric (Score, Promiscuity, Stability), the value for each metric, and finally the clustering in question (1, 2, ..., Original).
As with cluster metrics, these results can be plotted to visualize cells in the context of the original clusters using `ggplot2`. The score metric is plotted below to illustrate this:
```r
original_cell_scores.df <- cell_scores.df[
    cell_scores.df$Clustering=="Original",]
ggplot(original_cell_scores.df[original_cell_scores.df$Metric=="Score",],
    aes(x=factor(`Cluster.identity`), y=Value))+
    geom_jitter(aes(color=factor(`Cluster.identity`)), stat="identity")+
    xlab("Cluster identity")+ylab("Cell scores")+
    ggtitle("Original clustering cell scores")+guides(color=FALSE)
```
While BEARscc certainly does not claim to provide a definitive solution to the question of the number of clusters, we provide what we believe to be a useful perspective on the matter. Specifically, we have found that, when examining the cluster metrics across various hierarchical clusterings of the BEARscc noise consensus, the cluster number $k$ with the highest score tends to resemble ground truth more closely than simple gene sampling or related algorithms alone, as evidenced by experiments in control samples, \emph{C. elegans}, and murine brain data (see our manuscript on bioRxiv). Importantly, this only provides a heuristic for determining the cluster number $k$ that takes into account the inherent noise of the single cell experiment. The heuristic is dependent upon the clustering algorithm of choice (the better the utilized clustering algorithm, the better the BEARscc $k$ heuristic) and represents a form of "meta-clustering" rather than a new way to determine the number of clusters, $k$, per se. As an illustration of utilizing this heuristic, we plot the average cluster score values for various cluster number scenarios below:
```r
ggplot(cluster_scores.df[cluster_scores.df$Metric=="Score",],
    aes(x=`Clustering`, y=Value))+
    geom_boxplot(aes(fill=Clustering, color=`Clustering`))+
    xlab("Clustering")+ylab("Cluster score distribution")+
    ggtitle("Original cluster scores for each clustering")+
    guides(fill=FALSE, color=FALSE)
```
As we can see, the 1 cluster scenario provides the best cluster score as determined by the noise consensus. As mentioned previously, the single cluster solution resembles the biological ground truth.
\newpage
In order to simulate technical replicates, BEARscc first builds a statistical
model of expression variance and drop-out frequency, which is dependent only
on observed gene expression. The parameters of this model are estimated from
spike-in counts. Expression-dependent variance is approximated by fitting
read counts of each spike-in transcript across cells to a mixture model
comprised of a Poisson and negative binomial distribution (Section 3.1.1).
The drop-out model (Section 3.1.2) in BEARscc has two distinct parts:
the \emph{drop-out injection distribution} models the likelihood that a
given transcript concentration will result in a drop-out, and the
\emph{drop-out recovery distribution} models the likelihood that an observed
drop-out resulted from a given transcript concentration. The drop-out
injection distribution is taken to be the observed drop-out rate in spike-in
controls as a function of actual spike-in transcript concentration.
This distribution is then used to estimate the drop-out recovery distribution
density via Bayes' theorem and an empirically informed set of priors and
assumptions. Briefly, BEARscc utilizes the drop-out injection distribution
and the number of observed zeroes for each endogenous gene to infer a
gene-specific probability distribution describing the likelihood that an
observed drop-out should in fact have been some non-zero value, given
the drop-out rate of the endogenous gene. This entire process is facilitated
by the function `estimate_noiseparameters()`.
In the second step, BEARscc generates simulated technical replicates by
applying the models described in the first step (Section 3.2). For every
observed count in the range of values where drop-outs occurred amongst the
spike-in transcripts, BEARscc uses the drop-out injection distribution from
Step 1 to determine whether to convert the count to zero. For observations
where the count is zero, the drop-out recovery distribution is used to
estimate a new value, given the overall drop-out frequency for the gene
(Section 3.2). Next, BEARscc substitutes all values larger than zero with a
value generated from the derived model of expression variability,
parameterized to the observed count for that gene. This procedure can then be
repeated any number of times to generate a collection of simulated technical
replicates. This step is carried out by create_noiseinjected_counts()
.
In the third step, the simulated technical replicates are then re-clustered,
using exactly the same method as for the observed data; this re-clustering
for each simulated technical replicate is described as an
\emph{association matrix} where each element indicates whether two cells
share a cluster identity (1) or cluster apart from each other (0).
The association matrices for each simulated technical replicate are averaged
to form a noise consensus matrix that can be easily interpreted (Section 3.3).
This is accomplished with the function `compute_consensus()`. Each element of the noise consensus matrix represents the fraction of simulated technical replicates that, upon applying the clustering method of choice, resulted in two cells clustering together (the \emph{association frequency}). Then, the functions `report_cell_metrics()` and `report_cluster_metrics()` may be used to explore and quantitate the noise consensus matrix at the cell sample and cluster levels, respectively.
As mentioned previously, BEARscc uses spike-ins to estimate the noise of the experiment for the purpose of producing simulated technical replicates. BEARscc models overall technical variation with a mixture-model (Section 3.1.1) and inferred drop-out effects (Section 3.1.2) independently, using the spike-in observations. However, a single function in BEARscc, `estimate_noiseparameters()`, accomplishes this task.
Technical variance is modeled in BEARscc by fitting a single-parameter mixture model, $Z(c)$, to the spike-ins' observed count distributions. The noise model is fit independently for each spike-in transcript and subsequently regressed onto spike-in mean expression to define a generalized noise model. This is accomplished in three steps:
\begin{enumerate}
\item Define a mixture model composed of \emph{Poisson} and \emph{negative binomial} random variables:
\begin{equation}
Z \sim (1-\alpha)\,Pois(\mu)+\alpha\,NBin(\mu,\sigma)
\label{eq:mixture_model}
\end{equation}
\item Empirically fit the parameter, $\alpha_{i}$, in a spike-in specific mixture-model, $Z_{i}$, to the observed distribution of counts for each ERCC spike-in transcript, $i$, where $\mu_i$ and $\sigma_i$ are the observed mean and variance of the given spike-in. The parameter, $\alpha_{i}$, is chosen such that the error between the observed and mixture-model distributions is minimized.
\item Generalize the mixture-model by regressing the $\alpha_{i}$ parameters and the observed variance, $\sigma_i$, onto the observed spike-in mean expression, $\mu_i$. Thus, the mixture model describing the noise observed in ERCC transcripts is defined solely by $\mu$, which is treated as the count transformation parameter, $c$, in the generation of simulated technical replicates.
\end{enumerate}
In step 2, a mixture model distribution is defined for each spike-in, $i$: \begin{equation} Z_{i}(\alpha_{i},\mu_{i},\sigma_{i})\sim (1-\alpha_{i})Pois(\mu_i )+ \alpha_{i}NBin(\mu_i,\sigma_i ). \end{equation} The distribution, $Z_i$, is fit to the observed counts of the respective spike-in, where $\alpha_i$ is an empirically fitted parameter, such that the $\alpha_i$ minimizes the difference between the observed count distribution of the spike-in and the respective fitted model, $Z_i$. Specifically, for each spike-in transcript, $\mu_i$ and $\sigma_i$ are taken to be the mean and standard deviation, respectively, of the observed counts for spike-in transcript, $i$. Then, $\alpha_i$ is computed by empirical parameter optimization; $\alpha_i$ is taken to be the $\alpha_{i,j}$ in the mixture-model, \begin{equation} Z_{i,j}(\alpha_{i,j},\mu_i,\sigma_i)\sim (1-\alpha_{i,j}) Pois(\mu_i )+\alpha_{i,j}NBin(\mu_i,\sigma_i ), \end{equation} found to have the least absolute total difference between the observed count density and the density of the fitted model, $Z_i$. In the case of ties, the minimum $\alpha_{i,j}$ is chosen.
In step 3, $\alpha(c)$ is then defined with a linear fit, $\alpha_i=\alpha\log_2(\mu_i)+b$. $\sigma(c)$ is similarly defined, $\log_2(\sigma_i) = \alpha\log_2(\mu_i)+b$. In this way, the observed distribution of counts in spike-in transcripts defines the single-parameter mixture-model, $Z(c)$, used to transform counts during generation of simulated technical replicates:
\begin{equation} Z(c) \sim (1-\alpha(c))Pois(c)+\alpha(c)NBin(c,\sigma(c)) \label{eq:noisefunction}\end{equation}
During technical replicate simulation, the parameter $c$ is set to the observed count value, $a$, and the transformed count in the simulated replicate is determined by sampling a single value from $Z(c=a)$.
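The sampling step can be sketched in a few lines of R; this is a hedged illustration in which the linear-fit coefficients for $\alpha(c)$ and $\sigma(c)$ are hypothetical placeholders, not values estimated by the package:

```r
## Sketch: draw one transformed count from the mixture model Z(c).
## alpha_coef and sigma_coef stand in for the regressed fits described above.
sample_Z <- function(cnt, alpha_coef=c(a=0.02, b=0.1),
                     sigma_coef=c(a=1.1, b=0.5)) {
    alpha <- min(max(alpha_coef["a"]*log2(cnt)+alpha_coef["b"], 0), 1)
    sigma <- 2^(sigma_coef["a"]*log2(cnt)+sigma_coef["b"])
    size  <- cnt^2/max(sigma^2-cnt, 1e-8)  # NB size from mean and variance
    if (runif(1) < alpha) {
        rnbinom(1, size=size, mu=cnt)      # negative binomial component
    } else {
        rpois(1, cnt)                      # Poisson component
    }
}
sample_Z(25)                               # one simulated count for c = 25
```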
A model of the drop-outs is developed by BEARscc in order to inform the permutation of zeros during noise injection. The observed zeros in spike-in transcripts as a function of actual transcript concentration and Bayes' theorem are used to define two models: the \emph{drop-out injection distribution} and the \emph{drop-out recovery distribution}.
The drop-out injection distribution is described by $Prob(X=0 | Y=y)$, where $X$ is the distribution of observed counts and $Y$ is the distribution of actual transcript counts; the density is computed by regressing the fraction of zeros observed in each sample, $D_i$, for a given spike-in, $i$, onto the expected number of spike-in molecules in the sample, $y_i$, e.g. $D=a*y+b$. Then, $D$ describes the density of zero-observations conditioned on actual transcript number, $y$, i.e. $Prob(X=0 | Y=y)$. Notably, each gene is treated with an identical density distribution for drop-out injection.
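A hedged sketch of this regression (the input `data.frame`, its column names, and its values are hypothetical; the package performs this fit internally):

```r
## Hypothetical input: one row per spike-in, with its expected molecule
## count and the observed fraction of zero counts across samples.
spikein.zeros.df <- data.frame(
    expected_molecules=c(1, 2, 4, 8, 16),
    fraction_zero=c(0.90, 0.70, 0.45, 0.20, 0.05))
injection.fit <- lm(fraction_zero ~ expected_molecules,
    data=spikein.zeros.df)
## D(y) = Prob(X=0 | Y=y), clamped to valid probabilities
prob_dropout <- function(y) pmax(pmin(predict(injection.fit,
    newdata=data.frame(expected_molecules=y)), 1), 0)
prob_dropout(3)
```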
In contrast, the density of the drop-out recovery distribution, $Prob(Y_j=y | X_j=0)$, is specific to each gene, $j$, where $X_j$ is the distribution of the observed counts and $Y_j$ is the distribution of actual transcript counts for a given gene. The gene-specific drop-out recovery distribution is inferred from drop-out injection distribution using Bayes' theorem and a prior. This is accomplished in 3 steps:
\begin{enumerate}
\item For the purpose of applying Bayes' theorem, the gene-specific distribution, $Prob(X_j=0 | Y_j=y)$, is taken to be the drop-out injection density for all genes, $j$.
\item The probability that a specific transcript count is present in the sample, $Prob(Y_j=y)$, is a necessary, but empirically unknowable, prior. Therefore, the prior is defined using the law of total probability, an assumption of uniformity, and the probability that a zero was observed in a given gene, $Prob(X_j=0)$. The probability, $Prob(X_j=0)$, is taken to be the fraction of observations that are zero for a given gene. BEARscc does this in order to better inform the density estimation of the gene-specific drop-out recovery distribution.
\item The drop-out recovery distribution density is then computed by applying Bayes' theorem:
\begin{equation} Prob(Y_j=y | X_j=0)=\frac{Prob(X_j=0 | Y_j=y)*Prob(Y_j=y)}{Prob(X_j=0)}, \label{eq:bayesianeq}\end{equation}
\end{enumerate}
In the second step, the law of total probability, an assumption of uniformity, and the fraction of zero observations in a given gene are leveraged to define the prior, $Prob(Y_j=y)$. First, a threshold of the expected number of transcripts, $k$ in $Y$, is chosen such that $k$ is the maximum value for which the drop-out injection density is non-zero. Next, uniformity is assumed for all expected numbers of transcripts, $y$, greater than zero and less than or equal to $k$; that is, $Prob(Y_j=y)$ is defined to be some constant probability, $n$. Furthermore, $Prob(Y_j=y)$ is defined to be 0 for all $y>k$. In order to inform $Prob(Y_j=y)$ empirically, $Prob(Y_j=0)$ and $n$ are derived by imposing the law of total probability (Equation \ref{eq:totalprob}) and unity (Equation \ref{eq:unity}), yielding a system of equations:
\begin{equation} Prob(X_j=0)=\Sigma_{y=0}^{k-1}{ (Prob(X_j=0 | Y_j=y)*Prob(Y_j=y))} \label{eq:totalprob}\end{equation}
\begin{equation}
\Sigma_{y=0}^{k-1}{ Prob(Y_j=y)}=Prob(Y_j=0)+(k-1)*{n}=1
\label{eq:unity}\end{equation}
The probability that a zero is observed given there are no transcripts in the sample, $Prob(X_j=0 | Y_j=0)$, is assumed to be 1. With the preceding assumption, solving for $Prob(Y_j=0)$ and $n$ gives:
\begin{equation} n=\frac{1-Prob(Y_j=0)}{k-1} \label{eq:nsolved}\end{equation}
\begin{equation}
Prob(Y_j=0)=\frac{Prob(X_j=0)-\frac{1}{k-1}\Sigma_{y=1}^{k-1} Prob(X_j=0 | Y_j=y)}{1-\frac{1}{k-1}\Sigma_{y=1}^{k-1} Prob(X_j=0 | Y_j=y)}
\label{eq:py0solved}
\end{equation}
In this way, $Prob(Y_j=y)$ is defined by Equation \ref{eq:nsolved} for $y$ in $Y_j$ less than or equal to $k$ and greater than zero, and by Equation \ref{eq:py0solved} for $y$ in $Y_j$ equal to zero. For $y$ in $Y_j$ greater than $k$, the prior $Prob(Y_j=y)$ is defined to be equal to zero.
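A small worked sketch of this system of equations in R, reusing the hypothetical `prob_dropout()` from the sketch above; the threshold `k` and the gene's zero fraction are placeholder values:

```r
## Solve the two prior equations (nsolved, py0solved) for one gene.
k    <- 5                            # placeholder drop-out threshold
p_x0 <- 0.70                         # placeholder zero fraction for the gene
inj  <- prob_dropout(1:(k-1))        # Prob(X_j=0 | Y_j=y) for y = 1..k-1
mean_inj <- sum(inj)/(k-1)
p_y0 <- (p_x0 - mean_inj)/(1 - mean_inj)   # Prob(Y_j=0)
n    <- (1 - p_y0)/(k-1)                   # uniform prior mass n
c(p_y0=p_y0, n=n)
```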
In the third step, the previously computed prior, $Prob(Y_j=y)$, the fraction of zero observations in a given gene, $Prob(X_j=0)$, and the drop-out injection distribution, $Prob(X_j=0 | Y_j=y)$, are utilized to estimate, with Bayes' theorem, the density of the drop-out recovery distribution, $Prob(Y_j=y | X_j=0)$. During the generation of simulated technical replicates, for zero observations and count observations less than or equal to $k$, values are sampled from the drop-out recovery and injection distributions, as described in the pseudocode of the BEARscc algorithm for simulating technical replicates.
Simulated technical replicates are generated from the noise mixture-model and the two drop-out models. For each gene, the count value of each sample is systematically transformed using the mixture-model, $Z(c)$, and the drop-out injection, $Prob(X=0 | Y=y)$, and recovery, $Prob(Y_j=y | X_j=0)$, distributions in order to generate simulated technical replicates, as indicated by the following pseudocode:
```
FOR EACH gene, j
    FOR EACH count, c
        IF c == 0
            n <- SAMPLE one count, y, from Prob(Y_j = y | X_j = 0)
            IF n == 0
                c <- 0
            ELSE
                c <- SAMPLE one count from Z(n)
            ENDIF
        ELSE IF c <= k
            dropout <- TRUE with probability Prob(X = 0 | Y = c)
            IF dropout == TRUE
                c <- 0
            ELSE
                c <- SAMPLE one count from Z(c)
            ENDIF
        ELSE
            c <- SAMPLE one count from Z(c)
        ENDIF
        RETURN c
    DONE
DONE
```
\newpage
`estimate_noiseparameters()`

Estimates the drop-out model and technical variance from spike-ins present in the sample. For greater detail, please see the help file, `?estimate_noiseparameters`.
```r
data(BEARscc_examples)
BEAR_examples.sce <- estimate_noiseparameters(BEAR_examples.sce,
    alpha_resolution=0.25, write.noise.model=FALSE)
BEAR_examples.sce
```
The resulting output of `estimate_noiseparameters()` is a long list, which is enumerated in the function's package help page. The above usage is for execution of `simulate_replicates` on a local machine. To save results as files for use with `prepare_probabilities` on a high performance computing environment, use:
```r
estimate_noiseparameters(BEAR_examples.sce,
    write.noise.model=TRUE, alpha_resolution=0.25,
    file="noise_estimation", model_view=c("Observed", "Optimized"))
```
`simulate_replicates()`

Computes BEARscc simulated technical replicates from the noise parameters previously estimated with the function `estimate_noiseparameters()`. For greater detail, please see the help file, `?simulate_replicates`.
```r
data(analysis_examples)
BEAR_examples.sce <- simulate_replicates(BEAR_examples.sce, n=3)
BEAR_examples.sce
```
The resulting object is a list of counts data, where each element of the list is a `data.frame` of the counts representing a BEARscc simulated technical replicate. For further details, refer to the function help page. This function is the in-package analog of the high performance computing function, `prepare_probabilities`.
`HPC_simulate_replicates()`

The high performance computing analog of `simulate_replicates()`. Please refer to Section 2.4.
The resulting objects would normally be output to tab-delimited files, where each file results from a `data.frame` of the counts representing a BEARscc simulated technical replicate. This function has no help file, but is referred to in Section 2.4 of this document on simulating replicates for larger datasets.
`compute_consensus()`

Computes the consensus matrix using a `data.frame` of cluster labels across different BEARscc simulated technical replicates. The consensus matrix is a visual and quantitative representation of the clustering variation on a cell-by-cell level, created by using cluster labels to compute the number of times any given pair of cells associates in the same cluster; this forms the 'noise consensus matrix'. Each element of this matrix represents the fraction of simulated technical replicates in which two cells cluster together (the 'association frequency'), after using a clustering method of the user's choice to generate a `data.frame` of clustering labels. This consensus matrix may be used to compute BEARscc metrics at both the cluster and cell level.
For greater detail, please see the help file, `?compute_consensus`.
data("analysis_examples") noise_consensus <- compute_consensus(clusters.df) noise_consensus
When the number of samples is \emph{n}, the noise consensus resulting from this function is an \emph{n} x \emph{n} matrix describing the fraction of simulated technical replicates in which each cell of the experiment associates with every other cell.
`cluster_consensus()`

This function performs hierarchical clustering on the noise consensus matrix, allowing the user to investigate the appropriate number of clusters, \emph{k}, considering the noise within the experiment. Frequently, one will want to assess multiple possible cluster numbers at once. In this case, it is recommended that one use `lapply` in conjunction with a vector of all biologically reasonable cluster numbers to identify the optimal cluster number.

For greater detail, please see the help file, `?cluster_consensus`.
```r
data(analysis_examples)
vector <- seq(from=2, to=5, by=1)
BEARscc_clusts.df <- cluster_consensus(consensus_matrix=noise_consensus,
    vector)
BEARscc_clusts.df
```
The output is a vector of cluster labels based on hierarchical clustering of the noise consensus. In the event that a vector is supplied for the number of clusters in conjunction with `lapply`, the output is a `data.frame` of the cluster labels for each of the various numbers of clusters deemed biologically reasonable by the user.
`report_cell_metrics()`
To quantitatively evaluate the results, three metrics are calculated from the noise consensus matrix: 'stability' is the average frequency with which a cell within a cluster associates with other cells within the same cluster across simulated replicates; 'promiscuity' measures the average association frequency of a cell within a cluster with the \emph{n} cells outside of the cluster with the strongest association with the cell in question; and 'score' is the difference between 'stability' and 'promiscuity'. Importantly, 'score' reflects the overall "robustness" of a given cell's assignment to a user-provided cluster label with respect to technical variance. Together these metrics provide a quantitative measure of the extent to which cluster labels provided by the user are invariant across simulated technical replicates.
For greater detail, please see the help file, `?report_cell_metrics`.
```r
data(analysis_examples)
cell_scores.dt <- report_cell_metrics(BEARscc_clusts.df, noise_consensus)
cell_scores.dt
```
A melted `data.frame` describing the BEARscc metrics for each cell, where the columns are enumerated in the help file.
`report_cluster_metrics()`
To quantitatively evaluate the results, three metrics are calculated from the noise consensus matrix: 'stability' is the average frequency with which cells within a cluster associate with each other across simulated replicates; 'promiscuity' measures the association frequency between cells within a cluster and those outside of it; and 'score' is the difference between 'stability' and 'promiscuity'. Importantly, 'score' reflects the overall "robustness" of a cluster to technical variance. Together these metrics provide a quantitative measure of the extent to which cluster labels provided by the user are invariant across simulated technical replicates.
For greater detail, please see the help file, `?report_cluster_metrics`.
```r
data(analysis_examples)
cluster_scores.dt <- report_cluster_metrics(BEARscc_clusts.df,
    noise_consensus, plot=TRUE, file="example")
cluster_scores.dt
```
A melted `data.frame` describing the BEARscc metrics for each cluster, where the columns are enumerated in the help file.
\newpage
Within the package there are data subsampled from a single cell sequencing protocol applied to water samples containing ERCC spike-ins (blanks) and dilute RNA from brain whole tissue (brain), discussed at length in a manuscript on bioRxiv.
`BEARscc_examples`
A toy dataset for applying BEARscc functions as described in the README on https://bitbucket.org/bsblabludwig/bearscc.git. These data are a subset of observations made by Drs. Michael White and Richard Owen in the Xin Lu Lab. Samples were sequenced by the Wellcome Trust Center for Genomics, Oxford, UK. These data are available in full with GEO accession number GSE95155.
For greater detail, please see the help file, `?BEARscc_examples`.
data("BEARscc_examples")
`analysis_examples`
BEARscc downstream example objects: the `analysis_examples` Rdata object contains downstream data objects for use in various help pages for dynamic execution, resulting from running the tutorial in the README and vignette on `BEARscc_examples`. The objects are a result of applying BEARscc functions as described in the README found at https://bitbucket.org/bsblabludwig/bearscc.git.
For greater detail, please see the help file, `?analysis_examples`.
data("analysis_examples")
\newpage
This software is made available under the terms of the GNU General Public License v3.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.