knitr::opts_chunk$set( echo = TRUE, message = FALSE, warning = FALSE )
The r BiocStyle::Biocpkg("speckle")
package contains functions to analyse
differences in cell type proportions in single cell RNA-seq data. As our
research into specialised analyses of single cell data continues we anticipate
that the package will be updated with new functions.
The analysis of single cell RNA-seq data consists of a large number of steps, which can be iterative and also depend on the research question. There are many R packages that can do some or most of these steps. The analysis steps are described here briefly.
Once the sequencing data has been summarised into counts over genes, quality control is performed to remove poor quality cells. Poor quality cells are often characterised as having very low total counts (library size) and very few genes detected. Lowly expressed and uninformative genes are filtered out, followed by appropriate normalisation. Dimensionality reduction and clustering of the cells is then performed. Cells that have similar transcriptional profiles cluster together, and these clusters (hopefully) correspond to something biologically relevant, such as different cell types. Differential expression between each cluster compared to all other clusters can highlight genes that are more highly expressed in each cluster. These marker genes help to determine the cell type each cluster corresponds to. Cell type identification is a process that often uses marker genes as well as a list of curated genes that are known to be expressed in each cell type. It is always helpful to visualise the data in a lot of different ways to aid in interpretation of the clusters using tSNE/UMAP plots, clustering trees and heatmaps of known marker genes.
library(devtools) install_github("/Oshlack/speckle")
In order to determine whether there are statistically significant compositional
differences between groups, there must be some form of biological replication in
the experiment. This is so that we can estimate the variability of the cell type
proportion estimates for each group. A classical statistical test for
differences between two proportions is typically very sensitive to small changes
and will almost always yield a significant p-value. Hence propeller
is only
suitable to use in single cell experiments where there are multiple groups and
multiple biological replicates in at least one of the groups. The absolute
minimum sample size is 2 in one group and 1 in the other group/s. Variance
estimates are obtained from the group with more than 1 biological replicate
which assumes that the cell type proportion variances estimates are similar
between experimental conditions.
The propeller
test is performed after initial analysis of the single cell data
has been done, i.e. after clustering and cell type assignment. The propeller
function can take a SingleCellExperiment
or Seurat
object and extract the
necessary information from the metadata. The basic model for propeller
is that
the cell type proportions for each sample are estimated based on the clustering
information provided by the user or extracted from the relevant slots in the
data objects. The proportions are then transformed using either an arcsin square
root transformation, or logit transformation. For
each cell type $i$, we fit a linear model with group as the explanatory variable
using functions from the R Bioconductor package r BiocStyle::Biocpkg("limma")
.
Using r BiocStyle::Biocpkg("limma")
to obtain p-values has the added benefit
of performing empirical Bayes shrinkage of the variances. For every cell type
we obtain a p-value that indicates whether that cell type proportion is
statistically significantly different between two (or more) groups.
library(speckle) library(SingleCellExperiment) library(CellBench) library(limma) library(ggplot2) library(scater) library(patchwork) library(edgeR) library(AnnotationDbi) library(org.Hs.eg.db)
We are using single cell data from the r BiocStyle::Biocpkg("CellBench")
package to illustrate how propeller
works. This is an artificial dataset that
is made up of an equal mixture of 3 different cell lines. There are three
datasets corresponding to three different technologies: 10x genomics, CelSeq and
DropSeq.
sc_data <- load_sc_data()
The way that propeller
is designed to be used is in the context of a designed
experiment where there are multiple biological replicates and multiple groups.
Comparing cell type proportions without biological replication should be done
with caution as there will be a large degree of variability in the cell type
proportions between samples due to technical factors (cell capture bias,
sampling, clustering errors), as well as biological variability. The
r BiocStyle::Biocpkg("CellBench")
dataset does not have biological
replication, so we will create several artificial biological replicates by
bootstrapping the data. Bootstrapping has the advantage that it induces
variability between bootstrap samples by sampling with replacement. Here we will
treat the three technologies as the groups, and create artifical biological
replicates within each group. Note that bootstrapping only induces sampling
variability between our biological replicates, which will almost certainly be
much smaller than biological variability we would expect to see in a real
dataset.
The three single cell experiment objects in sc_data
all have differing numbers
of genes. The first step is to find all the common genes between all three
experiments in order to create one large dataset.
commongenes1 <- rownames(sc_data$sc_dropseq)[rownames(sc_data$sc_dropseq) %in% rownames(sc_data$sc_celseq)] commongenes2 <- commongenes1[commongenes1 %in% rownames(sc_data$sc_10x)] sce_10x <- sc_data$sc_10x[commongenes2,] sce_celseq <- sc_data$sc_celseq[commongenes2,] sce_dropseq <- sc_data$sc_dropseq[commongenes2,] dim(sce_10x) dim(sce_celseq) dim(sce_dropseq) table(rownames(sce_10x) == rownames(sce_celseq)) table(rownames(sce_10x) == rownames(sce_dropseq))
This dataset does not have any biological replicates, so we will bootstrap
additional samples and pretend that they are biological replicates.
Bootstrapping won't replicate true biological variation between samples, but we
will ignore that for the purpose of demonstrating how propeller
works. Note
that we don't need to simulate gene expression measurements; propeller
only
uses cluster information, hence we simply bootstrap the column indices of the
single cell count matrices.
i.10x <- seq_len(ncol(sce_10x)) i.celseq <- seq_len(ncol(sce_celseq)) i.dropseq <- seq_len(ncol(sce_dropseq)) set.seed(10) boot.10x <- sample(i.10x, replace=TRUE) boot.celseq <- sample(i.celseq, replace=TRUE) boot.dropseq <- sample(i.dropseq, replace=TRUE) sce_10x_rep2 <- sce_10x[,boot.10x] sce_celseq_rep2 <- sce_celseq[,boot.celseq] sce_dropseq_rep2 <- sce_dropseq[,boot.dropseq]
The SingleCellExperiment
objects don't combine very easily, so I will create a
new object manually, and retain only the information needed to run propeller
.
sample <- rep(c("S1","S2","S3","S4","S5","S6"), c(ncol(sce_10x),ncol(sce_10x_rep2),ncol(sce_celseq), ncol(sce_celseq_rep2), ncol(sce_dropseq),ncol(sce_dropseq_rep2))) cluster <- c(sce_10x$cell_line,sce_10x_rep2$cell_line,sce_celseq$cell_line, sce_celseq_rep2$cell_line,sce_dropseq$cell_line, sce_dropseq_rep2$cell_line) group <- rep(c("10x","celseq","dropseq"), c(2*ncol(sce_10x),2*ncol(sce_celseq),2*ncol(sce_dropseq))) allcounts <- cbind(counts(sce_10x),counts(sce_10x_rep2), counts(sce_celseq), counts(sce_celseq_rep2), counts(sce_dropseq), counts(sce_dropseq_rep2)) sce_all <- SingleCellExperiment(assays = list(counts = allcounts)) sce_all$sample <- sample sce_all$group <- group sce_all$cluster <- cluster
Here I am going to use the Bioconductor package r BiocStyle::Biocpkg("scater")
to visualise the data. The r BiocStyle::Biocpkg("scater")
vignette goes
quite deeply into quality
control of the cells and the kinds of QC plots we like to look at. Here we will
simply log-normalise the gene expression counts, perform dimensionality
reduction (PCA) and generate PCA/TSNE/UMAP plots to visualise the relationships
between the cells.
sce_all <- scater::logNormCounts(sce_all) sce_all <- scater::runPCA(sce_all) sce_all <- scater::runUMAP(sce_all)
Plot PC1 vs PC2 colouring by cell line and technology:
pca1 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "cluster") + ggtitle("Cell line") pca2 <- scater::plotReducedDim(sce_all, dimred = "PCA", colour_by = "group") + ggtitle("Technology") pca1 + pca2
Plot UMAP highlighting cell line and technology:
umap1 <- scater::plotReducedDim(sce_all, dimred = "UMAP", colour_by = "cluster") + ggtitle("Cell line") umap2 <- scater::plotReducedDim(sce_all, dimred = "UMAP", colour_by = "group") + ggtitle("Technology") umap1 + umap2
For this dataset UMAP is a little bit of an overkill, the PCA plots show the relationships between the cells quite well. PC1 separates cells based on technology, and PC2 separates cells based on the cell line (clusters). From the PCA plots we can see that 10x is quite different to CelSeq and DropSeq, and the H2228 cell line is quite different to the remaining 2 cell lines.
In order to demonstrate propeller
I will assume that the cell line information
corresponds to clusters and all the analysis steps have beeen performed. Here we
are interested in testing whether there are compositional differences between
the three technologies: 10x, CelSeq and DropSeq. Since there are more than 2
groups, propeller
will perform an ANOVA to determine whether there is a
significant shift in the cell type proportions between these three groups.
The propeller
function can take a SingleCellExperiment
object or Seurat
object as input and extract the three necessary pieces of information from the
cell information stored in colData
. The three essential pieces of information
are
If these arguments are not explicitly passed to the propeller
function, then
these are extracted from the SingleCellExperiment
or Seurat
object. Upper
or lower case is acceptable, but
the variables need to be named exactly as stated in the list above. For a
Seurat
object, the cluster information is extracted from Idents(x)
.
The default of propeller is to perform the logit transformation:
# Perform logit transformation propeller(sce_all)
An alternative variance stabilising transformation is the arcsin square root transformation.
# Perform arcsin square root transformation propeller(sce_all, transform="asin")
The results from using the two different transforms are a little bit different, with the H1975 cell line being statistically significant using the arc sin square root transform, and not significant after using the logit transform.
Another option for running propeller
is for the user to supply the cluster,
sample and group information explicitly to the propeller
function.
propeller(clusters=sce_all$cluster, sample=sce_all$sample, group=sce_all$group)
The cell lines were mixed together in roughly equal proportions (~0.33) and hence we don't expect to see significant differences between the three clusters. However, because bootstrapping the samples doesn't incorporate enough variability between the samples to mimic true biological variability, we can see that the H1975 cluster looks significantly different between the three technologies. The proportion of this cell line is closer to 0.4 for CelSeq and DropSeq, and 0.34 for the 10x data.
In the r BiocStyle::Biocpkg("speckle")
package there is a plotting function
plotCellTypeProps
that takes a Seurat
or SingleCellExperiment
object,
extracts sample and cluster information and outputs a barplot of cell type
proportions between the samples. The user also has the option of supplying the
cluster and sample cell information instead of an R object. The output is a
ggplot2
object that the user can then manipulate however they please.
plotCellTypeProps(sce_all)
Alternatively, we can obtain the cell type proportions and transformed
proportions directly by running the getTransformedProps
function which takes
the cluster and sample information as input. The output from
getTransformedProps
is a list with the cell type counts, transformed
proportions and proportions as elements.
props <- getTransformedProps(sce_all$cluster, sce_all$sample, transform="logit") barplot(props$Proportions, col = c("orange","purple","dark green"),legend=TRUE, ylab="Proportions")
Call me old-school, but I still like looking at stripcharts to visualise results and see whether the significant p-values make sense.
par(mfrow=c(1,3)) for(i in 1:3){ stripchart(props$Proportions[i,]~rep(c("10x","celseq","dropseq"),each=2), vertical=TRUE, pch=16, method="jitter", col = c("orange","purple","dark green"),cex=2, ylab="Proportions") title(rownames(props$Proportions)[i]) }
If you are interested in seeing which models best fit the data in terms of the
cell type variances, there are two plotting functions that can do this:
plotCellTypeMeanVar
and plotCellTypePropsMeanVar
. For this particular
dataset it isn't very informative because there are only three cell "types"
and no biogical variability.
par(mfrow=c(1,1)) plotCellTypeMeanVar(props$Counts) plotCellTypePropsMeanVar(props$Counts)
If you are like me, you won't feel very comfortable with a black-box approach
where one function simply spits out a table of results. If you would like to
have more control over your linear model and include extra covariates then you
can fit a linear model in a more direct way using the transformed proportions
that can be obtained by running the getTransformedProps
function.
We have already obtained the proportions and transformed proportions when we ran
the getTransformedProps
function above. This function outputs a list object
with three elements: Counts
, TransformedProps
and Proportions
. These are
all matrices with clusters/cell types in the rows and samples in the columns.
names(props) props$TransformedProps
First we need to set up our sample information in much the same way we would if we were analysing bulk RNA-seq data. We can pretend that we have pairing information which corresponds to our original vs bootstrapped samples to make our model a little more complicated for demonstration purposes.
group <- rep(c("10x","celseq","dropseq"),each=2) pair <- rep(c(1,2),3) data.frame(group,pair)
We can set up a design matrix taking into account group and pairing information.
Please note that the way that propeller
has been designed is such that the
group information is always first in the design matrix specification, and
there is NO intercept. If you are new to design matrices and linear modelling, I
would highly recommend reading the r BiocStyle::Biocpkg("limma")
manual, which
is incredibly extensive and covers a variety of different experimental set ups.
design <- model.matrix(~ 0 + group + pair) design
In our example, we have three groups, 10x, CelSeq and DropSeq. In order to
perform an ANOVA to test for cell type composition differences between these
3 technologies, we can use the propeller.anova
function. The coef
argument
tells the function which columns of the design matrix correspond to the groups
we are interested in testing. Here we are interested in the first three columns.
propeller.anova(prop.list=props, design=design, coef = c(1,2,3), robust=TRUE, trend=FALSE, sort=TRUE)
Note that the p-values are smaller here than before because we have specified a pairing vector that states which samples were bootstrapped and which are the original samples.
If we were interested in testing only 10x versus DropSeq we could alternatively
use the propeller.ttest
function and specify a contrast that tests for this
comparison with our design matrix.
design mycontr <- makeContrasts(group10x-groupdropseq, levels=design) propeller.ttest(props, design, contrasts = mycontr, robust=TRUE, trend=FALSE, sort=TRUE)
Finally note that the robust
and trend
parameters are parameters for the
eBayes
function in r BiocStyle::Biocpkg("limma")
. When robust = TRUE
,
robust empirical Bayes shrinkage of the variances is performed which mitigates
the effects of outlying observations. We set trend = FALSE
as we don't expect
a mean-variance trend after performing our variance stabilising transformation.
There may also be an error when trend
is set to TRUE because there are often
not enough data points to estimate the trend.
Let us assume that we expect that the different technologies have a meaningful
ordering to them, and we would like to find the cell types that are increasing
or decreasing along this trend. In more complex scenarios beyond group
comparisons I would recommend taking the transformed proportions from the
getTransformedProps
function and using the linear model fitting functions from
the r BiocStyle::Biocpkg("limma")
package directly.
Let us assume that the ordering of the technologies is 10x -> celseq -> dropseq. Then we can recode them 1, 2, 3 and treat the technologies as a continuous variable. Obviously this scenario doesn't make much sense biologically, but we will continue for demonstration purposes.
group dose <- rep(c(1,2,3), each=2) des.dose <- model.matrix(~dose) des.dose fit <- lmFit(props$TransformedProps,des.dose) fit <- eBayes(fit, robust=TRUE) topTable(fit)
Here the log fold changes are reported on the transformed data, so they are not as easy to interpret directly. The positive logFC indicates that the cell type proportions are increasing (for example for H1975), and a negative logFC indicates that the proportions are decreasing across the ordered technologies 10x -> celseq -> dropseq.
You can get the estimates from the model on the proportions directly by fitting
the same model to the proportions. Here the logFC
is the slope of the trend
line on the proportions, and the AveExpr
is the average of the proportions
across all technologies.
fit.prop <- lmFit(props$Proportions,des.dose) fit.prop <- eBayes(fit.prop, robust=TRUE) topTable(fit.prop)
You could plot the continuous variable dose
against the proportions and add
trend lines, for example.
fit.prop$coefficients par(mfrow=c(1,3)) for(i in 1:3){ plot(dose, props$Proportions[i,], main = rownames(props$Proportions)[i], pch=16, cex=2, ylab="Proportions", cex.lab=1.5, cex.axis=1.5, cex.main=2) abline(a=fit.prop$coefficients[i,1], b=fit.prop$coefficients[i,2], col=4, lwd=2) }
What I recommend in this instance is using the p-values from the analysis on the transformed data, and the reported statistics (i.e. the coefficients from the model) obtained from the analysis on the proportions for visualisation purposes.
If you have a random effect that you would like to account for in your analysis,
for example repeated measures on the same individual, then you can use the
duplicateCorrelation
function from the r BiocStyle::Biocpkg("limma")
.
For illustration purposes, let us assume that pair
indicates samples taken
from the same individual (or they could represent technical replicates), and we
would like to account for this in our analysis
using a random effect. Again, we fit our models on the transformed proportions
in order to obtain the p-values.
We will formulate the design matrix with an intercept for this example, and test
the differences in technologies relative to 10x. The block
parameter will be
the pair
variable. Note that the design matrix now does not include pair
as
a fixed effect.
des.tech <- model.matrix(~group) dupcor <- duplicateCorrelation(props$TransformedProps, design=des.tech, block=pair) dupcor
The consensus correlation is quite high (r dupcor$consensus.correlation
),
which we expect because we bootstrapped these additional samples.
# Fitting the linear model accounting for pair as a random effect fit1 <- lmFit(props$TransformedProps, design=des.tech, block=pair, correlation=dupcor$consensus) fit1 <- eBayes(fit1) summary(decideTests(fit1)) # Differences between celseq vs 10x topTable(fit1,coef=2) # Differences between dropseq vs 10x topTable(fit1, coef=3)
For celseq vs 10x, H1975 and H2228 are significantly different, with a greater proportion of H1975 cells detected in celseq, and fewer H2228 cells. For dropseq vs 10x, there is a higher proportion of H1975 cells.
If you want to do an ANOVA between the three groups:
topTable(fit1, coef=2:3)
Generally, you can perform any analysis on the transformed proportions that you
would normally do when using limma (i.e. on roughly normally distributed data).
For more complex random effects models with 2 or more random effects, you can
use the `r BiocStyle::Biocpkg("lme4")
package.
The experimental groups are likely to contribute large sources of variation in
the data. In the r BiocStyle::Biocpkg("CellBench")
data the technology effect
is larger than the cell line effect. In order to cluster the data to produce
meaningful cell types that will then feed into meaningful tests for proportions,
the cell types should be represented in as many samples as possible. Users
should consider using integration techniques on their single cell data
prior to clustering, integrating on biological sample or perhaps experimental
group. See methods such as Harmony, Liger and Seurat's integration technique
for more information.
Cell type label assignment should not be too refined such that every sample has
many unique cell types. The propeller
function can handle proportions of 0 and
1 without breaking, but it is not very meaningful if every cell type difference
is statistically significant. Consider testing cell type categories that are
broader for more meaningful results, perhaps by combining clusters that are
highly similar. The refined clusters can always be explored in terms of gene
expression differences later on.
It may be appropriate to perform cell type assignment using classification methods rather than clustering. This allows the user to classify cells into known cell types, but you may run the risk of missing novel cell types. A combination of approaches may be necessary depending on the dataset. Good luck. The more heterogenous the dataset, the more tricky this becomes.
Another approach is to model the counts directly using statistical models that
can take into account biological variability, such as negative binomial.
You can read the
OSCA bookchapter
on how to use the r BiocStyle::Biocpkg("edgeR")
package to do this.
A common quality control check in bulk RNA-seq is to check the sex of the samples. The simplest method to do this is to check expression of XIST, which is the gene responsible for X-inactivation. It is highly expressed in females, and not expressed in males. In experiments where the sex of the samples has not been recorded, the variation due to sex can often be captured by the top principal component in an MDS or PCA plot. It is important to know if the samples are a mix of males and females and to take this information into account in downstream analysis.
It turns out that for single cell data, it is not a simple matter to classify cells as male or female. The main reason for this is that the cells are much more lowly sequenced compared to bulk RNA-seq samples resulting in low or zero counts for many genes, including XIST and other X- and Y-chromosome genes. Thus simply trying to classify cells as male or female based on observed counts for XIST leave many cells unable to be classified.
There are a few reasons for wanting to classify male and female cells. First, it could form part of the analysis assessing quality of the cells, and if sex is not information that is easily available, it is an additional variable we can predict from the gene expression. This may then inform further analysis of the data, by allowing us to take sex into account in the analysis.
We have built a classifier to predict the sex of each cell based on logistic
regression for human and mouse cells. The input is the matrix of counts where
the genes are the rows and the cells are the columns. The rownames of the counts
matrix needs to be gene symbol. The allcounts
data object contains the counts
for all the cells used in the propeller
function, but the rownames are ENSEMBL
IDs. The first step is thus converting the ENSEMBL IDs to gene symbol.
allcounts[1:5,1:5] nc <- normCounts(allcounts, log=TRUE) avgexp <- rowMeans(nc) o <- order(avgexp, decreasing = TRUE) allcounts2 <- allcounts[o,] allcounts2[1:5,1:5] ann <- AnnotationDbi::select(org.Hs.eg.db, keys=rownames(allcounts2), columns=c("ENSEMBL","SYMBOL"), keytype="ENSEMBL") m <- match(rownames(allcounts2), ann$ENSEMBL) symbol <- ann$SYMBOL[m] table(duplicated(symbol)) allcounts2 <- allcounts2[!duplicated(symbol) & !is.na(symbol),] rownames(allcounts2) <- symbol[!duplicated(symbol) & !is.na(symbol)] table(duplicated(colnames(allcounts2))) colnames(allcounts2) <- paste(colnames(allcounts2),1:ncol(allcounts2), sep=".") table(duplicated(rownames(allcounts2)))
Now that the counts matrix is in the correct format we can call the
classifySex
function.
sex <- classifySex(allcounts2, genome="Hs", qc=FALSE) table(sex$prediction)
The cell lines were all derived from females, so the sex of the cells is correctly classified as female.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.