## Options for Rmarkdown compilation knitr::opts_chunk$set( fig.width = 7, fig.height = 5, fig.align = "center", collapse = TRUE, crop = NULL ) ## Time the compilation timeStart <- Sys.time()
In this vignette, we will analyse the leduc2022_pSCoPE
dataset
using the scplainer approach. The
data were acquired using the nPOP acquisition protocole.
nPOP (Leduc et al. 2022)
is an upgrade of the SCoPE2 protocol (Specht et al. 2021
and Petelski et al. 2021),
where the mPOP sample preparation method is replaced by the nPOP method. nPOP processes
samples using the Cellenion dispensing device and uses DMSO as lysis
reagent instead of a freeze-thaw procedure. They also include the
prioritised data acquisition mode as described by
Huffman et al. 2022.
Single cells were labelled with TMT-18 and MS data were acquired in
prioritised DDA mode.
We rely on several packages to compile this vignette.
## Core packages library("scp") library("scpdata") ## Utility packages library("ggplot2") library("patchwork") library("ensembldb") library("EnsDb.Hsapiens.v86") library("dplyr") library("bluster") library("scater")
The data set is available from the scpdata
package.
leduc <- leduc2022_pSCoPE()
The data set primarily consists of melanoma cells and monocytes. The
data set also contains carrier samples, negative control samples,
reference samples and empty wells (Unused
).
table(leduc$SampleType)
The data were acquired using TMT-18 labelling.
levels(leduc$Channel)
The data were acquired as part of 134 MS acquisition batches.
length(unique(leduc$Set)) cat(head(leduc$Set), "...", tail(leduc$Set))
Finally, samples were prepared with the nPOP protocol using 2 glass
slides. The information is stored under lcbatch
.
table(leduc$lcbatch)
The minimal data processing workflow in scplainer consists of 5 main steps:
We remove the assays that were processed by the authors. The vignette only uses the data generated by Maxquant.
assaysToRemove <- c("peptides", "peptides_log", "proteins_norm2", "proteins_processed") leduc <- removeAssay(leduc, assaysToRemove)
We remove feature annotations that won't be used in the remainder of the vignette. This is to avoid overcrowding of the annotation tables later in the vignette.
requiredRowData <- c( "Sequence", "Leading.razor.protein.symbol", "Leading.razor.protein.id", "Reverse", "Potential.contaminant", "Leading.razor.protein", "PIF", "dart_qval" ) leduc <- selectRowData(leduc, requiredRowData)
We replace zeros by missing values. A zero may be a true (the feature
is not present in the sample) or because of technical limitations (due
to the technology or the computational pre-processing). Because we are
not able to distinguish between the two, zeros should be replaced with
NA
.
leduc <- zeroIsNA(leduc, i = names(leduc))
We remove low-quality PSMs that may propagate technical artefacts and bias data modelling. The quality control criteria are:
All the criteria were readily computed except for the single cell to
carrier ratio. We compute this using computeSCR()
. The results are
stored in the rowData
.
leduc <- computeSCR( leduc, names(leduc), colvar = "SampleType", samplePattern = "Mel|Macro", carrierPattern = "Carrier", sampleFUN = "mean", rowDataName = "MeanSCR" )
Here is an overview of the distributions of each criteria
df <- data.frame(rbindRowData(leduc, names(leduc))) df$ContaminantOrReverse <- df$Reverse == "+" & df$Potential.contaminant == "+" & grepl("REV|CON", df$Leading.razor.protein) ggplot(df) + aes(x = ContaminantOrReverse) + geom_bar() + ## PIF plot ggplot(df) + aes(x = PIF) + geom_histogram() + ## q-value plot ggplot(df) + aes(x = log10(dart_qval)) + geom_histogram() + ## mean SCR plot ggplot(df) + aes(x = log10(MeanSCR)) + geom_histogram()
We filter and keep the features that pass the quality control criteria.
leduc <- filterFeatures( leduc, ~ Reverse != "+" & Potential.contaminant != "+" & !grepl("REV|CON", Leading.razor.protein) & !is.na(PIF) & PIF > 0.6 & dart_qval < 0.01 & !is.na(MeanSCR) & MeanSCR < 0.05 )
Similarly to the features, we also remove low-quality cells. The quality control criteria are:
leduc <- countUniqueFeatures( leduc, i = names(leduc), groupBy = "Sequence", colDataName = "NumberPeptides" )
MedianIntensity <- lapply(experiments(leduc), function(x) { out <- colMedians(log(assay(x)), na.rm = TRUE) names(out) <- colnames(x) out }) names(MedianIntensity) <- NULL MedianIntensity <- unlist(MedianIntensity) colData(leduc)[names(MedianIntensity), "MedianIntensity"] <- MedianIntensity
leduc <- medianCVperCell( leduc, i = names(leduc), groupBy = "Leading.razor.protein.symbol", nobs = 3, na.rm = TRUE, colDataName = "MedianCV", norm = "SCoPE2" )
We plot the metrics used to perform sample quality control.
ggplot(data.frame(colData(leduc))) + aes( y = MedianIntensity, x = NumberPeptides, color = MedianCV, shape = SampleType ) + geom_point(size = 2) + scale_color_continuous(type = "viridis")
We apply the filter and keep only single cells that pass the quality control.
passQC <- !is.na(leduc$MedianCV) & leduc$MedianCV < 0.6 & leduc$MedianIntensity > 6 & leduc$MedianIntensity < 8 & leduc$NumberPeptides > 750 & grepl("Mono|Mel", leduc$SampleType) leduc <- subsetByColData(leduc, passQC)
For now, each MS acquisition run is stored separately in an assay. We here combine these assays in one. The issue is that PSMs are specific to each run. We therefore aggregate the PSMs into peptides.
peptideAssays <- paste0("peptides_", names(leduc)) leduc <- aggregateFeatures( leduc, i = names(leduc), fcol = "Sequence", name = peptideAssays, fun = colMedians, na.rm = TRUE )
Next to that, we must adapt the peptide to protein mapping. When
joining all assays, we will only keep the feature annotations that are
constant within each peptide. However, some peptide sequences map to one protein in
one run and to another protein in another run. Hence, the protein
name is not constant for all peptides and is removed during joining.
It is important we keep the protein sequence in the rowData
, in case
we later want to aggregate, model or infer protein level
quantification.
ppMap <- rbindRowData(leduc, i = grep("^pep", names(leduc))) %>% data.frame %>% group_by(Sequence) %>% ## The majority vote happens here mutate(Leading.razor.protein.symbol = names(sort(table(Leading.razor.protein.symbol), decreasing = TRUE))[1], Leading.razor.protein.id = names(sort(table(Leading.razor.protein.id), decreasing = TRUE))[1]) %>% dplyr::select(Sequence, Leading.razor.protein.symbol, Leading.razor.protein.id) %>% dplyr::filter(!duplicated(Sequence, Leading.razor.protein.symbol)) consensus <- lapply(peptideAssays, function(i) { ind <- match(rowData(leduc)[[i]]$Sequence, ppMap$Sequence) DataFrame(Leading.razor.protein.symbol = ppMap$Leading.razor.protein.symbol[ind], Leading.razor.protein.id = ppMap$Leading.razor.protein.id[ind]) }) names(consensus) <- peptideAssays rowData(leduc) <- consensus
The data can now be joined.
leduc <- joinAssays(leduc, i = peptideAssays, name = "peptides")
Finally, we also convert Uniprot protein identifiers to gene symbols.
proteinIds <- rowData(leduc)[["peptides"]]$Leading.razor.protein.id proteinConversionDf <- transcripts( EnsDb.Hsapiens.v86, columns = "gene_name", return.type = "data.frame", filter = UniprotFilter(proteinIds) ) matchedIndex <- match(proteinIds, proteinConversionDf$uniprot_id) geneName <- proteinConversionDf$gene_name[matchedIndex] rowData(leduc)[["peptides"]]$gene <- geneName
We log2-transform the quantification data.
leduc <- logTransform(leduc, i = "peptides", name = "peptides_log")
Here is an overview of the data processing:
plot(leduc)
Model the data using scplainer, the linear regression model implemented in scp
.
scplainer is applied on a SingleCellExperiment
so we extract it from
the processed data set along with the colData
sce <- getWithColData(leduc, "peptides_log")
First, we must specify which variables to include in the model. We here include 4 variables:
MedianIntensity
: this is the normalisation factor used to correct
for cell-specific technical differences.Channel
: this is used to correct for TMT effects.Set
: this is used to perform batch correction. We consider each
acquisition run to be a batch. SampleType
: this is the biological variable of interest. It
captures the difference between macrophages and monocytesscpModelWorkflow()
fits linear regression models to the data, where the
model is adapted for each peptide depending on its pattern of missing
values.
sce <- scpModelWorkflow( sce, formula = ~ 1 + ## intercept ## normalisation MedianIntensity + ## batch effects Channel + Set + ## biological variability SampleType )
Once the models are fit, we can explore the distribution of the n/p ratios.
scpModelFilterThreshold(sce) <- 3 scpModelFilterPlot(sce)
Many peptides do not have sufficient observations to estimate the model. We have chosen to continue the analysis with peptides that have $n/p >= 3$. You could consider $n/p$ a rough average of the number of replicates per parameter to fit (for categorical variables, the number of replicates per group). We recommend moving the threshold away from 1 to increase statistical power and remove noisy peptides. This comes of course at the cost of less peptides included in the analysis.
The model analysis consists of three steps:
The analysis of variance explores the proportion of data captures by each variable in the model.
(vaRes <- scpVarianceAnalysis(sce)) vaRes[[1]]
The results are a list of tables, one table for each variable. Each
table reports for each peptide the variance captures (SS
), the
residual degrees of freedom for estimating the variance (df
) and the
percentage of total variance explained (percentExplainedVar
). To
better explore the results, we add the annotations available in the
rowData
.
vaRes <- scpAnnotateResults( vaRes, rowData(sce), by = "feature", by2 = "Sequence" )
By default, we explore the variance for all peptides combined:
scpVariancePlot(vaRes)
We explore the top 20 peptides that have the highest percentage of variance explained by the biological variable (top) or by the batch variable (bottom).
scpVariancePlot( vaRes, top = 20, by = "percentExplainedVar", effect = "SampleType", decreasing = TRUE, combined = FALSE ) + scpVariancePlot( vaRes, top = 20, by = "percentExplainedVar", effect = "Set", decreasing = TRUE, combined = FALSE ) + plot_layout(ncol = 1, guides = "collect")
We can also group these peptide according to their protein. We simply
need to provide the fcol
argument.
scpVariancePlot( vaRes, top = 20, by = "percentExplainedVar", effect = "SampleType", decreasing = TRUE, combined = FALSE, fcol = "gene" ) + scpVariancePlot( vaRes, top = 20, by = "percentExplainedVar", effect = "Set", decreasing = TRUE, combined = FALSE, fcol = "gene" ) + plot_layout(ncol = 1, guides = "collect")
Next, we explore the model output to understand the differences
between melanoma cells and monocytes. The difference of interest is
specified using the contrast
argument. The first element points to
the variable to test and the two following element are the groups of
interest to compare. You can provide multiple contrast in a list.
(daRes <- scpDifferentialAnalysis( sce, contrast = list(c("SampleType", "Melanoma", "Monocyte")) )) daRes[[1]]
Similarly to analysis of variance, the results are a list of tables, one
table for each contrast. Each table reports for each peptide the
estimated difference between the two groups, the standard error
associated to the estimation, the degrees of freedom, the
t-statistics, the associated p-value and the p-value FDR-adjusted for
multiple testing across all peptides. Again, to better explore the
results, we add the annotations available in the rowData
.
daRes <- scpAnnotateResults( daRes, rowData(sce), by = "feature", by2 = "Sequence" )
We then visualize the results using a volcano plot. The function below return a volcano plot for each contrast.
scpVolcanoPlot(daRes)
To help interpretation of the results, we will label the peptides with their gene name. Also we increase the number of labels shown on the plot. Finally, we can add colors to the plot. For instance, let's explore the impact of the number of observations using the $n/p$ ratio. We create a new annotation table, add it to the results and redraw the plot. The $n/p$ ratio is retrieved using scpModelFilterNPRatio
np <- scpModelFilterNPRatio(sce) daRes <- scpAnnotateResults( daRes, data.frame(feature = names(np), npRatio = np), by = "feature" ) scpVolcanoPlot( daRes, top = 30, textBy = "gene", pointParams = list(aes(colour = npRatio)) )
As expected, higher number of observations (higher $n/p$) lead to increased statistical power and hence to more significant results.
Proteins such as VIM, LGALS3, CALU, LMNA, CTTN, are more abundant in melanoma cells compared to monocytes. On the other hand, proteins such as LCP1, CORO1A, ARHGDIB, TMSB4X are more abundant in monocytes compared to melanoma cells.
Finally, we offer functionality to report results at the protein level.
scpDifferentialAggregate(daRes, fcol = "gene") |> scpVolcanoPlot(top = 30, textBy = "gene")
Finally, we perform component analysis to link the modelled effects to the cellular heterogeneity. We here run an APCA+ (extended ANOVA-simultaneous principal component analysis) for the sample type effect. In other words, we perform a PCA on the data that is capture by the sample type variable along with the residuals (unmodelled data).
(caRes <- scpComponentAnalysis( sce, ncomp = 2, method = "APCA", effect = "SampleType" ))
The results are contained in a list with 2 elements. bySample
contains the PC scores, that is the component results in sample space.
byFeature
contains the eigenvectors, that is the component results
in feature space.
caRes$bySample
Each of the two elements contains components results
for the data before modelling (unmodelled
), for the residuals or for
the APCA on the sample type variable (APCA_SampleType
).
caRes$bySample[[1]]
Each elements is a table with the computed componoents. Let's explore the component analysis results in cell space. Similarly to the previous explorations, we annotate the results.
caResCells <- caRes$bySample sce$cell <- colnames(sce) caResCells <- scpAnnotateResults(caResCells, colData(sce), by = "cell")
We then generate the component plot, colouring by SampleType
. To
assess the impact of batch effects, we shape the points according to
the plate batch (cf intro) as well.
scpComponentPlot( caResCells, pointParams = list(aes(colour = SampleType, shape = lcbatch)) ) |> wrap_plots() + plot_layout(guides = "collect")
While the data before modelling is mainly driven by batch effects, the APCA clearly separates the two cell populations. Interestingly, the PCA on the residuals suggests that there is a small subpopulation that we did not model. We will explore this later during downstream analysis.
We use the same approach to explore the component results in feature space.
caResPeps <- caRes$byFeature caResPeps <- scpAnnotateResults( caResPeps, rowData(sce), by = "feature", by2 = "Sequence" )
We plot the compenents in peptide-space.
plCApeps <- scpComponentPlot( caResPeps, pointParams = list(size = 0.8, alpha = 0.4) ) wrap_plots(plCApeps)
We can also combine the exploration of the components in cell and peptide space. This is possible thanks to biplots.
biplots <- scpComponentBiplot( caResCells, caResPeps, pointParams = list(aes(colour = SampleType, shape = lcbatch)), labelParams = list(size = 1.5, max.overlaps = 20), textBy = "gene", top = 20 ) wrap_plots(biplots, guides = "collect")
Finally, we offer functionality to aggregate the results at the protein level instead of the peptide level.
caResProts <- scpComponentAggregate(caResPeps, fcol = "gene") biplots <- scpComponentBiplot( caResCells, caResProts, pointParams = list(aes(colour = SampleType, shape = lcbatch)), labelParams = list(size = 1.5, max.overlaps = 20), textBy = "gene", top = 20 ) wrap_plots(biplots, guides = "collect")
You can manually explore the data through an interactive interface
thanks to iSEE
:
library("iSEE") iSEE(sce)
The final step in this analyses in the identification of the small melanoma subpopulation. To achieve this, we perform an APCA+ with more components to capture most of the variability. Then, we cluster the cells to highlight the subpopulations. We then retrieve the batch corrected data. Finally, we run a new model on the melanoma data to find marker proteins that are strongly differentially abundant between the two cell subpopulations.
Before applying the cluster algorithm, we compute the 20 first APCA+ components on the sample type. We limit the PCA algorithm to 50 iterations to avoid an overly long run and will not compute the PCA for the unmodelled data and the residuals because we won't use it during the downstream analysis.
apcaSampleType <- scpComponentAnalysis( sce, ncomp = 20, method = "APCA", effect = "SampleType", residuals = FALSE, unmodelled = FALSE, maxiter = 50 )$bySample$APCA_SampleType
The resulting components are stored in the SingleCellExperiment
object.
pcCols <- grep("^PC", colnames(apcaSampleType)) reducedDim(sce, "APCA") <- apcaSampleType[, pcCols]
Next, we visualize these 20 components by further decreasing the dimensions using t-SNE.
set.seed(1234) sce <- runTSNE(sce, dimred = "APCA") plotTSNE(sce)
Finally, we cluster cells based on a shared nearest neighbour embedding followed by community detection using the Louvain algorithm.
algo <- SNNGraphParam(k = 30, type = "rank", cluster.fun = "louvain") sce$Cluster <- clusterRows( reducedDim(sce, "APCA"), BLUSPARAM = algo ) plotTSNE(sce, colour_by = "Cluster")
The clustering retrieved 3 cell population, more or less splitting the melanoma cells in 2 subpopulation. Note that $k$ was arbitrarily set to 30. A thorough analysis would optimize K based on an objective criterion, such as maximizing the average silhouette width.
Below is an overview of the unsupervised clusters against the known annotations provided by the authors after thorough investigation.
table( Cluster = sce$Cluster, Population = paste(sce$SampleType, sce$MelanomaSubCluster) )
We annotate the the unsupervised clusters using the available annotations.
sce$Cluster <- recode(sce$Cluster, "1" = "Monocyte", "2" = "Melanoma A", "3" = "Melanoma B")
Based on the model fitted in the previous section, we generate batch-corrected data, that is data with only the effect of cell type and the residual data. We also remove the intercept.
(scebr <- scpRemoveBatchEffect( sce, effects = c("Set", "Channel", "MedianIntensity"), intercept = TRUE ))
Note that the batch-corrected data still contain missing values.
To identify markers proteins, we run a differential abundance analysis as presented in the previous section. We create a new model, focusing only on melanoma data as we don't want to compare subpopulations with monocytes.
sceMel <- scebr[, scebr$SampleType == "Melanoma"] sceMel <- scpModelWorkflow(sceMel, formula = ~ 1 + Cluster)
We then apply the same procedure as described above.
daRes <- scpDifferentialAnalysis(sceMel, contrasts = list(c( "Cluster", "Melanoma A", "Melanoma B" ))) daRes <- scpAnnotateResults( daRes, rowData(sceMel), by = "feature", by2 = "Sequence" ) scpVolcanoPlot(daRes, top = 100, textBy = "gene")
Interestingly, subpopulation be is characterized by higher abundances of MATR3 and SFPQ (involved in response to DNA double-strand break), P4HB and PDIA3 (involved in disulfide bond rearrangement). On the other hand, melanoma subpopulation A shows more elongation factors (EEF1B3, EEF2, ), proteins associated with tubulin (TBCA, TUBB4B, TUBA1C, ...) or proliferation control proteins (PA2G4, ENO1). This is in line with previous findings from the authors.
knitr::opts_chunk$set( collapse = TRUE, comment = "", crop = NULL )
sessionInfo()
citation("scp")
This vignette is distributed under a CC BY-SA license license.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.