knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL
    ## cf https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

Introduction

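In this tutorial, we'll walk you through the process of modelling single-cell proteomics (SCP) data using the scplainer approach (@Vanderaa2023-scplainer). By the end of this vignette, you will be able to model SCP data with linear regression, filter the peptides to model, explore the model through analysis of variance, differential abundance analysis and component analysis, and generate batch-corrected data.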

The last point will allow you to generate SCP data that is suitable for downstream analysis, such as clustering or trajectory inference. The figure below provides a roadmap of the workflow:

Figure: scplainer modelling workflow

The vignette will start with the processed data extracted as a SingleCellExperiment object from a processed QFeatures object. We will not cover data processing as it is covered in another vignette.

library("scp")
library("SingleCellExperiment")
library("patchwork")
library("ggplot2")

Example data set

The example data set is a subset of the leduc2022_pSCoPE data set (see ?scpdata::leduc2022_pSCoPE for more info). The data were acquired using TMT-18 multiplexing and data-dependent acquisition (DDA), and were processed using a minimal workflow.

We suggest using this minimal processing workflow, although the approach presented here is agnostic of previous processing and allows for other custom workflows. The data processing was conducted with QFeatures and scp.

data("leduc_minimal")
leduc_minimal

The data set is formatted as a SingleCellExperiment object. The data set consists of 200 peptides and 73 cells. Peptide annotations can be retrieved from the rowData and cell annotations can be retrieved from the colData. The cell annotation will be used during modelling.

A full reanalysis of Leduc's nPOP dataset is also available here.

Data modelling

The core of the approach relies on statistical modelling of the data using linear regression. Under the hood, the model takes as input the intensity matrix stored in assay(leduc_minimal). The cell annotations are retrieved using colData(leduc_minimal). They describe known technical and biological variables that may influence the acquired peptide intensities. The annotations are used to build a regression model with $p$ parameters. Then, the model estimates the coefficients. Coefficients provide the contribution of each parameter to the expression of each peptide, as well as the uncertainty of the estimation. These will be explored in the following sections.
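
To make the regression idea concrete, here is a minimal base R sketch for a single simulated peptide. It is not the scp implementation: the data are simulated, the model is fitted with stats::lm(), and the object names (annot, fit, p) are introduced only for illustration.

```r
## Simulated annotations: 2 runs x 2 cell types x 5 cells (toy data)
set.seed(1)
annot <- expand.grid(
    rep = 1:5,
    Set = c("run1", "run2"),
    SampleType = c("Melanoma", "Monocyte")
)
annot$MedianIntensity <- rnorm(nrow(annot), mean = 10, sd = 0.5)
## Simulated log-intensities for ONE peptide:
## run shift + cell type shift + loading effect + noise
annot$y <- 8 +
    0.5 * (annot$Set == "run2") +
    1.5 * (annot$SampleType == "Monocyte") +
    0.8 * (annot$MedianIntensity - 10) +
    rnorm(nrow(annot), sd = 0.3)
## The design matrix has p columns, one per coefficient to estimate
X <- model.matrix(~ 1 + Set + MedianIntensity + SampleType, data = annot)
p <- ncol(X)
## Fit the model and extract the estimates with their standard errors
fit <- lm(y ~ 1 + Set + MedianIntensity + SampleType, data = annot)
coefTable <- coef(summary(fit))[, c("Estimate", "Std. Error")]
coefTable
```

Each row of coefTable is one of the p coefficients: the estimate is the contribution of that parameter to the peptide intensity and the standard error quantifies the uncertainty of the estimate.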

We'll start by defining the variables to include in the model. Recall that the example data set contains TMT-labeled cells. This means that each MS acquisition run contains multiple cells. Each run is subject to technical fluctuations that can lead to undesired variation, which is known as a batch effect.

table(leduc_minimal$Set)

The labelling reagent (Channel) can also lead to undesired systematic effects and will also be considered as a source for batch effects.

table(leduc_minimal$Channel)

In addition, each cell is processed individually and the amount of peptide material recovered from each cell may lead to undesired variation as well. This issue is usually solved through normalization, such as subtracting the median intensity from each cell. Normalization was intentionally omitted in the minimal data processing so that we can account for it during modelling. The median intensities were already computed (MedianIntensity).

hist(leduc_minimal$MedianIntensity, breaks = 10)
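
For comparison, the conventional median normalization mentioned above can be sketched in base R as follows. This is only an illustration on a made-up matrix of log-intensities; the vignette does not apply this step and instead passes the median intensity to the model as a covariate.

```r
## Toy log-intensity matrix: 4 peptides (rows) x 3 cells (columns)
x <- matrix(
    c(10, 11, 12, NA,
      12, 13, NA, 15,
      9, 10, 11, 12),
    nrow = 4,
    dimnames = list(paste0("pep", 1:4), paste0("cell", 1:3))
)
## Median intensity of each cell, ignoring missing values
med <- apply(x, 2, median, na.rm = TRUE)
## Subtract each cell's median from all its peptide intensities
xNorm <- sweep(x, 2, med, FUN = "-")
xNorm
```

After this step every cell has a median of zero, which removes differences in the total amount of material per cell but ignores the other variables handled by the model.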

Finally, the biological variable of interest in the example data set is the cell type, which is known because the cells come from two cell lines (SampleType).

table(leduc_minimal$SampleType)

We create a formula object that defines which variables must be modelled in our analysis.

f <- ~ 1 + ## intercept
    Channel + Set + ## batch variables
    MedianIntensity + ## normalization
    SampleType ## biological variable

Note that the formula can be adapted to the data set. For instance, no labelling reagent is used in LFQ experiments, so Channel can be dropped. Similarly, each cell in an LFQ experiment is acquired in its own run, so the MS run cannot be used as a batch effect variable. The day of acquisition could be used instead.
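
As an illustration, a formula for a hypothetical LFQ data set could look like the sketch below. AcquisitionDay is an assumed column name that does not exist in leduc_minimal; replace it with the relevant column of your own colData.

```r
## Hypothetical formula for an LFQ experiment.
## `AcquisitionDay` is an assumed annotation, not part of the example data.
fLfq <- ~ 1 + ## intercept
    AcquisitionDay + ## batch variable replacing Channel and Set
    MedianIntensity + ## normalization
    SampleType ## biological variable
## List the cell annotations the model would require
all.vars(fLfq)
```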

Once a model is defined, we fit it with scpModelWorkflow().

leduc_minimal <- scpModelWorkflow(leduc_minimal, formula = f)

You can always retrieve the formula that was used to fit the model using scpModelFormula().

scpModelFormula(leduc_minimal)

The data modelled by each variable are contained in the so-called effect matrices.

scpModelEffects(leduc_minimal)

Similarly, the data that could not be captured by the model are contained in the residual matrix.

scpModelResiduals(leduc_minimal)[1:5, 1:5]

Finally, the input data used to fit the model can also be retrieved.

scpModelInput(leduc_minimal)[1:5, 1:5]

Note that the number of peptides changed. This is the consequence of peptide filtering.

dim(scpModelInput(leduc_minimal))

Peptide filtering

The proportion of missing values for each feature is high in single-cell proteomics data.

table(missing = is.na(assay(leduc_minimal)))

For many features, the model typically has more coefficients to estimate than there are observed values. The model cannot be estimated for these features, which are ignored in further steps. They are identified by computing the ratio between the number of observed values (n) and the number of coefficients to estimate (p), which we call the n/p ratio. You can extract the n/p ratio for each feature:

head(scpModelFilterNPRatio(leduc_minimal))
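
The n/p ratio can be illustrated by hand on a toy matrix. The sketch below assumes a fixed number of coefficients p for all features, which is a simplification: the package may adapt p for each feature depending on which annotation levels are observed.

```r
## Toy matrix: 6 peptides (rows) x 20 cells (columns)
set.seed(2)
x <- matrix(
    rnorm(6 * 20), nrow = 6,
    dimnames = list(paste0("pep", 1:6), NULL)
)
## Introduce an increasing number of missing values per peptide
nMissing <- c(0, 4, 8, 12, 16, 18)
for (i in seq_len(nrow(x))) {
    x[i, seq_len(nMissing[i])] <- NA
}
## n: number of observed values per peptide
n <- rowSums(!is.na(x))
## p: number of coefficients to estimate (assumed constant here)
p <- 4
npRatio <- n / p
npRatio
```

In this toy example, pep6 has fewer observed values than coefficients (n/p below 1), so its model could not be estimated.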

Once the model is estimated, use scpModelFilterPlot() to explore the distribution of n/p ratios across the features.

scpModelFilterPlot(leduc_minimal)

By default, any feature that has an n/p ratio greater than 1 is included in the analysis. However, features with an n/p ratio close to 1 may lead to unreliable outcomes because there are not enough observed data. You can think of the n/p ratio as the average number of replicates per coefficient to estimate. Therefore, you may want to increase the n/p threshold.

scpModelFilterThreshold(leduc_minimal) ## default is 1
scpModelFilterThreshold(leduc_minimal) <- 1.5
scpModelFilterThreshold(leduc_minimal) ## threshold is now 1.5

The plot is automatically updated.

scpModelFilterPlot(leduc_minimal)

There are no guidelines for defining a suitable threshold. If it is too low, you may include noisy peptides that have too few observations. If it is too high, you may remove many informative peptides. The choice of threshold relies on a trade-off between precision and sensitivity.
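
One way to get a feel for this trade-off is to count how many features would be kept at several candidate thresholds. The sketch below uses a made-up vector of n/p ratios; with real data you could use the output of scpModelFilterNPRatio() instead. Whether the package treats the threshold as inclusive or exclusive is an assumption here (the sketch uses >=).

```r
## Made-up n/p ratios for 7 peptides
np <- c(
    pepA = 0.5, pepB = 1, pepC = 1.2, pepD = 1.5,
    pepE = 2.4, pepF = 3, pepG = 5
)
## Candidate thresholds to compare
thresholds <- c(1, 1.5, 2, 3)
## Number of peptides kept at each threshold (inclusive comparison assumed)
nKept <- vapply(thresholds, function(th) sum(np >= th), integer(1))
names(nKept) <- thresholds
nKept
```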

Model exploration: analysis of variance

The variance analysis reports the relative amount of information that is captured by each cell annotation included in the model. It also reports the residual information that is not captured by the model. This offers a first glimpse into what information is contained in the data.

(vaRes <- scpVarianceAnalysis(leduc_minimal))

The results are a list of tables, one table for each variable. Each table reports, for each peptide, the variance captured (SS), the residual degrees of freedom for estimating the variance (df) and the percentage of total variance explained (percentExplainedVar).

vaRes$SampleType
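
The percentage of variance explained can be illustrated for a single simulated peptide with base R. This is a sketch of the general idea, not the package's implementation: it uses a complete, balanced toy design, for which the sums of squares of the model terms and of the residuals add up exactly to the total.

```r
## Balanced toy design: 2 runs x 2 cell types x 5 cells
set.seed(3)
annot <- expand.grid(
    rep = 1:5,
    Set = c("run1", "run2"),
    SampleType = c("Melanoma", "Monocyte")
)
## Simulated log-intensities for one peptide
annot$y <- 8 +
    0.5 * (annot$Set == "run2") +
    1.5 * (annot$SampleType == "Monocyte") +
    rnorm(nrow(annot), sd = 0.3)
fit <- lm(y ~ Set + SampleType, data = annot)
## Centred contribution of each model term, one column per variable
effectMat <- predict(fit, type = "terms")
## Sum of squares captured by each variable and by the residuals
ss <- c(colSums(effectMat^2), Residuals = sum(residuals(fit)^2))
## Percentage of the total variation explained by each component
percentExplainedVar <- 100 * ss / sum((annot$y - mean(annot$y))^2)
round(percentExplainedVar, 1)
```

With unbalanced designs or missing values the components no longer add up this neatly, which is one reason to rely on the package functions for real data.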

By default, we explore the variance for all peptides combined:

scpVariancePlot(vaRes)

We explore the top 10 peptides that have the highest percentage of variance explained by the biological variable (top) or by the batch variable (bottom).

scpVariancePlot(
    vaRes, top = 10, by = "percentExplainedVar", effect = "SampleType",
    decreasing = TRUE, combined = FALSE
) +
    scpVariancePlot(
        vaRes, top = 10, by = "percentExplainedVar", effect = "Set",
        decreasing = TRUE, combined = FALSE
    ) +
    plot_layout(ncol = 1, guides = "collect")

We can also group the peptides by protein. To do so, we first need to add the peptide annotations available from the rowData.

vaRes <- scpAnnotateResults(
    vaRes, rowData(leduc_minimal), by = "feature", by2 = "Sequence"
)
vaRes$SampleType

Then, we draw the same plot, but this time we provide the fcol argument.

scpVariancePlot(
    vaRes, top = 10, by = "percentExplainedVar", effect = "SampleType",
    decreasing = TRUE, combined = FALSE, fcol = "gene"
) +
    scpVariancePlot(
        vaRes, top = 10, by = "percentExplainedVar", effect = "Set",
        decreasing = TRUE, combined = FALSE, fcol = "gene"
    ) +
    plot_layout(ncol = 1, guides = "collect")

In this example data set, the retrieved peptides all belong to different proteins. Grouping becomes more informative when analyzing full data sets, where several peptides map to the same protein.

Alternatively, we can generate protein-level results by aggregating peptide-level results.

vaProtein <- scpVarianceAggregate(vaRes, fcol = "gene")
scpVariancePlot(
    vaProtein, effect = "SampleType", top = 10, combined = FALSE
)

Model exploration: differential abundance analysis

Differential abundance analysis dives deeper into the exploration of the data, specifically the biological effects. Given two groups of interest, such as two cell types or two treatment groups, the differential analysis derives estimated fold changes from the linear model's coefficients. This provides, for each peptide or protein, the amount of change between the two groups and the direction of the change. Moreover, the model provides the uncertainty of the differences, enabling the assessment of statistical significance.

The difference of interest is specified using the contrasts argument. The first element points to the variable to test and the two following elements are the groups of interest to compare. You can provide multiple contrasts in a list.

(daRes <- scpDifferentialAnalysis(
    leduc_minimal,
    contrasts = list(c("SampleType", "Melanoma", "Monocyte"))
))

Similarly to variance analysis, the results are a list of tables, one table for each contrast. Each table reports, for each peptide, the estimated difference between the two groups, the standard error associated with the estimation, the degrees of freedom, the t-statistic, the associated p-value and the p-value FDR-adjusted for multiple testing across all peptides.

daRes$SampleType_Melanoma_vs_Monocyte
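
The columns of such a table can be reproduced in spirit with base R on simulated data. The sketch below is an illustration under simplifying assumptions (complete data, a single two-level variable, Benjamini-Hochberg adjustment via p.adjust()); it is not how the package computes its results, and the column names are chosen for the example only.

```r
## Simulated log-intensities: 50 peptides x 20 cells, 10 cells per group
set.seed(4)
nPep <- 50
group <- factor(rep(c("Melanoma", "Monocyte"), each = 10))
x <- matrix(rnorm(nPep * 20, sd = 0.5), nrow = nPep)
## The first 5 peptides are truly more abundant in monocytes
x[1:5, group == "Monocyte"] <- x[1:5, group == "Monocyte"] + 2
## Fit one linear model per peptide and keep the group coefficient.
## With the default coding, it estimates Monocyte minus Melanoma.
res <- t(apply(x, 1, function(y) {
    coef(summary(lm(y ~ group)))["groupMonocyte", ]
}))
res <- data.frame(
    Estimate = res[, "Estimate"],
    SE = res[, "Std. Error"],
    Df = length(group) - 2,
    tstatistic = res[, "t value"],
    pvalue = res[, "Pr(>|t|)"]
)
## Adjust p-values for multiple testing across all peptides (FDR)
res$padj <- p.adjust(res$pvalue, method = "BH")
head(res)
```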

We then visualize the results using a volcano plot. The function below returns a volcano plot for each contrast.

scpVolcanoPlot(daRes)

Because the data set was subsetted to only a few cells, we lack statistical power. Still, two peptides come out as significant. Again, to better explore the results, we add the peptide annotations available from the rowData, but we also add the n/p ratio as an annotation.

daRes <- scpAnnotateResults(
    daRes, rowData(leduc_minimal),
    by = "feature", by2 = "Sequence"
)
np <- scpModelFilterNPRatio(leduc_minimal)
daRes <- scpAnnotateResults(
    daRes, data.frame(feature = names(np), npRatio = np),
    by = "feature"
)

We plot the same volcano plot, but instead of labeling points with the peptide sequence, we show the associated gene symbol. We can also control point aesthetics by providing a list of ggplot2::geom_point() arguments. For example, we can colour each point based on the n/p ratio, and adjust point size and shape.

scpVolcanoPlot(
    daRes, top = 30, textBy = "gene",
    pointParams = list(aes(colour = npRatio), size = 1.5, shape = 3)
)

We can also provide protein-level results. To do so, scpDifferentialAggregate() relies on the metapod package. Here, we combine the statistical test results for peptides that belong to the same protein using Simes' method. Simes' method will reject the combined null hypothesis (that is, the mean protein intensities are identical between the two groups) if any of the peptide nulls are rejected.

byProteinDA <- scpDifferentialAggregate(
    daRes, fcol = "gene", method = "simes"
)
byProteinDA$SampleType_Melanoma_vs_Monocyte
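
For intuition, Simes' combined p-value for one protein can be written in a few lines of base R. The function below is a plain textbook sketch, not the metapod implementation, and the peptide p-values are made up.

```r
## Simes' combined p-value: min over i of n * p_(i) / i,
## where p_(1) <= ... <= p_(n) are the sorted p-values
simes <- function(p) {
    p <- sort(p[!is.na(p)])
    n <- length(p)
    min(n * p / seq_len(n))
}
## Made-up p-values for three peptides of the same protein
pvals <- c(pepA = 0.01, pepB = 0.04, pepC = 0.5)
simes(pvals)
```

Here the combined p-value is driven by the smallest peptide p-value (3 * 0.01 / 1 = 0.03), which shows how a single strongly changing peptide can make the protein significant.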

Model exploration: component analysis

Variance and differential analysis are not specific to single-cell applications and explore the data without considering cellular heterogeneity. The purpose of the component analysis is to dive into the cellular heterogeneity by representing high-dimensional data in a few informative dimensions for visual exploration. We integrate the component analysis with the linear regression model thanks to the APCA+ (extended ANOVA-principal component analysis) framework developed by Thiel et al. 2017. Briefly, APCA+ explores the reconstructed data that is captured by each variable separately in the presence of the unmodelled data. The advantage of this framework is that it is generic and works for any linear model. This approach is also well suited for single-cell applications, as it enables the visualization and exploration of the effects of a known variable alongside the unmodelled information that contains cellular heterogeneity.

(caRes <- scpComponentAnalysis(
    leduc_minimal, ncomp = 20, method = "APCA", effect = "SampleType"
))

The results are contained in a list with 2 elements. bySample contains the PC scores, that is the component results in cell space. byFeature contains the eigenvectors, that is the component results in peptide space. Each of the two elements contains the component results for the data before modelling (unmodelled), for the residuals and for the APCA on the sample type variable (APCA_SampleType).

(caResCells <- caRes$bySample)
caResCells[[1]]
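
The APCA idea described above (a PCA on the effect of one variable added back to the residuals) can be sketched in base R on simulated, complete data. This is an illustration only: the package has to cope with missing values, which stats::prcomp() does not handle, and all object names below are introduced for the example.

```r
## Balanced toy design: 2 runs x 2 cell types x 5 cells
set.seed(5)
nPep <- 30
annot <- expand.grid(
    rep = 1:5,
    Set = c("run1", "run2"),
    SampleType = c("Melanoma", "Monocyte")
)
## Simulated complete matrix (peptides x cells) with peptide-specific
## cell type effects, stronger run effects and noise
typeShift <- rnorm(nPep)
batchShift <- rnorm(nPep, sd = 2)
x <- outer(typeShift, as.numeric(annot$SampleType == "Monocyte")) +
    outer(batchShift, as.numeric(annot$Set == "run2")) +
    matrix(rnorm(nPep * nrow(annot), sd = 0.3), nrow = nPep)
## Fit the same linear model to all peptides at once (cells in rows)
X <- model.matrix(~ Set + SampleType, data = annot)
fit <- lm.fit(X, t(x))
## Effect matrix of SampleType: design column times estimated coefficients
effect <- X[, "SampleTypeMonocyte", drop = FALSE] %*%
    fit$coefficients["SampleTypeMonocyte", , drop = FALSE]
## APCA-like step: PCA on the effect matrix plus the residuals
apca <- prcomp(effect + fit$residuals, center = TRUE)
## PC scores in cell space, first two components
head(apca$x[, 1:2])
```

Because the run effect is excluded from the matrix passed to the PCA, the first component is driven by the cell type effect rather than by the batch.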

Let's explore the component analysis in cell space. Similarly to the previous explorations, we annotate the results.

leduc_minimal$cell <- colnames(leduc_minimal)
caResCells <- scpAnnotateResults(
    caResCells, colData(leduc_minimal), by = "cell"
)

We then generate the component plot. Providing the pointParams argument, we can shape the points by SampleType. To assess the impact of batch effects, we also colour the points according to the MS acquisition run.

scpComponentPlot(
    caResCells,
    pointParams = list(aes(shape = SampleType, colour = Set))
) |>
    wrap_plots(ncol = 1, guides = "collect")

While the data before modelling is mainly driven by batch effects, the APCA clearly separates the two cell populations. The plot can however only show 2 components at a time. We can explore more components using a subsequent dimension reduction, such as t-SNE. The scater package offers a comprehensive set of tools for dimension reduction on data contained in a SingleCellExperiment object and requires the components to be stored in the reducedDim slot. This is streamlined thanks to addReducedDims().

leduc_minimal <- addReducedDims(leduc_minimal, caResCells)
reducedDims(leduc_minimal)

We can now explore the SampleType effects for the 20 computed components through t-SNE.

library("scater")
leduc_minimal <- runTSNE(leduc_minimal, dimred = "APCA_SampleType")
plotTSNE(leduc_minimal, colour_by = "Set", shape_by = "SampleType") +
    ggtitle("t-SNE on 20 APCA components")

The two cell populations remain clearly separated with an excellent mixing of the acquisition runs, even when considering the first 20 APCA components.

We use the same approach to explore the component results in peptide space.

caResPeps <- caRes$byFeature
caResPeps <- scpAnnotateResults(
    caResPeps, rowData(leduc_minimal), by = "feature", by2 = "Sequence"
)
scpComponentPlot(
    caResPeps, pointParams = list(size = 0.8, alpha = 0.4)
) |>
    wrap_plots(ncol = 1)

This exploration may identify groups of covarying peptides, although no clear patterns appear in the example data set.

We can also combine the exploration of the components in cell and peptide space. This is possible thanks to biplots.

scpComponentBiplot(
    caResCells, caResPeps,
    pointParams = list(aes(colour = SampleType)),
    labelParams = list(size = 1.5, max.overlaps = 15),
    textBy = "gene", top = 10
) |>
    wrap_plots(ncol = 1, guides = "collect")

Finally, we offer functionality to aggregate the results at the protein level instead of the peptide level.

caResProts <- scpComponentAggregate(caResPeps, fcol = "gene")
caResProts$APCA_SampleType

Note that the aggregated tables in caResProts can be explored with the visualisation function scpComponentPlot().

Batch correction

Based on the estimated model, we generate batch-corrected data, that is data with only the effect of cell type and the residual data. We also remove the intercept.

(leduc_batchCorrect <- scpRemoveBatchEffect(
    leduc_minimal, effects = c("Set", "Channel", "MedianIntensity"),
    intercept = TRUE
))

Note that the batch-corrected data still contain missing values. The leduc_batchCorrect object can be used for downstream analysis.
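
The principle behind the correction can be illustrated for one simulated peptide in base R: the corrected values keep the cell type effect and the residuals, and drop the run effect and the intercept. This is a conceptual sketch on a complete, balanced toy design, not the package's implementation.

```r
## Balanced toy design: 2 runs x 2 cell types x 5 cells
set.seed(6)
annot <- expand.grid(
    rep = 1:5,
    Set = c("run1", "run2"),
    SampleType = c("Melanoma", "Monocyte")
)
## Simulated log-intensities with a strong run effect
annot$y <- 8 +
    2 * (annot$Set == "run2") +
    1.5 * (annot$SampleType == "Monocyte") +
    rnorm(nrow(annot), sd = 0.3)
fit <- lm(y ~ Set + SampleType, data = annot)
## Centred contribution of each model term
termEffects <- predict(fit, type = "terms")
## Keep the biological effect and the residuals only
corrected <- termEffects[, "SampleType"] + residuals(fit)
## Mean intensity per run, before and after correction
tapply(annot$y, annot$Set, mean)
tapply(corrected, annot$Set, mean)
```

After correction, the two runs have the same mean while the difference between the cell types is preserved.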

Session information

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
sessionInfo()

License

This vignette is distributed under a CC BY-SA license.

Reference



UCLouvain-CBIO/scp documentation built on Oct. 12, 2024, 2:37 a.m.