NOTES.md

Notes about SCP data analysis

This document summarises thoughts about improving data analysis workflows for mass spectrometry-based single-cell proteomics. The effort here is in response to our recent articles. In the overview of data processing workflows (Vanderaa and Gatto 2023), we highlighted the lack of consensus when processing data, leading to a "one lab, one workflow" situation. Although this might be a consequence of heterogeneous data structures, we could show that this is not the case and that it is rather due to the lack of a principled approach. This document provides future directions for scp and beyond to harmonise current practices.

Spectral data processing

Van der Watt et al. 2021 showed that it is important to benchmark spectrum identification tools. Furthermore, the same lab later reported that the properties of spectra from SCP experiments do not fulfil the assumptions of identification tools designed for bulk proteomics samples (Boekweg et al. 2021).

There is definitely room for improvement; here are some thoughts about SCP spectrum identification:

Data quality controls

Low-quality features (PSMs) and low-quality samples (single cells) may introduce noise (e.g. because of low signal due to sample preparation issues) or bias (e.g. because a PSM is wrongly matched).

It seems sensible to identify and remove low-quality features and samples before they propagate through later processing steps. Data quality control should therefore be the very first step. I find it surprising that many feature or sample quality controls are performed at a later stage. Of all the surveyed QC metrics, none requires upstream processing; they can all be applied immediately.

Feature QC

The most difficult part of sample preparation is to get rid of the background signal/contamination, such as FACS carry-over (cf. discussion with Harrison Specht and Andrew Leduc). Negative controls may be used to model systematic background contamination. This modelling should take into account that the background changes over time, as sketched below.
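A minimal sketch of such a time-aware background model, assuming a data.frame `neg` with one row per negative control acquisition, a `run_order` column (acquisition order) and a `background` column (median log2 intensity of the negative control channel), and a vector `sc_run_order` giving the acquisition order of the single-cell runs; all names are illustrative, not an existing scp API.

## Hypothetical inputs: `neg` (negative control acquisitions) and
## `sc_run_order` (acquisition order of the single-cell runs)
bg_trend <- loess(background ~ run_order, data = neg)
## Expected background for each single-cell run, to be subtracted from
## the single-cell intensities in the same (log) scale
expected_bg <- predict(bg_trend, newdata = data.frame(run_order = sc_run_order))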

Checking for protein coverage has barely been explored in SCP, except by Orsburn et al. 2022.

Sample QC

Identify failed cells and failed runs based on:

Log-transformation

Normalisation

Normalisation with reference channels

A reference channel in single-cell experiments contains about a 5-cell equivalent and is generated by dilution of bulk lysates. This sample is then used to subtract from or divide the peptide intensities of the samples in the same batch. Therefore, we should call this batch correction rather than normalisation. I dislike this approach since noise is associated with each reference acquisition. Olga Vitek suggested including at least 2 reference channels to better estimate the "normalisation" factor. We also refrained from "normalising" in the linear scale (division by the reference), but rather perform it in the log scale (subtraction), as sketched below.
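A minimal sketch of reference-based "normalisation" in the log scale, assuming `x` is a matrix of log2 peptide intensities for one MS run (rows = peptides, columns = channels) and that the reference channel index is known; the function name and arguments are illustrative.

## Subtracting the reference channel in log scale is equivalent to
## dividing by the reference in the linear scale
normalise_by_reference <- function(x, ref_col = 1) {
    sweep(x, 1, x[, ref_col], FUN = "-")
}
## Toy example: 3 peptides x 4 channels
set.seed(1)
x <- matrix(rnorm(12, mean = 10), nrow = 3,
            dimnames = list(paste0("pep", 1:3), paste0("chan", 1:4)))
normalise_by_reference(x)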

Is normalisation really what we want?

Only applicable for multiplexed experiments: since we perform batch correction for MS acquisition runs and for multiplexing labels (TMT, mTRAQ), is normalisation still required?

I answered this question with scplainer, and yes, normalisation is very important and captures cell-specific variation that is not captured by the combined modelling of label and MS batch. scplainer currently uses the median peptide abundance as a normalisation variable during modelling. In this case, normalisation captures both technical variation, such as small differences in sample preparation, and biological variation, for instance differences in cell size or metabolic activity. Whether the latter should be included in or excluded from downstream analyses is a conceptual consideration. My intuition is as follows:

If indeed these two types of abundance matter, the model should be able to decouple cell-specific technical variation from cell-specific biological variation. To do so, we could model the normalisation factor as a function of biological variables such as cell size. For instance, FACS experiments measure SSC and FSC, which provide proxies for granularity and size, respectively. Alternatively, the CellenOne measures cell sizes during dispensing. I however noticed in the schoof2021 dataset a poor correlation between FSC/SSC and the median intensity. One explanation is that the factors that drive the need for normalisation are more technical than biological and happen after cell dispensing. Leduc et al. 2022 observed a good correlation between total protein intensities and cell sizes, as measured by the CellenOne. I could confirm this finding with our modelling approach using minimal data processing. However, the relationship is not perfect, indicating that technical variability influences the intensity. Modelling the normalisation factor as a function of cell size would enable an effective decoupling of biological and technical effects. The estimated trend would represent biological normalisation while the residuals would represent technical normalisation, each being included in the scplainer modelling approach.
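A minimal sketch of this decoupling, assuming a data.frame `coldata` with one row per cell, a `median_intensity` column (the log2 median peptide intensity used as normalisation factor) and a `diameter` column (cell size measured by the CellenOne); the column names are illustrative and this is not the scplainer API.

## Regress the normalisation factor on cell size (hypothetical columns)
fit <- lm(median_intensity ~ diameter, data = coldata)
## Fitted trend: biological normalisation (explained by cell size)
coldata$norm_bio <- fitted(fit)
## Residuals: technical normalisation (not explained by cell size)
coldata$norm_tech <- residuals(fit)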

How to define normalisation factors?

A common practice to estimate sample normalisation factors is to compute the median, mean or total intensity for each cell. I would choose the median over the mean as it is more robust against outliers, but is this the best solution?
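A minimal sketch of median-based normalisation factors, assuming `x` is a matrix of log2 peptide intensities (rows = peptides, columns = cells) with missing values encoded as NA; names are illustrative.

## One normalisation factor per cell: the median observed log2 intensity
norm_factor <- apply(x, 2, median, na.rm = TRUE)
## Centre each cell by its normalisation factor (log scale: subtraction)
x_norm <- sweep(x, 2, norm_factor, FUN = "-")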

Another approach is to align quantiles. While it makes sense for bulk samples, I'm not convinced that different cells should share the same feature intensity distributions, especially in the presence of high levels of missing data.

Another approach would be to take inspiration from the deconvolution approach implemented in the scran package by Lun et al. 2016, but this would require some effort to explore its suitability and to adapt the algorithm.

Batch correction

Include batch correction in the statistical modelling

Replacing values with batch-corrected values ignores the estimation uncertainty.

Include the label factor in the model

It was clear from the Jupyter notebooks of Schoof et al. 2021 that the TMT label has an impact on quantification. We further confirmed, for most multiplexed experiments, that the label has an impact on quantification, but that this impact can be modelled.
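A minimal sketch of including the label factor in the model for a single peptide, assuming a data.frame `df` with one row per cell and columns `intensity` (log2 peptide intensity), `label` (TMT channel), `run` (MS acquisition) and `norm` (median intensity, the normalisation variable); column names are illustrative and this is not the scplainer API.

## Fit label and run as factors alongside the normalisation variable, so
## their effects are estimated rather than removed in a separate step
fit <- lm(intensity ~ norm + factor(run) + factor(label), data = df)
coef(summary(fit))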

Batch correction for label-free data

Imputation

Missing mechanisms

SCP data seem not to be MNAR, but I couldn't find a convincing framework to demonstrate this. Instead, I rely on the relationship between average feature intensity and feature detection rate, as suggested by Li and Smyth 2023. However, their approach does not account for the presence of batch effects and may be biased for SCP.
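A minimal sketch of that intensity/detection relationship, assuming `x` is a matrix of log2 peptide intensities (rows = peptides, columns = cells) with NA for missing values; this is a rough diagnostic, not the Li and Smyth 2023 implementation.

library(ggplot2)
## Average observed intensity and detection rate for each peptide
df <- data.frame(mean_intensity = rowMeans(x, na.rm = TRUE),
                 detection_rate = rowMeans(!is.na(x)))
## MNAR-like missingness shows up as a strong increase of detection rate
## with average intensity
ggplot(df, aes(mean_intensity, detection_rate)) +
    geom_point(alpha = 0.3) +
    geom_smooth()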

I think SCP data are therefore mostly MCAR. I find that the MCAR definition carries only little information. It says that missingness is unrelated to the underlying value, but there must be other reasons for values to be missing for which we could model the mechanism. For instance, in DDA, we could model the detection rate of a peptide with respect to the MS1 rank of its precursor in each run: is a peptide not detected because it was not selected for MS2? For DIA, we could adapt the approach and model the detection rate with the expected spectral complexity determined in MS1. Other variables could be peptide properties. Could we predict peptide ionisation based on its physico-chemical properties?
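A minimal sketch of such a mechanism model for DDA, assuming a data.frame `df` with one row per peptide and run, a logical `detected` column and an `ms1_rank` column (the MS1 intensity rank of the precursor in that run); names are illustrative.

## Logistic regression of detection on the precursor's MS1 rank
fit <- glm(detected ~ ms1_rank, data = df, family = binomial())
summary(fit)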

Is imputation really what we want?

We discussed this in Vanderaa and Gatto 2023.

Some alternative approaches that serve as inspiration:

Multiple imputation as a proxy to imputation uncertainty

Benchmarking imputation for SCP

Aggregation

## Explore duplicated precursors in the specht2019v3 dataset: the same
## peptidoform (set, sequence, charge, modifications) identified several
## times within a run
library(scpdata)
library(dplyr)
library(ggplot2)
specht <- specht2019v3()
featAnn <- rowData(specht[[1]])
featAnn$ID <- paste0(featAnn$Set, featAnn$Sequence, featAnn$Charge,
                     featAnn$Modifications)
dupl <- featAnn$ID[duplicated(featAnn$ID)]
## Plot the DART-ID q-values of the duplicated precursors along the
## retention time for the first run
featAnn %>%
    data.frame %>%
    filter(Set == Set[[1]],
           ID %in% dupl) %>%
    ggplot() +
    aes(y = dart_qval,
        x = Retention.time,
        colour = ID,
        group = ID) +
    geom_point() +
    geom_line() +
    theme(legend.position = "none")

Include aggregation in the model

Perform protein-level inference using peptide-level data, as done in the MSqRob model.
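A minimal sketch of that idea with a mixed model, assuming a data.frame `df` for one protein with one row per peptide and cell and columns `intensity` (log2), `peptide`, `cell` and `group` (e.g. cell type); this mimics the spirit of MSqRob but is not its implementation, and all names are illustrative.

library(lme4)
## Estimate the group effect at the protein level while accounting for
## peptide-to-peptide and cell-to-cell variability
fit <- lmer(intensity ~ group + (1 | peptide) + (1 | cell), data = df)
summary(fit)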

Modelling SCP data

Include latent variables

For most datasets, linear regression is sufficient and we can retrieve unknown sources of variation from the residuals. This two-step approach could be formalised through the optimisation of latent variables, as performed in RUV-III-C (Poulos et al. 2020) or ZINB-WaVE (Risso et al. 2018).
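A minimal sketch of the current two-step approach, assuming `resid_mat` is a complete matrix of model residuals (rows = peptides, columns = cells); a PCA on the residuals recovers candidate latent sources of variation. Names are illustrative.

## PCA on the residuals of the linear model; the leading components are
## candidate latent variables, one value per cell
pca <- prcomp(t(resid_mat), center = TRUE, scale. = FALSE)
latent <- pca$x[, 1:2]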


