Single Cell Proteomics data processing and analysis.

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "#>",
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)

The scp package

The scp package is used to process and analyse mass spectrometry-based single cell proteomics (SCP) data. It relies on the QFeatures (@QFeatures) package to manage and process SingleCellExperiment (@sce) objects.

knitr::include_graphics("./figures/SCP_framework.png", error = FALSE)

This vignette will guide you through the common steps of mass spectrometry-based single-cell proteomics data analysis. To start, we load the scp package.

library("scp")

We also load ggplot2, magrittr and dplyr for convenient data manipulation and plotting.

library("ggplot2")
library("magrittr")
library("dplyr")

Read in SCP data

The workflow starts with reading in the tabular quantification data generated by, for example, MaxQuant (@Cox:2008). We created a small example dataset by subsetting the MaxQuant table provided in the SCoPE2 preprint (@Specht2019-jm). The mqScpData table is a typical example of what you would get after reading in a CSV file using read.csv or read.table. See ?mqScpData for more information about the table content.

data("mqScpData")
dim(mqScpData)
mqScpData[1:10, 1:5]

In order to convert this tabular data to a scp-compatible QFeatures object, we need to provide a metadata table where rows contain sample information and columns must contain at least:

  - the names of the columns in the quantification table that hold the quantitative data (here the Channel column), and
  - the batch names that link each sample to its acquisition run (here the Set column).

Any additional information about the samples will be stored in the colData.

We provide an example of such a data frame. It was formatted from the annotation table provided in the SCoPE2 preprint. See ?sampleAnnotation for more information about the table content.

data("sampleAnnotation")
sampleAnnotation

The two tables are supplied to the readSCP function.

scp <- readSCP(quantTable = mqScpData,
               metaTable = sampleAnnotation,
               channelCol = "Channel",
               batchCol = "Set")

As indicated by the output on the console, readSCP proceeds as follows:

  1. If quantTable is the path to a CSV file, it reads the file using read.csv. The table is converted to a SingleCellExperiment object. readSCP needs to know which columns hold the quantitative data; those column names are provided by the channelCol field of metaTable. So in this example, the names of the columns holding the quantitative data in mqScpData are stored in the column named Channel in sampleAnnotation.

  2. The SingleCellExperiment object is then split by batch. The split is performed according to the batchCol field of quantTable; in this case, the data are split according to the Set column in mqScpData.

  3. The sample metadata is generated from the metaTable. Note that in order for readSCP to correctly match the feature data with the metadata, metaTable must also contain the batchCol field with batch names.

  4. Finally, the split feature data and the sample metadata are stored in a single QFeatures object.

Here is a compact overview of the data:

scp

We can see that the scp object we created is a QFeatures object containing 4 assays. Each assay has an associated name, namely the batch name that was used for splitting. We can also see that each assay is a SingleCellExperiment object. The rows represent the peptide-to-spectrum matches (PSMs), and their number varies between batches. Finally, all four assays contain 16 columns that correspond to the 16 TMT channels recorded during the 4 MS runs.

Clean missing data

Single-cell (proteomics or transcriptomics) data contain many zeros. The zeros can be biological zeros or technical zeros, and differentiating between the two types is very hard. To avoid artefacts in downstream steps, we replace the zeros with the missing value NA. The zeroIsNA function takes the QFeatures object and the name(s) or index/indices of the assay(s) to clean, and automatically replaces any zero in the selected quantitative data with NA.

scp <- zeroIsNA(scp, i = 1:4)

Filter PSMs

A common step in SCP is to filter out low-confidence PSMs. Each PSM assay contains feature meta-information stored in the assay's rowData. The QFeatures package allows the rows of an assay to be quickly filtered using this information. The available variables in the rowData are listed below for each assay.

rowDataNames(scp)

Filter features based on feature metadata

Below are some examples of criteria used to identify low-confidence PSMs. The information is readily available since it was computed by MaxQuant:

  - the PSM is matched to the decoy database (Reverse),
  - the associated protein is matched to the decoy or contaminant databases (protein, Potential.contaminant),
  - the parental ion fraction (PIF) is low, indicating co-isolation of contaminating ions.

We can perform this filtering using the filterFeatures function. filterFeatures automatically accesses the feature metadata and selects the rows that meet the provided condition(s). For instance, Reverse != "+" keeps the rows for which the Reverse variable in the rowData is not "+" (i.e. the PSM is not matched to the decoy database).

scp <- filterFeatures(scp,
                      ~ Reverse != "+" &
                        !grepl("REV|CON", protein) &
                        Potential.contaminant != "+" &
                        !is.na(PIF) & PIF > 0.8)

Filter assays based on detected features

To avoid proceeding with failed runs, another useful filter is to remove assays with too few features. If a batch contains fewer than, for example, 150 features, we can suspect that something went wrong in that batch and it should be removed. Using dims, we can query the dimensions (that is, the number of features and the number of samples) of all assays contained in the dataset.

dims(scp)

Actually, a QFeatures object can be seen as a three-dimensional array: $features \times samples \times assays$. Hence, QFeatures supports three-dimensional subsetting, x[rows, columns, assays]. We first select the assays that have sufficient PSMs (more than 150 rows), and then subset the scp object for the assays that meet this criterion.

keepAssay <- dims(scp)[1, ] > 150
scp <- scp[, , keepAssay]
scp

Notice the 190321S_LCA10_X_FP97_blank_01 sample was removed because it did not contain sufficient features, as expected from a blank sample. This could also have been the case for failed runs.

Filter features based on SCP metrics

Another type of filtering is specific to SCP. In the SCoPE2 analysis, the authors suggest a filter based on the sample to carrier ratio (SCR), that is, the reporter ion intensity of a single-cell sample divided by the reporter ion intensity of the carrier channel (200 cells) from the same batch. The carrier intensities are expected to be much higher than the single-cell intensities.
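As a toy numeric illustration (with hypothetical intensity values, independent of the scp objects), the mean SCR for one PSM boils down to:

```r
## Hypothetical reporter ion intensities for one PSM
ri <- c(carrier = 200, cell1 = 1.2, cell2 = 0.8)
## Mean sample to carrier ratio over the single-cell channels
mean(ri[c("cell1", "cell2")] / ri["carrier"])
## [1] 0.005  -- close to the expected 1/200
```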

The SCR can be computed using the computeSCR function from scp. The function must be told which channels are the samples that must be divided and which channel contains the carrier. This information is provided in the sample metadata and is accessed using the colData, under the SampleType field.

colData(scp)[, "SampleType"] %>%
  table

In this dataset, SampleType gives the type of sample present in each TMT channel. The SCoPE2 protocol includes 5 types of samples:

  - carrier samples (Carrier), containing 200 cell equivalents, that boost peptide identification,
  - reference samples (Reference), containing 5 cell equivalents, used for relative quantification,
  - unused channels (Unused), that are left empty,
  - blank samples (Blank), that contain no cells,
  - single-cell samples, containing either a macrophage (Macrophage) or a monocyte (Monocyte).

The computeSCR function expects the following input:

  - the QFeatures dataset and the indices or names of the assays to process (i),
  - the variable of the colData that annotates the samples (colDataCol),
  - a regular expression matching the carrier channel (carrierPattern),
  - a regular expression matching the single-cell channels (samplePattern).

The function creates a field named .meanSCR and stores it in the rowData of each assay.

scp <- computeSCR(scp,
                  i = 1:3,
                  colDataCol = "SampleType",
                  carrierPattern = "Carrier",
                  samplePattern = "Macrophage|Monocyte")

Before applying the filter, we plot the distribution of the average SCR. We can extract the .meanSCR variable from the rowData of several assays using the rowDataToDF function. It takes the rowData field(s) of interest and returns a DataFrame table.

scp %>%
  rowDataToDF(i = 1:3,
              vars = ".meanSCR") %>%
  data.frame %>%
  ggplot(aes(x = .meanSCR)) +
  geom_histogram() +
  geom_vline(xintercept = 0.1)

Most values are close to 0.02, as expected since the experimental ratio is 1/200. A few points have a higher signal than expected. We therefore filter out those points using a cutoff of 0.1, again using the filterFeatures function.

scp <- filterFeatures(scp,
                      ~ !is.na(.meanSCR) &
                        .meanSCR < 0.1)

Filter out PSMs with high false discovery rate

Finally, a last PSM filter criterion is the false discovery rate (FDR) for identification. Filtering on the PEP is too conservative (@Kall2008-hb), so we provide the computeFDR function to convert PEPs to FDRs. Besides the dataset and the assay(s) for which to compute the FDR, we also need to give the feature grouping variable, here the peptide sequence, and the variable containing the PEPs. Both are contained in the feature metadata.
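The idea behind the conversion (following @Kall2008-hb) is that the expected FDR among all features passing a given PEP threshold is the average of their PEPs. A minimal sketch with hypothetical PEP values, independent of the scp implementation (which additionally groups PSMs by peptide):

```r
## Sorted posterior error probabilities for 4 hypothetical peptides
pep <- c(0.001, 0.002, 0.010, 0.050)
## Estimated FDR at each PEP cutoff: running mean of the PEPs
fdr <- cumsum(pep) / seq_along(pep)
round(fdr, 5)
## [1] 0.00100 0.00150 0.00433 0.01575
```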

scp <- computeFDR(scp,
                  i = 1:3,
                  groupCol = "peptide",
                  pepCol = "dart_PEP")

Note that a new variable .FDR containing the computed FDRs is added to the rowData. We keep the PSMs that have an associated peptide FDR smaller than 1\%.

scp <- filterFeatures(scp,
                      ~ .FDR < 0.01)

Process the PSM data

Relative reporter ion intensity

In order to partially correct for between-run variation, SCoPE2 suggests computing relative reporter ion intensities. This means that the intensities measured for single cells are divided by the reference channel containing 5-cell equivalents. We use the divideByReference function that divides channels of interest by the reference channel. Similarly to computeSCR, we can point to the sample and reference columns in each assay using the annotation contained in the colData.

We here divide all columns (using the regular expression wildcard .) by the reference channel (Reference).

scp <- divideByReference(scp,
                         i = 1:3,
                         colDataCol = "SampleType",
                         samplePattern = ".",
                         refPattern = "Reference")

Aggregate PSM data to peptide data

Now that the PSM assays are processed, we can aggregate them to peptides. This is performed using the aggregateFeaturesOverAssays function. For each assay, the function aggregates several PSMs into a unique peptide. This is best illustrated by the figure below.

knitr::include_graphics("./figures/feature_aggregation.png", error = FALSE)

Remember there currently are three assays containing the PSM data.

scp

The PSMs are aggregated over the fcol feature variable, here peptide. We also need to supply an aggregating function that tells how to combine the quantitative data of the PSMs to aggregate. We here use the median value. We name the aggregated assays by prefixing the original names with peptides_.

scp <- aggregateFeaturesOverAssays(scp,
                                   i = 1:3,
                                   fcol = "peptide",
                                   name = paste0("peptides_", names(scp)),
                                   fun = matrixStats::colMedians, na.rm = TRUE)

Notice that 3 new assays were created in the scp object. Those new assays contain the aggregated features, while the first three assays are unchanged. This allows us to keep track of the data processing.

scp

Under the hood, the QFeatures architecture preserves the relationship between the aggregated assays. See ?AssayLinks for more information on relationships between assays.

Join the SCoPE2 sets in one assay

Up to now, we kept the data belonging to each MS run in separate assays. We now combine all batches into a single assay. This is done using the joinAssays function from the QFeatures package. Note that we now use the aggregated assays, that is assays 4 to 6.

scp <- joinAssays(scp,
                  i = 4:6,
                  name = "peptides")

In this case, one new assay is created in the scp object that combines the data from assays 4 to 6. The samples are always distinct, so the number of columns in the new assay (here $48$) always equals the sum of the columns in the assays to join (here $16 + 16 + 16$). The joined assay might however contain fewer features than the sum of the rows of the assays to join, since features common to several assays are joined in a single row.

scp
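We can check the column arithmetic programmatically (a sketch, assuming the assay indices and name used above):

```r
## The joined assay should contain as many samples as assays 4 to 6 combined
stopifnot(ncol(scp[["peptides"]]) == sum(dims(scp)[2, 4:6]))
```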

Filter single-cells

Another common step in single-cell data analysis pipelines is to remove low-quality cells. After subsetting for the samples of interest, we will use 2 metrics: the median relative intensity per cell and the median coefficient of variation (CV) per cell.

Filter samples of interest

We first subset the cells of interest, that is the blank samples (sc_0), the macrophages (sc_m0) and the monocytes (sc_u).

We extract the SingleCellExperiment assay with the joined peptides, subset the channels corresponding to blanks, macrophages or monocytes, and add it as a new assay in the QFeatures object. However, the sample annotation is contained in the colData of the QFeatures dataset, whereas we need to access it from the SingleCellExperiment object. Therefore, we provide the transferColDataToAssay function to copy the sample metadata from the QFeatures object to a target assay.

colData(scp[["peptides"]])
scp <- transferColDataToAssay(scp, "peptides")
colData(scp[["peptides"]])

Once the metadata is transferred, we can subset the SingleCellExperiment assay.

sce <- scp[["peptides"]]
sce <- sce[, sce$SampleType %in% c("Blank", "Macrophage", "Monocyte")]

Then we add this assay back into the QFeatures object. This is done using the addAssay function. We also preserve the links between features using addAssayLinkOneToOne. Since the features did not change (only the samples did), one-to-one links between features are added.

scp %>%
  addAssay(y = sce,
           name = "peptides_filter1") %>%
  addAssayLinkOneToOne(from = "peptides",
                       to = "peptides_filter1") ->
  scp

Note that the added assay peptides_filter1 contains the same number of rows as its parent assay peptides, but fewer samples, as 20 samples are not single cells or blank samples but rather unused, reference or carrier samples.

scp

Filter based on the median relative intensity

We compute the median relative reporter ion intensity for each cell separately and apply a filter based on this statistic. This procedure recalls that of library size filtering commonly performed in scRNA-Seq data analysis, where the library size is the sum of the counts in each single cell. We will store the median intensity in the colData of the peptides_filter1 assay (so the SingleCellExperiment object) because this metric is specific to that assay. The medians are computed on the quantitative data using colMedians that requires a data matrix. The data matrix can be extracted from a SingleCellExperiment using the assay function.

scp[["peptides_filter1"]] %>%
  assay %>%
  matrixStats::colMedians(na.rm = TRUE) ->
  scp[["peptides_filter1"]]$MedianRI

Looking at the distribution of the median per cell can highlight low-quality cells.

scp[["peptides_filter1"]] %>%
  colData %>%
  data.frame %>%
  ggplot(aes(x = MedianRI, fill = SampleType)) +
  geom_boxplot() +
  scale_x_log10()

Here all values seem reasonable, so no filtering is needed. If that were not the case, the same procedure as in the previous section could be used to select the cells that have an associated median RI lower than a defined threshold.

Filter based on the median CV

The median CV measures the consistency of quantification for the groups of peptides that belong to a protein. We remove cells that exhibit a high median CV over the different proteins. We compute the median CV per cell using the computeMedianCV function from the scp package. The function takes the peptides_filter1 assay, the protein and peptide information from the assay rowData, and the batch names from the colData of the QFeatures object. The peptide and batch information is required to perform normalization prior to the CV computation. The protein information is required since the CVs are computed at the protein level. The computed median CVs are automatically stored in the colData of the SingleCellExperiment assay under the .MedianCV field.
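Conceptually, for a single cell the metric boils down to the following toy sketch (hypothetical intensities; it ignores the normalization steps performed by computeMedianCV):

```r
## Hypothetical peptide intensities for one cell, grouped by protein
x <- list(protA = c(1.0, 1.2, 0.9),
          protB = c(2.0, 2.5, 2.2))
## CV (sd/mean) per protein, then the median over proteins
cvs <- sapply(x, function(p) sd(p) / mean(p))
median(cvs)
```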

scp <- computeMedianCV(scp,
                       i = "peptides_filter1",
                       proteinCol = "protein",
                       peptideCol = "peptide",
                       batchCol = "Set")

The computed CVs are stored in the colData of the peptides_filter1 assay and hold the median CV per cell, computed using at least 5 observations (peptides). The main interest of computing the median CV per cell is to filter out cells with unreliable quantification. The blank samples are not expected to have reliable quantifications and hence can be used to estimate an empirical null distribution of the CV. This distribution helps define a threshold to filter out single cells that contain noisy quantifications.
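Instead of picking the cutoff by eye, a threshold could also be derived from the blanks' empirical distribution, for example a lower quantile (a sketch, assuming the colData fields computed above):

```r
## Hypothetical data-driven threshold: 5th percentile of the blanks' median CV
cd <- colData(scp[["peptides_filter1"]])
quantile(cd$.MedianCV[cd$SampleType == "Blank"], 0.05, na.rm = TRUE)
```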

scp[["peptides_filter1"]] %>%
  colData %>%
  data.frame %>%
  ggplot(aes(x = .MedianCV,
             fill = SampleType)) +
  geom_boxplot() +
  geom_vline(xintercept = 0.3)

We can see that the protein quantification for single cells is much more consistent within cells than within the blank channels. Based on the distribution of the blanks, we decide to keep the cells that have a median CV lower than 0.3. Note that this example is inaccurate because the null distribution is based on only 3 blank samples; more sets would lead to a better estimation of the CV null distribution. Note that we also keep only the samples of interest (macrophages and monocytes), since all QC metrics are now computed.

keepSample <- scp[["peptides_filter1"]]$.MedianCV < 0.3 &
  scp[["peptides_filter1"]]$SampleType %in% c("Macrophage", "Monocyte")
keepSample[is.na(keepSample)] <- FALSE

Then, we subset the assay for samples that pass the filter.

sce <- scp[["peptides_filter1"]]
sce <- sce[, keepSample]

Finally, we add the filtered assay to the QFeatures object and create the required link with the previous assay.

addAssay(scp,
         y = sce,
         name = "peptides_filter2") %>%
  addAssayLinkOneToOne(from = "peptides_filter1",
                       to = "peptides_filter2") ->
  scp

Process the peptide data

In this vignette, the peptide data are further processed before aggregation to proteins. The steps are: normalization, filtering of peptides based on missing data, and log-transformation.

Normalization

The columns (samples) of the peptide data are first normalized by dividing the relative intensities by the median relative intensities. Then, the rows (peptides) are normalized by dividing the relative intensities by the mean relative intensities. The normalized data are stored in a separate assay. This normalization procedure is suggested in the SCoPE2 analysis and is applied using the sweep method. Besides the dataset and the assay to normalize, the method expects MARGIN, indicating a row-wise (1) or column-wise (2) transformation, the FUN function to apply, and STATS, a vector of values to apply. More conventional normalization procedures can be found in ?QFeatures::normalize.
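To make the MARGIN/FUN/STATS semantics concrete, here is the behaviour of base R's sweep on a plain matrix (a toy example, independent of the scp data):

```r
## Toy matrix: 3 peptides (rows) x 2 samples (columns)
m <- matrix(c(1, 2, 3, 2, 4, 6), nrow = 3)
## Divide each column by its median (column-wise, MARGIN = 2)
sweep(m, MARGIN = 2, STATS = apply(m, 2, median), FUN = "/")
## Each column of the result now has median 1
```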

scp %>%
    ## Divide columns by median
    sweep(i = "peptides_filter2",
        MARGIN = 2,
        FUN = "/",
        STATS = colMedians(assay(scp[["peptides_filter2"]]),
                           na.rm = TRUE),
        name = "peptides_norm_col") %>%
    ## Divide rows by mean
    sweep(i = "peptides_norm_col",
        MARGIN = 1,
        FUN = "/",
        STATS = rowMeans(assay(.[["peptides_norm_col"]]),
                         na.rm = TRUE),
        name = "peptides_norm") ->
    scp

Notice each call to sweep created a new assay.

scp

Remove peptides with high missing rate

Peptides that contain many missing values are not informative. Therefore, another common procedure is to remove highly missing data. In this example, we remove peptides with more than 99 \% missing data. This is done using the filterNA function from QFeatures.

scp <- filterNA(scp,
                i = "peptides_norm",
                pNA = 0.99)

Log-transformation

In this vignette, we perform log2-transformation using the logTransform method from QFeatures. Other log-transformations can be applied by changing the base argument.

scp <- logTransform(scp,
                    base = 2,
                    i = "peptides_norm",
                    name = "peptides_log")

Similarly to sweep, logTransform creates a new assay in scp.

Aggregate peptide data to protein data

Similarly to aggregating PSM data to peptide data, we can aggregate peptide data to protein data using the aggregateFeatures function.

scp <- aggregateFeatures(scp,
                         i = "peptides_log",
                         name = "proteins",
                         fcol = "protein",
                         fun = matrixStats::colMedians, na.rm = TRUE)

The only difference between aggregateFeatures and aggregateFeaturesOverAssays is that the latter can aggregate several assays at once, whereas the former takes a single assay to aggregate. Hence, only a single new assay, proteins, was created in the scp object.

scp

After the second aggregation, the proteins assay in this example contains quantitative information for 89 proteins in 15 single-cells.

Process the protein data

The protein data is further processed in three steps: normalization, imputation (using the KNN algorithm) and batch correction (using the ComBat algorithm).

Normalization

Normalization is performed similarly to peptide normalization. We use the same functions, but since the data were log-transformed at the peptide level, we subtract by the statistic (median or mean) instead of dividing.

scp %>%
    ## Center columns with median
    sweep(i = "proteins",
          MARGIN = 2,
          FUN = "-",
          STATS = colMedians(assay(scp[["proteins"]]),
                             na.rm = TRUE),
          name = "proteins_norm_col") %>%
    ## Center rows with mean
    sweep(i = "proteins_norm_col",
          MARGIN = 1,
          FUN = "-",
          STATS = rowMeans(assay(.[["proteins_norm_col"]]),
                           na.rm = TRUE),
          name = "proteins_norm") ->
    scp

Imputation

The protein data contains a lot of missing values.

scp[["proteins_norm"]] %>%
  assay %>%
  is.na %>%
  mean

The average missingness in the proteins assay is around 25 \%. Including more samples, and hence more batches, can increase the missingness up to 70 \%, as seen for the complete SCoPE2 dataset (@Specht2019-jm). Whether imputation is beneficial or deleterious for the data will not be discussed in this vignette, but taking those missing values into account is essential to avoid artefacts in downstream analyses. The data imputation is performed using the k-nearest neighbors algorithm, with k = 3. This is available from the impute method. More details about the arguments can be found in ?impute::impute.knn.

scp <- impute(scp,
              i = "proteins_norm",
              method = "knn",
              k = 3, rowmax = 1, colmax= 1,
              maxp = Inf, rng.seed = 1234)

Note that after imputation, no values are missing.

scp[["proteins_norm"]] %>%
  assay %>%
  is.na %>%
  mean

Batch correction

A very important step in processing SCP data is to correct for batch effects. Batch effects are caused by technical variation occurring during different MS runs. Since only a small number of single cells can be acquired at once, batch effects are unavoidable.

The ComBat function from the sva package can be used to perform batch correction as it is performed in the SCoPE2 analysis. We do not claim that ComBat is the best algorithm for batch correcting SCP data and other batch correcting methods could be used using the same procedure.

We first extract the assay to process.

scp <- transferColDataToAssay(scp, i = "proteins_norm")
sce <- scp[["proteins_norm"]]

Next, we need to provide a design matrix and the batch annotation to ComBat. The design matrix protects the variables of interest, in our case SampleType.

batch <- colData(sce)$Set
model <- model.matrix(~ SampleType, data = colData(sce))

We then load and call ComBat and overwrite the data matrix. Recall the data matrix can be accessed using the assay function.

library(sva)
assay(sce) <- ComBat(dat = assay(sce),
                     batch = batch,
                     mod = model)

Finally, we add the batch corrected assay to the QFeatures object and create the feature links.

addAssay(scp,
         y = sce,
         name = "proteins_batchC") %>%
  addAssayLinkOneToOne(from = "proteins_norm",
                       to = "proteins_batchC") ->
  scp

Dimension reduction

Because each assay contains SingleCellExperiment objects, we can easily apply methods developed in the scRNA-Seq field. A useful package for dimension reduction on single-cell data is scater.

library(scater)

This package provides streamlined functions to compute various dimension reductions such as PCA, UMAP, t-SNE, NMF and MDS.

PCA

PCA can be computed using the runPCA method. It returns a SingleCellExperiment object for which the dimension reduction results are stored in the reducedDim slot.

scp[["proteins_batchC"]] <- runPCA(scp[["proteins_batchC"]],
                                   ncomponents = 5,
                                   ntop = Inf,
                                   scale = TRUE,
                                   exprs_values = 1,
                                   name = "PCA")

The computed PCA can be displayed using the plotReducedDim function. The dimred argument gives the name of the dimension reduction results to plot; here we called it PCA. The samples are colored by sample type.

plotReducedDim(scp[["proteins_batchC"]],
               dimred = "PCA",
               colour_by = "SampleType",
               point_alpha = 1)

This is a minimalistic example with only a few plotted cells, but the original SCoPE2 dataset contained more than a thousand cells.

UMAP

Similarly to PCA, we can compute a UMAP using the runUMAP method. Note however that the UMAP implementation requires an initialization, usually provided by PCA. The previous PCA results are used automatically when supplying dimred = "PCA" (PCA is the name of the dimension reduction result that we supplied in the previous section).

scp[["proteins_batchC"]] <- runUMAP(scp[["proteins_batchC"]],
                                    ncomponents = 2,
                                    ntop = Inf,
                                    scale = TRUE,
                                    exprs_values = 1,
                                    n_neighbors = 3,
                                    dimred = "PCA",
                                    name = "UMAP")

The computed UMAP can be displayed using the plotReducedDim function. The dimred argument gives the name of the dimension reduction results to plot; here we called it UMAP. The samples are colored by sample type.

plotReducedDim(scp[["proteins_batchC"]],
               dimred = "UMAP",
               colour_by = "SampleType",
               point_alpha = 1)

The UMAP plot is most interesting for large datasets. A UMAP on this small example dataset is not very informative, but is shown for illustration.

Monitoring data processing

The QFeatures plot shows the quantitative data for a feature at the different expression levels. For instance, suppose we are interested in the protein LMNA (protein ID P02545). A useful QC is to monitor the data processing at the PSM, peptide and protein levels. This can easily be done thanks to the QFeatures framework. Using subsetByFeature, we can extract the protein of interest and its related features in the other assays. The data is formatted to a long table that can easily be plugged into the ggplot2 visualization tool.

scp %>%
  ## Get the features related to LMNA (P02545)
  subsetByFeature("P02545") %>%
  ## Format the `QFeatures` to a long format table
  longFormat(colDataCols = c("Set", "SampleType", "Channel")) %>%
  data.frame %>%
  ## This is used to preserve ordering of the samples and assays in ggplot2
  mutate(assay = factor(assay, levels = names(scp)),
         Channel = sub("RI", "TMT-", Channel),
         Channel = factor(Channel, levels = unique(Channel))) %>%
  ## Start plotting
  ggplot(aes(x = Channel, y = value, group = rowname, col = SampleType)) +
  geom_point() +
  ## Plot every assay in a separate facet
  facet_wrap(facets = vars(assay), scales = "free_y", ncol = 3) +
  ## Annotate plot
  xlab("Channels") +
  ylab("Intensity (arbitrary units)") +
  ## Improve plot aspect
  theme(axis.text.x = element_text(angle = 90),
        strip.text = element_text(hjust = 0),
        legend.position = "bottom")

This graph helps to keep track of the data processing and highlights anomalies in the data. For instance, we can see that no data were recorded for 5 channels (TMT-12 to 16). Those channels contain either macrophages or monocytes (from the experimental design) and are expected to contain information. Thanks to this diagnostic plot, we reported the issue to Specht and colleagues and this led to version 3 of the SCoPE2 dataset.

Session information {-}

knitr::opts_chunk$set(
    collapse = TRUE,
    comment = "",
    crop = NULL
)
sessionInfo()

Reference




scp documentation built on Nov. 8, 2020, 8:20 p.m.