```r
## Options for Rmarkdown compilation
knitr::opts_chunk$set(
    fig.width = 7, fig.height = 5, fig.align = "center",
    out.width = "70%", message = FALSE, collapse = TRUE,
    crop = NULL ## Related to https://stat.ethz.ch/pipermail/bioc-devel/2020-April/016656.html
)
## Time the compilation
timeStart <- Sys.time()
```
Williams et al. present an auto-sampling device to interface the nanoPOTS processing workflow with LC-MS/MS. This allows for increased automation of the technology, which is critical for advancing the field towards real single-cell proteomics applications.
The authors did not provide the code required to fully reproduce their analysis. We therefore based our replication solely on the experimental section. The authors mention:
> For TMT-based quantification, corrected reporter ion intensities were extracted. Reporter ion intensities from single cells were normalized to the reference channel containing 0.2 ng of a peptide mixture from the three AML cell lines at PSM level using MaxQuant (v. 1.6.12.0). To minimize the batch effect from multiple TMT experiments, the relative abundances from 19 TMT plexes were log2-transformed and the data matrices from all of the TMT experiments were combined after sequentially centering the column and row values to their median values. A criterion of >70% valid values and at least two identified peptides per protein were required to generate the "quantifiable" protein list. The data matrix was further analyzed by Perseus for statistical analysis including princip[al] component analysis (PCA) and heatmap analysis.
Furthermore, the authors did not provide their processed data, so we will benchmark our replication against the figures in the article.
The authors however distribute the MaxQuant output that we have already converted to the `scp` data framework (@Vanderaa2021-ue). See the `scp` vignette for more information about the framework. The `QFeatures` object containing the data sets from Williams et al. is available from the `scpdata` package. Let's load the required packages before starting the replication.
```r
library(scpdata)
library(scp)
library(ComplexHeatmap)
library(scater)
```
We can now load the multiplexed (TMT) SCP data set from Williams et al. This is performed by calling the `williams2020_tmt()` function from `scpdata` (@Vanderaa2022-qv).
```r
(williams <- williams2020_tmt())
```
The data contain 4 different `SingleCellExperiment` objects that we refer to as assays. Each assay contains expression data along with feature metadata. Each row in an assay represents a feature that is either a peptide (assays 1-2) or a protein (assays 3-4). The protein and peptide data are split in two assays, containing either the summed peptide/protein intensities, or the corrected (reference normalization) quantifications. We also store the relationships between features of different assays. For instance, peptides in `peptides_intensity` are linked to proteins in `proteins_intensity`. We can plot these relationships to get an overview of the data set.
```r
plot(williams)
```
Each column (n = 209) in an assay represents a sample. We can easily retrieve the sample annotations from the `colData`.
```r
colData(williams)
table(williams$SampleType)
```
Most of the samples are single cells, either CMK, K562 or MOLM-14. The remaining samples are either booster channels (`Boost`) with 10 ng of protein lysate used to enhance peptide identification, reference channels (`Reference`) with 0.2 ng of protein lysate used for reference normalization by MaxQuant, or empty channels (`Empty`) with no material but containing spillover signal from the booster channel.
The first step of the workflow is to remove reverse hits and contaminants. Before performing this step, we subset the data to only single-cell samples. This subset is performed on the `Amount` variable from the `colData`, so we use `subsetByColData()`.
```r
(williams <- subsetByColData(williams, williams$Amount == "1cell"))
```
We are left with a data set containing 152 single cells. We can now filter out the reverse hits and contaminants. The required information is contained in the `rowData` of each assay (`Reverse` and `Potential.contaminant`, respectively). We use `filterFeatures()` to filter on the `rowData`.
```r
(williams <- filterFeatures(williams,
                            ~ Reverse != "+" &
                                Potential.contaminant != "+"))
```
We are left with 17,735 peptides and 2,583 proteins that pass the quality control.
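As a quick sanity check (a minimal snippet we added, not part of the original workflow), we can confirm these counts from the dimensions of each assay using `dims()` from `QFeatures`:

```r
## Number of features (first row) and samples (second row) per assay
dims(williams)
```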
The authors next mention they log2-transformed the protein data. They however do not mention how they deal with zero values, which become infinite upon any logarithmic transformation. In this replication study, we replace zeros with missing values using `zeroIsNA()`.
```r
williams <- zeroIsNA(williams, i = "proteins_corrected")
```
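As an optional check (our addition, not part of the published workflow), we can quantify the missing values that this step introduced with `nNA()` from `QFeatures`:

```r
## Summarise missingness in the corrected protein assay: nNA() reports
## global, feature-wise and sample-wise missing value counts
nNA(williams, i = "proteins_corrected")
```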
We then apply `logTransform()` on the `proteins_corrected` assay containing the protein data normalized by MaxQuant.
```r
(williams <- logTransform(williams, i = "proteins_corrected",
                          name = "proteins_log", base = 2))
```
We stored the results in a new assay called `proteins_log`.
The authors then further normalize the data by sequentially centering the column and row values to their median values. Centering columns to the median is readily available through `normalizeSCP()`.
```r
williams <- normalizeSCP(williams, i = "proteins_log",
                         name = "proteins_norm",
                         method = "center.median")
```
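A minimal sanity check (our addition): after median centering, all sample (column) medians of the normalized assay should be (close to) zero.

```r
## Column medians should all be (close to) zero after centering
summary(matrixStats::colMedians(assay(williams[["proteins_norm"]]),
                                na.rm = TRUE))
```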
The authors mention they perform row centering for each batch separately. This functionality is not available from the packages we rely on, so we'll perform this step manually. First, we extract the normalized assay along with the associated `colData` using `getWithColData()`.
```r
sceNorm <- getWithColData(williams, "proteins_norm")
```
Then, we build a for loop that iterates through each batch and centers the rows based on the median within each batch.
```r
## Iterate over each unique batch and center each row on its
## within-batch median
for (batch in unique(sceNorm$Batch)) {
    ind <- which(sceNorm$Batch == batch)
    rowMeds <- rowMedians(assay(sceNorm[, ind]), na.rm = TRUE)
    assay(sceNorm[, ind]) <- sweep(assay(sceNorm[, ind]),
                                   FUN = "-", STATS = rowMeds,
                                   MARGIN = 1)
}
```
Finally, we add the row-normalized assay back in the data set and add the relationships from the previous assay to this new assay.
```r
williams <- addAssay(williams, sceNorm, name = "proteins_bc")
williams <- addAssayLinkOneToOne(williams, from = "proteins_norm",
                                 to = "proteins_bc")
```
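Again as an optional check of our manual step (this snippet is our addition), the row medians computed within each batch should now be (close to) zero:

```r
## Within each batch, the row medians of the batch-corrected assay
## should be (close to) zero after the manual centering step
sapply(unique(sceNorm$Batch), function(batch) {
    ind <- which(sceNorm$Batch == batch)
    m <- assay(williams[["proteins_bc"]])[, ind]
    median(matrixStats::rowMedians(m, na.rm = TRUE), na.rm = TRUE)
})
```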
The authors remove proteins that have more than 30 % missing data. We use `filterNA()` and provide the threshold of maximum 30 % missing values.
```r
williams <- filterNA(williams, i = "proteins_bc", pNA = 0.3)
```
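We can verify the filter (a small check we added): the proportion of missing values per protein should now be at most 30 %.

```r
## Maximum proportion of missing values across proteins; should not
## exceed the 0.3 threshold
max(rowMeans(is.na(assay(williams[["proteins_bc"]]))))
```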
The authors next keep only proteins that are identified with at least 2 peptides. We can retrieve which proteins are made of several peptides from the `rowData` of the `peptides_corrected` assay. The peptides were mapped to the protein data through the `Leading.razor.protein` variable. All proteins that are present at least twice are kept.
```r
## Get `Leading.razor.protein` from the peptide `rowData`
prots <- rowData(williams[["peptides_corrected"]])$Leading.razor.protein
## Get all proteins that are present at least twice
prots <- unique(prots[duplicated(prots)])
## Keep only proteins that were kept after missing data filtering
prots <- prots[prots %in% rownames(williams)[["proteins_bc"]]]
head(prots)
```
We generated a vector with the proteins identified at least twice. We can now use it to subset the data set. Note that a `QFeatures` object can be seen as a three-order array: $features \times samples \times assays$. Hence, `QFeatures` supports three-order subsetting with `x[rows, columns, assays]`. The great advantage of this subsetting is that all peptides linked to the desired proteins are also retrieved thanks to the relationships stored between proteins and peptides.
```r
(williams <- williams[prots, , ])
```
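For illustration only (this snippet is our addition and not part of the replication), the same bracket interface also subsets samples and assays. For instance, we could keep only the CMK cells in the two last protein assays:

```r
## Subset samples (CMK cells) and assays in a single call; the empty
## first index keeps all features
williams[, williams$SampleType == "CMK",
         c("proteins_norm", "proteins_bc")]
```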
This step concludes the quantitative data processing. We update the `QFeatures` plot that provides an overview of the processing.
```r
plot(williams)
```
The authors used the processed data to perform two types of downstream analysis: single-cell correlation analysis and principal component analysis. We here provide the published figure that we aim to reproduce in this vignette.
```r
knitr::include_graphics("figs/williams2020-figure6.png", error = FALSE)
```
To run the downstream analysis, we only need the last assay generated during the quantitative data processing. We extract the assay as well as the associated sample annotations using `getWithColData()`.
```r
sce <- getWithColData(williams, "proteins_bc")
```
We first compute the correlation matrix, using Pearson correlation on pairwise complete observations so that missing values do not propagate.
```r
m <- assay(sce)
corMat <- cor(m, method = "pearson", use = "pairwise.complete.obs")
```
We use the `ComplexHeatmap` package to plot the heatmap. This package allows easy annotation of the plot, namely labeling each sample with its sample type and channel.
cols <- c(CMK = "green4", `K562` = "dodgerblue", `MOLM-14` = "red3") colAnnot <- columnAnnotation(Population = sce$SampleType, Channel = sce$Channel, col = list(Population = cols))
We can now plot the heatmap.
```r
Heatmap(corMat, name = "Pearson correlation",
        show_row_names = FALSE, show_column_names = FALSE,
        top_annotation = colAnnot)
```
Like in the figure from the article, we observe a positive correlation between samples from the same cell type. We also observe sub-clusters within the sample type clusters. While this is not shown in the original figure, we can clearly link this effect to TMT channel bias. This should be corrected for, but the poor experimental design causes sample type and TMT channel to be confounded, hence entangling TMT effects with biological effects.
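The confounding is apparent from the experimental design itself. Cross-tabulating sample type against TMT channel (our addition) shows how each cell type is restricted to its own set of channels:

```r
## Cross-tabulate cell type and TMT channel: the confounding shows as
## each sample type being acquired in specific channels
table(sce$SampleType, sce$Channel)
```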
PCA can directly be computed from a `SingleCellExperiment` object thanks to the `scater` package. We therefore use `runPCA()`. Note that we first impute the missing values with zeros, as PCA does not support missing values.
```r
## Impute missing values with zeros, as PCA does not support them
assay(sce)[is.na(assay(sce))] <- 0
sce <- runPCA(sce, exprs_values = 1)
```
We generate the PCA plots using `plotPCA()`, and use different colouring schemes to highlight different sample annotations.
```r
plotPCA(sce, colour_by = "SampleType")
```
The sample types can be differentiated on the PCA plot, although the CMK and MOLM-14 populations slightly overlap, which is not observed in the published work. Another small difference is that the first two PCs explain 15 % of the variance, while the authors report 22 % explained variance.
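The explained variance we report can be retrieved from the `percentVar` attribute that `runPCA()` stores alongside the reduced dimensions (a small snippet we added for completeness):

```r
## Percentage of variance explained by the first principal components
head(attr(reducedDim(sce, "PCA"), "percentVar"))
```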
We were able to reproduce the general results from the analysis by @Williams2020-ty. However, we noticed small discrepancies, indicating differences between the two workflows. Without a script or the intermediate results from the authors, we are unable to pinpoint which steps differ.
You can reproduce this vignette using Docker:
```sh
docker pull cvanderaa/scp_replication_docker:v1
docker run \
    -e PASSWORD=bioc \
    -p 8787:8787 \
    cvanderaa/scp_replication_docker:v1
```
Open your browser and go to http://localhost:8787. The USER is `rstudio` and the password is `bioc`. You can find the vignette in the `vignettes` folder.
See the website home page for more information.
The system details of the machine that built the vignette are:
```r
library(benchmarkme)
sd <- get_sys_details()
cat("Machine: ", sd$sys_info$sysname, " (", sd$sys_info$release, ")\n",
    "R version: R.", sd$r_version$major, ".", sd$r_version$minor,
    " (svn: ", sd$r_version$`svn rev`, ")\n",
    "RAM: ", round(sd$ram / 1E9, 1), " GB\n",
    "CPU: ", sd$cpu$no_of_cores, " core(s) - ", sd$cpu$model_name, "\n",
    sep = "")
```
The total time required to compile this vignette is:
```r
timing <- Sys.time() - timeStart
cat(timing[[1]], attr(timing, "units"))
```
The final `williams` object memory size is:
```r
format(object.size(williams), units = "GB")
```
```r
sessionInfo()
```
This vignette is distributed under a CC BY-SA licence.