Bram Burger and Marc Vaudel 2019-07-30
This notebook presents the usage of colvano to annotate a Volcano Plot.
This notebook was written in R version 3.6.1 using the following libraries:
More details on the session at the end.
Note: The libraries are assumed to be installed, please refer to the respective packages for installation procedures.
library(tidyr)
library(dplyr)
library(ggplot2)
library(scico)
library(grid)
library(gtable)
library(ggforce)
library(purrr)
library(gamlss)
library(limma)
theme_set(theme_bw(base_size = 13))
yeastColor <- scico(n = 1, palette = "berlin", begin = 0.15, end = 0.15)
upsColor <- scico(n = 1, palette = "berlin", begin = 0.85, end = 0.85)
The dataset used here was published by Ramus et al. (1), it is freely available through the ProteomeXchange (2) consortium via the PRIDE (3) partner repository under the accession number PXD001819. The summary table of the quantification results is available in the resources folder.
The protein group export of MaxQuant (4) from the original publication (1) is used here and contains the iBAQ abundance estimates for each protein group.
proteinGroupsInput <- read.table(file = "resources/mq.txt", header = T, stringsAsFactors = F, sep = "\t") %>%
filter(Majority.protein.IDs != "") %>%
mutate(row = row_number()) %>%
rename(
comparison = Original.comparison,
species = Species,
groupId = Majority.protein.IDs
) %>%
dplyr::select(
row, comparison, groupId, species, A1, A2, A3, B1, B2, B3
)
Briefly, the authors spiked a mix of proteins called the Universal Proteomics Standard (UPS) in a background of yeast proteins. The UPS proteins are spiked at different concentrations between samples while the yeast proteins stay the same. For example, 25 vs. 12.5 indicates that UPS proteins were spiked in 25 fmol per µg of yeast lysate in A and 12.5 fmol per µg of yeast lysate in B, hence the expected ratio is 2:1 and 1:1 for UPS and yeast proteins, respectively.
The number of protein groups with leading proteins from UPS or yeast are shown below.
table(proteinGroupsInput$species, proteinGroupsInput$comparison)
##
## 25Vs12.5 50Vs0.5 50Vs5
## UPS 47 48 48
## yeast 821 821 840
The columns with quantitative information are indicated as A1, A2, A3, B1, B2, and B3, where A and B represent the different conditions of the first column and 1, 2, and, 3 biological replicates. We can view these values by plotting the intensities of each experiment against each others.
# Gather the values to plot against each others' in two data frames and merge by protein group identifier
proteinGroupsInput %>%
gather(
A1, A2, A3, B1, B2, B3,
key = "sample",
value = "x"
) %>%
mutate(
x = log10(x),
conditionX = substr(sample, 1, 1),
replicateX = substr(sample, 2, 2)
) %>%
dplyr::select(
row, conditionX, replicateX, x
) -> plotDF1
proteinGroupsInput %>%
gather(
A1, A2, A3, B1, B2, B3,
key = "sample",
value = "y"
) %>%
mutate(
y = log10(y),
conditionY = substr(sample, 1, 1),
replicateY = substr(sample, 2, 2)
) %>%
dplyr::select(
row, species, comparison, conditionY, replicateY, y
) %>%
left_join(
plotDF1,
by = "row"
) %>%
filter(
!is.na(x) & !is.na(y)
) %>%
filter(
conditionX == "B" & conditionY == "A" | conditionX == conditionY & replicateX >= replicateY
) %>% mutate(
comparison = factor(comparison),
species = factor(species, levels = c("yeast", "UPS"))
) %>%
arrange(
species
) -> plotDF
levels(plotDF$species) <- c("Yeast", "UPS")
comparisonLevels <- levels(plotDF$comparison)
# Plot all samples for each comparison against each other
plotGrob <- NULL
for (n in 1:length(comparisonLevels)) {
comparisonLevel <- comparisonLevels[n]
plotDF %>%
filter(
comparison == comparisonLevel
) -> tempPlot
plot <- ggplot(
data = tempPlot
) +
geom_abline(
slope = 1,
intercept = 0,
linetype = "dotted"
) +
geom_point(
mapping = aes(x = x, y = y, col = species)
) +
scale_color_manual(
name = NULL,
values = c(yeastColor, upsColor)
) +
scale_x_continuous(
name = "Intensity [log10]",
position = "top"
) +
scale_y_continuous(
name = "Intensity [log10]",
position = "right"
) +
facet_grid(
conditionY + replicateY ~ conditionX + replicateX,
switch = "both"
) +
ggtitle(
label = comparisonLevel
) +
theme(
legend.position = "bottom",
strip.background = element_rect(fill = "grey90"),
panel.grid.minor = element_blank(),
plot.title = element_text(hjust = 1)
)
tempGrob <- ggplotGrob(plot)
# Remove plots below the diagonal
for (i in 2:6) {
for (j in 1:(i-1)) {
panelName <- paste("panel", i, j, sep = "-")
k <- grep(pattern = panelName, tempGrob$layout$name)
tempGrob$grobs[[k]] <- nullGrob()
}
}
if (is.null(plotGrob)) {
plotGrob <- tempGrob
} else {
plotGrob <- rbind(plotGrob, tempGrob, size = "first")
}
}
# Save as image that is not default size
png("proteomics_ups_files/scatter.png", width = 800, height = length(comparisonLevels) * 600)
grid.draw(plotGrob)
dummy <- dev.off()
Here, we can see that intensities are very homogeneous among replicates, where both yeast and UPS proteins distribute along the diagonal. When plotting A vs. B samples, the cloud of yeast proteins around the diagonal widens, and the UPS proteins stand out to the upper side. We can also notice that UPS proteins are missing in the 50 vs. 0.5 comparison in sample B1, but not others.
In the following, we plot the share of missing values in every experiment.
proteinGroupsInput %>%
gather(
A1, A2, A3, B1, B2, B3,
key = "sample",
value = "intensity"
) %>%
dplyr::select(
comparison, sample, species, intensity
) -> tempDF
tempDF %>%
group_by(
comparison, sample
) %>%
summarise(
nvalues = n()
) %>%
ungroup() %>%
mutate(
comparison = factor(comparison),
sample = factor(sample)
) -> nValuesDF
tempDF %>%
group_by(
comparison, sample, species
) %>%
summarise(
nMissing = sum(is.na(intensity))
) %>%
ungroup() %>%
mutate(
comparison = factor(comparison),
sample = factor(sample)
) %>%
left_join(
nValuesDF,
by = c("comparison", "sample")
) %>%
mutate(
shareMissing = round(
100 * nMissing / nvalues,
digits = 2
),
species = factor(species, levels = c("UPS", "yeast"))
) -> missingDF
levels(missingDF$species) <- c("UPS", "Yeast")
ggplot(data = missingDF) +
geom_col(
mapping = aes(x = sample, y = shareMissing, fill = species)
) +
facet_grid(
comparison ~ .
) +
scale_fill_manual(
name = NULL,
values = c(upsColor, yeastColor)
) +
scale_y_continuous(
name = "Share of missing values [%]"
) +
theme(
strip.background = element_rect(fill = "grey90"),
axis.title.x = element_blank(),
panel.grid.major.x = element_blank()
)
As expected from the above, we observe that sample B1 stands out and presents a much higher prevalence of missing values. We can also observe that sample B presents a higher prevalence of missing values in comparison 50 vs. 0.5, mainly driven by the UPS proteins.
Here, we retain only protein groups quantified in at least two replicates in each condition for each comparison.
proteinGroupsInput %>%
mutate(
nMissingA = as.numeric(
pmap(
list(
A1, A2, A3
),
~sum(is.na(c(..1, ..2, ..3)))
)
),
nMissingB = as.numeric(
pmap(
list(
B1, B2, B3
),
~sum(is.na(c(..1, ..2, ..3)))
)
)
) -> proteinGroupsInput
proteinGroupsInput %>%
filter(
nMissingA > 1 | nMissingB > 1
) -> excludedProteins
nExcluded <- as.data.frame(table(excludedProteins$comparison))
names(nExcluded) <- c("comparison", "nExcluded")
nTotal <- as.data.frame(table(proteinGroupsInput$comparison))
names(nTotal) <- c("comparison", "nTotal")
nExcluded <- merge(nExcluded, nTotal, by = "comparison")
nExcluded$shareExcludedPercent <- round(100 * nExcluded$nExcluded / nExcluded$nTotal, digits = 2)
nExcluded <- nExcluded[, names(nExcluded) != "nTotal"]
print(nExcluded)
## comparison nExcluded shareExcludedPercent
## 1 25Vs12.5 9 1.04
## 2 50Vs0.5 58 6.67
## 3 50Vs5 30 3.38
proteinGroupsInput %>%
filter(
nMissingA <= 1 & nMissingB <= 1
) -> proteinGroupsInput
In the following, we plot the intensity distribution for every sample. Note that we focus on proteins with no missing values.
# Gather the values to plot
proteinGroupsInput %>%
filter(
nMissingA == 0 & nMissingB ==0
) %>%
gather(
A1, A2, A3, B1, B2, B3,
key = "sample",
value = "intensity"
) %>%
filter(
!is.na(intensity)
) %>%
mutate(
intensity = log10(intensity),
condition = substr(sample, 1, 1),
replicate = substr(sample, 2, 2)
) %>%
dplyr::select(
comparison, condition, replicate, intensity
) -> plotDF
plotDF$comparison <- factor(plotDF$comparison)
# Make a sina + box plot
ggplot(data = plotDF) +
geom_sina(
mapping = aes(x = 0, y = intensity),
col = "grey80"
) +
facet_grid(
comparison ~ condition + replicate
) +
geom_boxplot(
mapping = aes(x = 0, y = intensity),
outlier.shape = NA
) +
scale_x_continuous(
breaks = c(0)
) +
scale_y_continuous(
name = "Intensity [log10]"
) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.background = element_rect(fill = "grey90")
)
plotDF %>%
group_by(
comparison, condition, replicate
) %>%
summarise(
medianIntensity = round(median(intensity), digits = 2)
) %>%
ungroup() -> summaryDF
summaryDF
## # A tibble: 18 x 4
## comparison condition replicate medianIntensity
## <fct> <chr> <chr> <dbl>
## 1 25Vs12.5 A 1 7.61
## 2 25Vs12.5 A 2 7.63
## 3 25Vs12.5 A 3 7.61
## 4 25Vs12.5 B 1 7.58
## 5 25Vs12.5 B 2 7.6
## 6 25Vs12.5 B 3 7.61
## 7 50Vs0.5 A 1 7.68
## 8 50Vs0.5 A 2 7.67
## 9 50Vs0.5 A 3 7.67
## 10 50Vs0.5 B 1 7.67
## 11 50Vs0.5 B 2 7.67
## 12 50Vs0.5 B 3 7.68
## 13 50Vs5 A 1 7.64
## 14 50Vs5 A 2 7.63
## 15 50Vs5 A 3 7.63
## 16 50Vs5 B 1 7.59
## 17 50Vs5 B 2 7.62
## 18 50Vs5 B 3 7.63
In this plot, we can see that the distribution of iBAQ intensities is homogeneous between the samples. Minor deviations can however be observed, notably for sample B replicate 1 where the median of instensities is lower than the other samples.
The higher prevalence of missing values and lower intensity of proteins quantified in all experiments hints at a lower loading of sample B1. In order to correct for loading differences, we standardize the intensities per sample. Note that we use here a very simple model, with a Gaussian distribution of the logarithm base 10 of the intensities and no covariates.
comparisons <- unique(proteinGroupsInput$comparison)
for (condition in c("A", "B")) {
for (replicate in 1:3) {
sample <- paste0(condition, replicate)
sampleStandardized <- paste0(sample, "_standardized")
for (comparison in comparisons) {
is <- proteinGroupsInput$comparison == comparison
valuesDF <- data.frame(
x = 0,
y = log10(proteinGroupsInput[[sample]][is]),
nMissing = proteinGroupsInput[is, "nMissingA"] + proteinGroupsInput[is, "nMissingB"],
stringsAsFactors = F
)
trainingDF <- valuesDF %>%
filter(
!is.na(y) & nMissing == 0
)
model <- gamlss(formula = y ~ x,
family = NO,
data = trainingDF)
scaledValues <- centiles.pred(obj = model, xname = "x", xvalues = valuesDF$x, yval = valuesDF$y, type = "z-scores")
proteinGroupsInput[[sampleStandardized]][is] <- scaledValues
}
}
}
## GAMLSS-RS iteration 1: Global Deviance = 1429.405
## GAMLSS-RS iteration 2: Global Deviance = 1429.405
## GAMLSS-RS iteration 1: Global Deviance = 1508.967
## GAMLSS-RS iteration 2: Global Deviance = 1508.967
## GAMLSS-RS iteration 1: Global Deviance = 1539.353
## GAMLSS-RS iteration 2: Global Deviance = 1539.353
## GAMLSS-RS iteration 1: Global Deviance = 1432.382
## GAMLSS-RS iteration 2: Global Deviance = 1432.382
## GAMLSS-RS iteration 1: Global Deviance = 1517.012
## GAMLSS-RS iteration 2: Global Deviance = 1517.012
## GAMLSS-RS iteration 1: Global Deviance = 1527.815
## GAMLSS-RS iteration 2: Global Deviance = 1527.815
## GAMLSS-RS iteration 1: Global Deviance = 1427.782
## GAMLSS-RS iteration 2: Global Deviance = 1427.782
## GAMLSS-RS iteration 1: Global Deviance = 1515.392
## GAMLSS-RS iteration 2: Global Deviance = 1515.392
## GAMLSS-RS iteration 1: Global Deviance = 1521.046
## GAMLSS-RS iteration 2: Global Deviance = 1521.046
## GAMLSS-RS iteration 1: Global Deviance = 1418.8
## GAMLSS-RS iteration 2: Global Deviance = 1418.8
## GAMLSS-RS iteration 1: Global Deviance = 1469.824
## GAMLSS-RS iteration 2: Global Deviance = 1469.824
## GAMLSS-RS iteration 1: Global Deviance = 1534.622
## GAMLSS-RS iteration 2: Global Deviance = 1534.622
## GAMLSS-RS iteration 1: Global Deviance = 1421.352
## GAMLSS-RS iteration 2: Global Deviance = 1421.352
## GAMLSS-RS iteration 1: Global Deviance = 1476.578
## GAMLSS-RS iteration 2: Global Deviance = 1476.578
## GAMLSS-RS iteration 1: Global Deviance = 1532.307
## GAMLSS-RS iteration 2: Global Deviance = 1532.307
## GAMLSS-RS iteration 1: Global Deviance = 1418.785
## GAMLSS-RS iteration 2: Global Deviance = 1418.785
## GAMLSS-RS iteration 1: Global Deviance = 1476.628
## GAMLSS-RS iteration 2: Global Deviance = 1476.628
## GAMLSS-RS iteration 1: Global Deviance = 1531.049
## GAMLSS-RS iteration 2: Global Deviance = 1531.049
We can view the result of the standardization in scatter plots.
proteinGroupsInput %>%
gather(
A1_standardized, A2_standardized, A3_standardized, B1_standardized, B2_standardized, B3_standardized,
key = "sample",
value = "y"
) %>%
mutate(
sample = substr(sample, start = 1, stop = 2)
) %>%
dplyr::select(
row, sample, y
) -> plotDFY
proteinGroupsInput %>%
gather(
A1, A2, A3, B1, B2, B3,
key = "sample",
value = "x"
) %>%
left_join(
plotDFY,
by = c("sample", "row")
) %>%
filter(
!is.na(x) & !is.na(y)
) %>%
mutate(
x = log10(x),
condition = substr(sample, 1, 1),
replicate = substr(sample, 2, 2),
species = factor(species, levels = c("yeast", "UPS"))
) %>%
dplyr::select(
row, comparison, species, condition, replicate, x, y
) %>%
mutate(
comparison = factor(comparison),
) %>%
arrange(
species
) -> plotDF
levels(plotDF$species) <- c("Yeast", "UPS")
ggplot(data = plotDF) +
geom_point(
mapping = aes(x = x, y = y, col = species)
) +
scale_color_manual(
name = NULL,
values = c(yeastColor, upsColor)
) +
scale_x_continuous(
name = "Intensity [log10]"
) +
scale_y_continuous(
name = "Standardized Intensity"
) +
facet_grid(
condition + replicate ~ comparison
) +
theme(
legend.position = "top",
strip.background = element_rect(fill = "grey90")
)
And then the Sina and Box plots of the standardized values.
# Gather the values to plot
proteinGroupsInput %>%
filter(
nMissingA == 0 & nMissingB ==0
) %>%
gather(
A1_standardized, A2_standardized, A3_standardized, B1_standardized, B2_standardized, B3_standardized,
key = "sample",
value = "intensity"
) %>%
filter(
!is.na(intensity)
) %>%
mutate(
condition = substr(sample, 1, 1),
replicate = substr(sample, 2, 2)
) %>%
dplyr::select(
comparison, condition, replicate, intensity
) -> plotDF
plotDF$comparison <- factor(plotDF$comparison)
# Make a sina + box plot
ggplot(data = plotDF) +
geom_sina(
mapping = aes(x = 0, y = intensity),
col = "grey80"
) +
facet_grid(
comparison ~ condition + replicate
) +
geom_boxplot(
mapping = aes(x = 0, y = intensity),
outlier.shape = NA
) +
scale_x_continuous(
breaks = c(0)
) +
scale_y_continuous(
name = "Standardized Intensity"
) +
theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
strip.background = element_rect(fill = "grey90")
)
plotDF %>%
group_by(
comparison, condition, replicate
) %>%
summarise(
medianIntensity = round(median(intensity), digits = 2)
) %>%
ungroup() -> summaryDF
summaryDF
## # A tibble: 18 x 4
## comparison condition replicate medianIntensity
## <fct> <chr> <chr> <dbl>
## 1 25Vs12.5 A 1 -0.03
## 2 25Vs12.5 A 2 -0.02
## 3 25Vs12.5 A 3 -0.04
## 4 25Vs12.5 B 1 -0.06
## 5 25Vs12.5 B 2 -0.04
## 6 25Vs12.5 B 3 -0.03
## 7 50Vs0.5 A 1 0.02
## 8 50Vs0.5 A 2 0
## 9 50Vs0.5 A 3 0.01
## 10 50Vs0.5 B 1 0
## 11 50Vs0.5 B 2 -0.01
## 12 50Vs0.5 B 3 0
## 13 50Vs5 A 1 0
## 14 50Vs5 A 2 -0.01
## 15 50Vs5 A 3 -0.02
## 16 50Vs5 B 1 -0.05
## 17 50Vs5 B 2 -0.04
## 18 50Vs5 B 3 -0.03
The distributions of intensities are now aligned.
In the following, we test the significance of the difference between A and B for each protein.
comparisonLevels <- unique(proteinGroupsInput$comparison)
limmaDF <- NULL
for (comparisonLevel in comparisonLevels) {
comparisonData <- proteinGroupsInput %>%
filter(
comparison == comparisonLevel
)
colNames <- c("A1", "A2", "A3", "B1", "B2", "B3")
rowNames <- comparisonData$groupId
valuesMatrix <- as.matrix(comparisonData[, paste0(colNames, "_standardized")])
colnames(valuesMatrix) <- colNames
rownames(valuesMatrix) <- rowNames
conditions <- substr(colNames, start = 1, stop = 1)
conditions <- factor(conditions, levels = c("A", "B"))
design <- model.matrix(~ conditions)
fit <- lmFit(
object = valuesMatrix,
design = design
)
fit <- eBayes(fit)
comparisonResult <- topTable(fit, n = length(rowNames))
comparisonDF <- as.data.frame(comparisonResult)
comparisonDF$groupId <- rownames(comparisonResult)
comparisonDF$comparison <- comparisonLevel
comparisonDF %>%
left_join(
comparisonData %>%
dplyr::select(groupId, species),
by = "groupId"
) -> comparisonDF
if (is.null(limmaDF)) {
limmaDF <- comparisonDF
} else {
limmaDF <- rbind(limmaDF, comparisonDF)
}
}
## Removing intercept from test coefficients
## Removing intercept from test coefficients
## Removing intercept from test coefficients
These results are typically inspected in a Volcano plot, where the inverse of the log-transformed adjusted p-value is plotted against the difference between the means of the logged (and here standardized) intensities.
limmaDF %>%
mutate(
species = factor(species, c("yeast", "UPS")),
pLog = -log10(adj.P.Val)
) %>%
arrange(
species
) -> plotDF
levels(plotDF$species) <- c("Yeast", "UPS")
volcanoPlot <- ggplot(data = plotDF) +
geom_point(
mapping = aes(x = logFC, y = pLog, col = species)
) +
scale_color_manual(
name = NULL,
values = c(yeastColor, upsColor)
) +
scale_y_continuous(
name = "Adjusted P-value [-log10]"
) +
scale_x_continuous(
name = "Mean Difference [log10 standardized]"
) +
facet_grid(
comparison ~ .
) +
theme(
legend.position = "top",
strip.background = element_rect(fill = "grey90")
)
# Save as image that is not default size
png("proteomics_ups_files/volcano.png", width = 800, height = length(comparisonLevels) * 600)
grid.draw(volcanoPlot)
dummy <- dev.off()
1) Spiked proteomic standard dataset for testing label-free quantitative software and statistical methods 2) ProteomeXchange provides globally coordinated proteomics data submission and dissemination 3) PRIDE: the proteomics identifications database 4) MaxQuant enables high peptide identification rates, individualized p.p.b. range mass accuracies and proteome-wide protein quantification
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Norwegian Bokmål_Norway.1252
## [2] LC_CTYPE=Norwegian Bokmål_Norway.1252
## [3] LC_MONETARY=Norwegian Bokmål_Norway.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=Norwegian Bokmål_Norway.1252
##
## attached base packages:
## [1] parallel splines grid stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] limma_3.40.6 gamlss_5.1-4 nlme_3.1-140
## [4] gamlss.dist_5.1-4 MASS_7.3-51.4 gamlss.data_5.1-4
## [7] purrr_0.3.2 ggforce_0.2.2 gtable_0.3.0
## [10] scico_1.1.0 ggplot2_3.2.0 dplyr_0.8.3
## [13] tidyr_0.8.3
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.8 reshape2_1.4.3
## [4] lattice_0.20-38 colorspace_1.4-1 vctrs_0.2.0
## [7] htmltools_0.3.6 yaml_2.2.0 utf8_1.1.4
## [10] survival_2.44-1.1 rlang_0.4.0 pillar_1.4.2
## [13] glue_1.3.1 withr_2.1.2 tweenr_1.0.1
## [16] plyr_1.8.4 stringr_1.4.0 munsell_0.5.0
## [19] evaluate_0.14 labeling_0.3 knitr_1.23
## [22] fansi_0.4.0 Rcpp_1.0.2 scales_1.0.0
## [25] backports_1.1.4 farver_1.1.0 digest_0.6.20
## [28] stringi_1.4.3 polyclip_1.10-0 cli_1.1.0
## [31] tools_3.6.1 magrittr_1.5 lazyeval_0.2.2
## [34] tibble_2.1.3 crayon_1.3.4 pkgconfig_2.0.2
## [37] zeallot_0.1.0 Matrix_1.2-17 assertthat_0.2.1
## [40] rmarkdown_1.14 R6_2.4.0 compiler_3.6.1
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.