In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

Load dependencies

Load the required libraries.

library(camprotR)
library(MSnbase)
library(ggplot2)
library(tidyr)
library(dplyr)
library(here)
library(limma)
library(uniprotREST)

Preamble

The correct approach for statistical testing for changes in phosphorylation will depend on the details of the experimental design. Here, we have quantified phospho and unmodified peptides, so we can identify changes in phosphorylation which are not explained by changes in the unmodified protein, e.g changes in the proportion of the protein that is phosphorylated.

To acheive this, we will jointly model the phospho and unmodified peptides which overlap each phosphosite, using a linear model with an interaction term.

Input data

We start by reading in the MSnSets created in the previous notebook QC PSM-level quantification, filtering and summarisation to protein-level abundance

total <- readRDS(here('results/total.rds'))
phospho_sites <- readRDS(here('results/phospho_sites.rds'))

Below, we identify the total peptides that overlap each phosphosite. Note that each phosphosite may intersect multiple total peptides (due to missed cleavages) and each total peptide may intersect multiple phosphosites (due to multiple phosphosites per peptide).

phospho_f <- phospho_sites %>% fData() %>% tibble::rownames_to_column('ID')
total_f <- total %>% fData() %>% tibble::rownames_to_column('ID')

phospho_peptide_to_total_peptides <- vector('list', length=length(phospho_f$ID))
names(phospho_peptide_to_total_peptides) <- phospho_f$ID
n <- 0
for(phospho_id in phospho_f$ID){
  row <- phospho_f %>% filter(ID==phospho_id)

  n <- n + 1
  ptm_positions <- as.numeric(strsplit(row$ptm_position, split='\\|')[[1]])

  total_peptide_ids <- total_f %>%
    filter(Master.Protein.Accessions %in% row$Master.Protein.Accessions,
           peptide_start<=min(ptm_positions),
           peptide_end>=max(ptm_positions)) %>%
    pull(ID)  

  phospho_peptide_to_total_peptides[[row$ID]] <- total_peptide_ids
}

Now, how many overlapping features. Note that for the phospho and total to be considered intersecting, there must be at least one total peptide which overlaps the phosphosite. Out of the r length(phospho_f$ID) phosphosites, only r sum(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0)) have at least one overlapping total peptide.

print(length(phospho_f$ID))
print(table(sapply(phospho_peptide_to_total_peptides, function(x) length(x)>0)))
print(table(sapply(phospho_peptide_to_total_peptides, length)))

limma requires a single row per feature with a columns per sample. Here, the phospho and total are separate samples so we need to combine them into a single row.

combined_exprs <- phospho_peptide_to_total_peptides %>% names() %>% lapply(function(x){
  total_peptides <- phospho_peptide_to_total_peptides[[x]]

  if(length(total_peptides)>0) {
    exprs_values <- c(exprs(phospho_sites[x,]),
                      colSums(exprs(total[total_peptides,]), na.rm=TRUE))
    exprs_values <- unname(exprs_values)
    return(exprs_values)
  } else { return(NULL) }
})
names(combined_exprs) <- names(phospho_peptide_to_total_peptides)

Create the completed matrix of phospho and total.

combined_exprs <- combined_exprs[sapply(combined_exprs, function(x) !is.null(x))]
all_combined_exprs <- bind_rows(combined_exprs) %>% t() %>% as.matrix

rownames(all_combined_exprs) <- names(combined_exprs)

colnames(all_combined_exprs) <- c(paste0(colnames(phospho_sites), '_phospho'),
                                  paste0(colnames(total), '_total'))

Inspect the combined quantification matrix.

head(all_combined_exprs) #

We also need to create feature data and phenotype data (experimental information) for our combined quantification matrix. From these, we can then create an MSnSet to hold all the combined quantification data.

all_combined_fdata <- fData(phospho_sites[rownames(all_combined_exprs),])[
  ,c('Master.Protein.Accessions', 'ptm_position')]
all_combined_fdata$n_total <- sapply(rownames(all_combined_exprs),
                                     function(x) length(phospho_peptide_to_total_peptides[[x]]))

all_combined_pdata <- rbind((pData(phospho_sites) %>% mutate(type='phospho')),
                            (pData(total) %>% mutate(type='total')))
rownames(all_combined_pdata) <- colnames(all_combined_exprs)
all_combined_res <- MSnSet(as.matrix.data.frame(all_combined_exprs),
                           all_combined_fdata,
                           all_combined_pdata)

head(fData(all_combined_res))
pData(all_combined_res)

Below, we subset to the first 2 'conditions', where condition 1 = tags 1-4 (x1 phospho; x1 total) and condition 2 = tags 5-7 (x6 phospho; x2 total). The ground truth for the difference between condition 2 and condition 1 is therefore a 3-fold increase in phosphorylation for the yeast proteins and reduced phosphorylation for the human proteins. For the human proteins, given the amounts of spike in yeast and balancing human proteins to make up the labelled material, we expect a 30% drop in phosphorylation.

pairwise_comparison_combined_res <- all_combined_res[,pData(all_combined_res)$condition %in% 1:2]
pData(pairwise_comparison_combined_res)

Run limma

Below, we perform the limma analysis. First though, some clarification on the model we're using.

We have two experimental variables:

type: phospho or total
condition: 1x or 2x spike in

type <- factor(pData(pairwise_comparison_combined_res)$type, level=c('total', 'phospho'))
condition <- factor(pData(pairwise_comparison_combined_res)$condition, levels=1:2)
print(paste(type, condition))

A simple additive model would estimate independent effects for the type and condition variables

# model without interaction term
print(model.matrix(~type+condition))

If we want to explore a possible combinatorial effect, we need to also include a type:condition interaction term. In our case, this captures the difference in abundance for phosphopeptides between the conditions, which is specific for phosphopeptides and not seen in the unmodified peptides.

# model with an interaction term
print(model.matrix(~type+condition+type:condition))

Below, we run the linear modeling. For more details about limma, see the Differential abundance testing for TMT proteomics notebook.

dat <- exprs(pairwise_comparison_combined_res) %>% log(base=2)
study.design <- model.matrix(~type*condition)

fit <- lmFit(dat, study.design)
fit <- eBayes(fit, trend=TRUE)

limma.results <- topTable(fit, coef = colnames(fit$coefficients)[4], n = Inf, confint=TRUE)
limma.results$sigma <- fit$sigma

We would like to add information about the species to make sure the fold-changes are in the expected direction. We'll use the uniprotREST package to query

limma.results$protein <- rownames(limma.results) %>%
  strsplit(split='_') %>%
  sapply('[[', 1) 

species_res <-  uniprot_map(
  ids = unique(limma.results$protein),
  from = "UniProtKB_AC-ID",
  to = "UniProtKB",
  fields = "organism_name") %>%
  mutate(species=recode(Organism,
                        "Saccharomyces cerevisiae (strain ATCC 204508 / S288c) (Baker's yeast)"="S.cerevisiae",
                        "Homo sapiens (Human)"="H.sapiens"))

Below, we plot the limma results using a volcano plot, with two panels, one for each species.

limma.results %>%
  merge(species_res, by.x='protein', by.y='From') %>%
  ggplot(aes(logFC, -log10(P.Value), fill=adj.P.Val<0.01)) +
  geom_point(pch=21, size=2, colour='grey70') +
  facet_wrap(~species) +
  theme_camprot(base_size=15, base_family='sans', border=FALSE) +
  theme(strip.background=element_blank()) +
  xlab('Log2 fold change') +
  ylab('-log10(p-value)') +
  scale_fill_manual(values=c('grey90', get_cat_palette(2)[2]), name='FDR < 1%')

The direction of change is as expected (reduced for human, increased for yeast), but only a subset of the fold-changes reach a 1% FDR threshold for significance. The fold-changes are appoximately what we would expect too: human phosphosites should be reduced by 30% = r round(log2(0.7), 2) on a log2 scale and the yeast phosphosites should be increased by 3-fold = r round(log2(3), 2) on a log2 scale.

MRCToxBioinformatics/Proteomics_data_analysis documentation built on July 7, 2024, 6:44 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

MRCToxBioinformatics/Proteomics_data_analysis
Basic proteomics data analysis tutorials

In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

Load dependencies

Preamble

Input data

Run limma

R Package Documentation

Browse R Packages

We want your feedback!

MRCToxBioinformatics/Proteomics_data_analysis Basic proteomics data analysis tutorials

In MRCToxBioinformatics/Proteomics_data_analysis: Basic proteomics data analysis tutorials

Load dependencies

Preamble

Input data

Run limma

R Package Documentation

Browse R Packages

We want your feedback!

MRCToxBioinformatics/Proteomics_data_analysis
Basic proteomics data analysis tutorials