knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

After analyzing your samples on a mass spectrometer, you want to find out if there are interesting patterns in the data. But the first challenge is how do you get the data from the files that your machine produces into R?

In the following, I will describe several ways of importing data from MaxQuant. MaxQuant is a popular tool for identifying, integrating, and combining MS peaks to derive peptide and protein intensities. MaxQuant produces serveral output files including proteinGroups.txt. It is usually a tab separated table with a lot of different columns, which can make it difficult to not get overwhelmed with information.

The most important columns that every proteinGroups.txt file contains are

Our goal is to turn this complicated matrix into a useable matrix or a SummarizedExperiment object. There are several ways to achieve this, which I will explain in greater detail:

  1. Use the base R functions (read.delim() and [<-) to read in the data
  2. Use the tidyverse to load the file and turn it into a useable object
  3. Use the proteus package and the readProteinGroups() function
  4. Use the DEP package and the import_MaxQuant() function

This package comes with an example file that contains the LFQ data from a BioID experiment in Drosophila melanogaster. 11 different Palmitoyltransferases (short DHHC) were tagged with a promiscuous biotin ligase and all biotinylated proteins were enriched and identified using label-free mass spectrometry. The conditions are named after the tagged DHHC and the negetive control condition is called S2R for the cell line. Each condition was measured in triplicates, which means that there are a total of 36 conditions. To make the file smaller, I provide a reduced data set which only contains the first 122 rows of the data.

Base R

The example file is located in

system.file("extdata/proteinGroups.txt", 
            package = "proDD", mustWork = TRUE)

In this specific file, all spaces have been replaced with dots. This is an example how each output file from MaxQuant slightly differs. This can make it difficult to write a generic import function. Instead I will first demonstrate the most general approach which is to simply use the base R tools for loading the data and turning it into useful objects.

The first step is to load the full table.

full_data <- read.delim(
    system.file("extdata/proteinGroups.txt", 
                package = "proDD", mustWork = TRUE),
    stringsAsFactors = FALSE
)

head(colnames(full_data))

Next, I create a matrix of the intensity data, where each sample is a column and each protein group is a row.

# I use a regular expression (regex) to select the intensity columns
intensity_colnames <- colnames(full_data)[grepl("^LFQ\\.intensity\\.", 
                                                colnames(full_data))]

# New data set which only contains the intensity columns
data <- as.matrix(full_data[, intensity_colnames])
colnames(data) <- sub("^LFQ\\.intensity\\.", "", intensity_colnames)
data[data == 0] <- NA
data <- log2(data)
data[1:7, 1:6]
rownames(data) <- full_data$Protein.IDs

In the next step I will create an annotation data.frame that contains information on the sample name, the condition and the replicate.

annotation_df <- data.frame(
    Condition = sub("\\.\\d+", "", sub("^LFQ\\.intensity\\.", 
                                       "", intensity_colnames)),
    Replicate = as.numeric(sub("^LFQ\\.intensity\\.[[:alnum:]]+\\.", 
                               "", intensity_colnames)),
    stringsAsFactors = FALSE, row.names = colnames(data)
)
head(annotation_df)

Optionally, we can turn this into a SummarizedExperiment or MSnSet object

library(SummarizedExperiment)
library(MSnbase)
library(SummarizedExperiment)
se <- SummarizedExperiment(SimpleList(LFQ=data), colData=annotation_df)
rowData(se) <- full_data[, c("Only.identified.by.site", 
                             "Reverse", "Potential.contaminant")]
se
library(MSnbase)

fData <- AnnotatedDataFrame(full_data[, c("Only.identified.by.site", 
                                 "Reverse", "Potential.contaminant")])
rownames(fData) <- rownames(data)
ms <- MSnSet(data, pData=AnnotatedDataFrame(annotation_df), fData=fData)
ms

Tidyverse

The tidyverse is a set of coherent R packages that provide many useful functions for common data analysis tasks. It replicates many of the functionalities already available in base R packages, but learns from its mistakes and avoids some of the suprising behaviors. For example strings are never automatically converted to factors. Another popular feature in the tidyverse is the pipe operator (%>%) that makes it easy to chain complex transformations.

library(dplyr)
library(stringr)
library(readr)
library(tidyr)
library(tibble)
library(dplyr)
library(stringr)
library(readr)
library(tidyr)
library(tibble)
# Or short 
# library(tidyverse)

I first load the full data file

# The read_tsv function works faster and more reliable than read.delim
# But it sometimes needs help to identify the right type for each column,
# because it looks only at the first 1,000 elements. 
# I do that here for the Reverse column.
full_data <- read_tsv(
    system.file("extdata/proteinGroups.txt", 
                package = "proDD", mustWork = TRUE),
    col_types = cols(Reverse = col_character())
)

full_data

Next, I create a tidy version of the data set. I pipe (%>%) the results from each transformation to the next transformation, to first select the columns of interest, reshape (gather) the dataset from wide to long format, and lastly create new columns with mutate.

tidy_data <- full_data %>%
    dplyr::select(ProteinID=Protein.IDs, starts_with("LFQ.intensity.")) %>%
    gather(Sample, Intensity, starts_with("LFQ.intensity.")) %>%
    mutate(Condition = str_match(Sample, 
                 "LFQ\\.intensity\\.([[:alnum:]]+)\\.\\d+")[,2]) %>%
    mutate(Replicate = as.numeric(str_match(Sample, 
                 "LFQ\\.intensity\\.[[:alnum:]]+\\.(\\d+)")[,2])) %>%
    mutate(SampleName = paste0(Condition, ".", Replicate))

tidy_data

Using the tidy data, I create the annotation data frame and the data matrix.

data <- tidy_data %>%
    mutate(Intensity = ifelse(Intensity == 0, NA, log2(Intensity))) %>%
    select(ProteinID, SampleName, Intensity) %>%
    spread(SampleName, Intensity) %>%
    column_to_rownames("ProteinID") %>%
    as.matrix()

annotation_df <- tidy_data %>%
    select(SampleName, Condition, Replicate) %>%
    distinct() %>%
    arrange(Condition, Replicate) %>%
    as.data.frame() %>%
    column_to_rownames("SampleName")

Optionally, we can again turn this into a SummarizedExperiment or MSnSet object

library(SummarizedExperiment)
se <- SummarizedExperiment(SimpleList(LFQ=data), colData=annotation_df)
rowData(se) <- full_data[, c("Only.identified.by.site", 
                             "Reverse", "Potential.contaminant")]
se
library(MSnbase)

fData <- AnnotatedDataFrame(full_data[, c("Only.identified.by.site", 
                                 "Reverse", "Potential.contaminant")])
rownames(fData) <- rownames(data)
ms <- MSnSet(data, pData=AnnotatedDataFrame(annotation_df), fData=fData)
ms

Proteus

Proteus is an R package specifically designed for the analysis of MaxQuant output. But as the package is not yet on CRAN, I cannot run the code

library(proteus)

To load the data I use the readProteinGroups() function. It needs a vector of all column names that contain the intensity values, which are stored in the meta data frame.

meta <- data.frame(
    sample = c("LFQ.intensity.CG1407.01", "LFQ.intensity.CG1407.02",  "LFQ.intensity.CG1407.03",  "LFQ.intensity.CG4676.01",  "LFQ.intensity.CG4676.02", "LFQ.intensity.CG4676.03", "LFQ.intensity.CG51963.01", "LFQ.intensity.CG51963.02", "LFQ.intensity.CG51963.03","LFQ.intensity.CG5620A.01", "LFQ.intensity.CG5620A.02", "LFQ.intensity.CG5620A.03", "LFQ.intensity.CG5620B.01","LFQ.intensity.CG5620B.02", "LFQ.intensity.CG5620B.03", "LFQ.intensity.CG5880.01", "LFQ.intensity.CG5880.02",  "LFQ.intensity.CG5880.03",  "LFQ.intensity.CG6017.01",  "LFQ.intensity.CG6017.02", "LFQ.intensity.CG6017.03", "LFQ.intensity.CG6618.01",  "LFQ.intensity.CG6618.02",  "LFQ.intensity.CG6618.03",  "LFQ.intensity.CG6627.01", "LFQ.intensity.CG6627.02", "LFQ.intensity.CG6627.03",  "LFQ.intensity.CG8314.01", "LFQ.intensity.CG8314.02",  "LFQ.intensity.CG8314.03", "LFQ.intensity.GsbPI.001", "LFQ.intensity.GsbPI.002",  "LFQ.intensity.GsbPI.003",  "LFQ.intensity.S2R.01", "LFQ.intensity.S2R.02", "LFQ.intensity.S2R.03"),
   condition = c("CG1407", "CG1407", "CG1407", "CG4676", "CG4676", "CG4676", "CG51963", "CG51963", "CG51963", "CG5620A", "CG5620A", "CG5620A", "CG5620B", "CG5620B", "CG5620B", "CG5880", "CG5880", "CG5880", "CG6017", "CG6017", "CG6017", "CG6618", "CG6618", "CG6618", "CG6627", "CG6627", "CG6627", "CG8314", "CG8314", "CG8314", "GsbPI", "GsbPI", "GsbPI", "S2R", "S2R", "S2R" ),
  stringsAsFactors = FALSE
)


head(meta)

# The columns in the example file are separated with a dot instead of a space
data_col_names <- list(protein = "Majority.protein.IDs", reverse = "Reverse", 
                       contaminant = "Potential.contaminant")

proteus_data <- readProteinGroups(
            file = system.file("extdata/proteinGroups.txt",
            package = "proDD", mustWork = TRUE),
            meta = meta, measure.cols = setNames(meta$sample, meta$sample),
            data.cols = data_col_names)

After loading the data I turn the proteusData object into a useable matrix and create the annotation data frame.

data <- proteus_data$tab
data <- log2(data)
annotation_df <- proteus_data$metadata
rownames(annotation_df) <- annotation_df$sample
annotation_df$sample <- NULL

Optionally, we can again turn this into a SummarizedExperiment or MSnSet object

library(SummarizedExperiment)
se <- SummarizedExperiment(SimpleList(LFQ=data), colData=annotation_df)
se
library(MSnbase)

fData <- AnnotatedDataFrame(data.frame(row.names = rownames(data)))
ms <- MSnSet(data, pData=AnnotatedDataFrame(annotation_df), fData=fData)
ms

DEP

DEP has a similar motivation as proteus. It provides helper functions to impute missing values and makes it easy to call limma on the result.

To load the data, we again need to provide all the column names of the intensity values. I then call the import_MaxQuant() function that directly creates a SummarizedExperiment object.

library(DEP)

full_data <- read.delim(
    system.file("extdata/proteinGroups.txt", 
                package = "proDD", mustWork = TRUE),
    stringsAsFactors = FALSE
)


exp_design <- data.frame(
    label =c("LFQ.intensity.CG1407.01", "LFQ.intensity.CG1407.02",  "LFQ.intensity.CG1407.03",  "LFQ.intensity.CG4676.01",  "LFQ.intensity.CG4676.02", "LFQ.intensity.CG4676.03", "LFQ.intensity.CG51963.01", "LFQ.intensity.CG51963.02", "LFQ.intensity.CG51963.03","LFQ.intensity.CG5620A.01", "LFQ.intensity.CG5620A.02", "LFQ.intensity.CG5620A.03", "LFQ.intensity.CG5620B.01","LFQ.intensity.CG5620B.02", "LFQ.intensity.CG5620B.03", "LFQ.intensity.CG5880.01", "LFQ.intensity.CG5880.02",  "LFQ.intensity.CG5880.03",  "LFQ.intensity.CG6017.01",  "LFQ.intensity.CG6017.02", "LFQ.intensity.CG6017.03", "LFQ.intensity.CG6618.01",  "LFQ.intensity.CG6618.02",  "LFQ.intensity.CG6618.03",  "LFQ.intensity.CG6627.01", "LFQ.intensity.CG6627.02", "LFQ.intensity.CG6627.03",  "LFQ.intensity.CG8314.01", "LFQ.intensity.CG8314.02",  "LFQ.intensity.CG8314.03", "LFQ.intensity.GsbPI.001", "LFQ.intensity.GsbPI.002",  "LFQ.intensity.GsbPI.003",  "LFQ.intensity.S2R.01", "LFQ.intensity.S2R.02", "LFQ.intensity.S2R.03"),
   condition = c("CG1407", "CG1407", "CG1407", "CG4676", "CG4676", "CG4676", "CG51963", "CG51963", "CG51963", "CG5620A", "CG5620A", "CG5620A", "CG5620B", "CG5620B", "CG5620B", "CG5880", "CG5880", "CG5880", "CG6017", "CG6017", "CG6017", "CG6618", "CG6618", "CG6618", "CG6627", "CG6627", "CG6627", "CG8314", "CG8314", "CG8314", "GsbPI", "GsbPI", "GsbPI", "S2R", "S2R", "S2R" ),
   replicate = rep(1:3, times=12),
  stringsAsFactors = FALSE
)

se <- import_MaxQuant(full_data, exp_design)
se
assay(se)[1:5, 1:5]


const-ae/proDD documentation built on Jan. 14, 2020, 9:34 a.m.