In dereksonderegger/PepSeq: Tools for analyzing Peptide Sequencing results

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

PepSeq is a package developed to facilitate the exploration of PMI's PepSeq data. Given a pulldown, we want to explore how the cleaved-uncleaved signal varies across different locations in the proteins.

To begin, we will install the package via the following R commands:

library(devtools)
install_github('dereksonderegger/PepSeq')  # install latest version of Derek's PepSeq tools

library(tidyverse)   # load the usual set of data processing tools
library(PepSeq)      # load Derek's library of routines

The first step is to set the working directory to wherever the pulldown .csv file is located. This can be done via the point and click interface (Session -> Set Working Directory) or using the command setwd. Once the working directory is set to where ever the data file lives, you can define the input and output files.

setwd('~/GitHub/PepSeq/inst/extdata')
input_file <- 'PepSeq_vs_NN.csv'
output_file <- 'PepSeq_vs_NN.pdf'

Alternatively, you could just list the full pathname to the input and output files and ignore the working directory business.

input_file  <- '~/GitHub/PepSeq/inst/extdata/PepSeq_vs_NN.csv'
output_file <- '~/GitHub/PepSeq/inst/extdata/PepSeq_vs_NN.pdf'

PepSeq assumes that a pulldown .csv file is arrange in a format with one protein/location combination per line and then several sets of cleaved/uncleaved pairs and possibly columns of data that aren't cleaved/uncleaved but rather just some other signal (e.g. output from another model as to the likelihood of binding).

read.csv(input_file) %>% colnames()

Notice that in the example pulldown file, there is just one experiment (wTEV) and two other added columns model of neural net model predictions and if the number of literature citations. The orderings of all the columns doesn't matter, but you need to be able to identify the protein and position columns and each of the cleaved/uncleaved pairs need to some sort of cleaved/uncleaved prefix or postfix. For example, columns might have a group name followed by an 'Cleaved' or '_Uncleaved'. Alternatively, the column names might begin with 'Cl' or 'Un_'. So long as the Cleaved_Type_Indicators denotes them as a vector strings (order is cleaved then uncleaved), then the function will search for those two strings within the column name and appropriately identify the cleaved and uncleaved values. If a column name doesn't contain the 'Cleaved' or 'Uncleaved' suffix on a column then we assume the column is a measurement and we will directly pass the values through without separating the cleaved/uncleaved and doing any standardization.

In order to allow there to be many other interesting columns that should be ignored in the visualization, PepSeq requires you to indicate which columns are to be plotted by either having an initial suffix indicating it is data or to indicate the responses by column location (e.g. columns 3 to 6).

# There is a function to import from a .csv data file
data_input <- import_pulldown(
  file = input_file, 
  standardization_method = 'additive',
  protein_column = 'Protein_ID', 
  position_column = 'Start_Loc',
  read_indicator = 3:6,
  Cleaved_Type_Indicators = c('_Cleaved', '_Uncleaved')  )  # cleaved/uncleaved columns end with _Cleaved or _Uncleaved

data_input <- import_pulldown(
  file = input_file,                # input file
  standardization_method = 'additive',
  protein_column = 'Protein_ID',    # which is the protein name
  position_column = 'Start_Loc',    # which is the starting location
  read_indicator = c(3,4,5,6),      # column range to be visualized
  Cleaved_Type_Indicators = c('_Cleaved', '_Uncleaved') )

Notice that in each of the calls, we also had to specify which column specified the protein name and which column denotes the position along the protein.

Once the data has been read in, it can then be passed to the visualization program. In the visualization program, you can control the final height and width of the resulting figure as well as the maximum and minimum values for the y-axis. By including a peaks=TRUE option, the peaks will be highlighted. The default peak fitting procedure is 'Peaks over Threshold' and requires a theshold parameter, which can be set via the peak_param option.

plot_pulldown( 
  input = data_input,               # input data
  output_file = output_file,        # output file name
  peaks = TRUE,                     # Show the peaks
  peak_method = 'PoT',              # Use "Peaks over Threshold"
  peak_param = c(.5, 90, 10) )      # thresholds are by row

Finally if we want to look at the interactive Shiny version, all the options are the same, but we don't need the output_file.

plot_pulldown_Shiny( 
  data_input,                       # input data
  peaks = TRUE,                     # Show the peaks
  peak_method = 'PoT',              # Use "Peaks over Threshold"
  peak_param = c(.5, 90, 10) )      # thresholds are by row

Notice in the Shiny application there are options to rescale the min and max of the y-axis, the window width (to squish or stretch the data points apart), as well as to select which rows are shown. While it is possible to resize the window and have the x-axis expand with the window, the same is frustratingly not possible for the y-axis. To get around this, I've added a height option that can be set to create a larger or smaller graph.

plot_pulldown_Shiny( 
  data_input,                       # input data
  peaks = TRUE,                     # Show the peaks
  peak_method = 'PoT',              # Use "Peaks over Threshold"
  peak_param = c(.5, 90, 10),       # thresholds are by row 
  height = 500 )                    # The graph height (in pixels)

A similar trick can be done for the scaling the height and width of the pdf output. This is particularly handy for stretching the x-axis to get a bit more readability.

plot_pulldown( 
  input = data_input,               # input data
  output_file = output_file,        # output file name
  peaks = TRUE,                     # Show the peaks
  peak_method = 'PoT',              # Use "Peaks over Threshold"
  peak_param = c(.5, 90, 10),       # thresholds are by row 
  height=8, width = 120 )

The reason I've allowed for a way to separate the data import from the visualization is that it can be useful to do a little pre-processing. For example, perhaps we want to only look at a few of the experiments instead of all of them.

input_file  <- '~/Dropbox/NAU/Research/PepSeq/Pulldown Visualization/Example_SeveralRows_3-27.csv'
output_file <- '~/Dropbox/NAU/Research/PepSeq/Pulldown Visualization/Example_SeveralRows_3-27.pdf'

# What columns are we looking at in this example?
read.csv(input_file) %>% colnames()

# Read in this second example set of data
data_input <- import_pulldown(
  file = input_file,                # input file
  protein_column = 'protein_ID',    # which is the protein name
  position_column = 'position',     # which is the starting location
  read_indicator = 'X',             # column range to be visualized
  Cleaved_Type_Indicators = c('_cleaved', '_uncleaved'))

# Now make a smaller set of data that only includes X01.01 and X04.04
data_small <- data_input%>%
  filter( Group %in% c('X01.01_KTX', 'X04.04_KTX') ) # Just graph these!

plot_pulldown_Shiny( 
  input = data_small,             # input data
  peaks = TRUE,                   # Show the peaks
  peak_method = 'PoT',            # Use "Peaks over Threshold"
  peak_param = c(1, 1) )          # thresholds are by row

Another thing that we want to be able to do is to plot the cleaved vs uncleaved. Suppose we have a pulldown with several different experiments and we want to explore which ones are better by plotting the cleaved vs uncleaved.

input_file  <- '~/GitHub/PepSeq/inst/extdata/counts_annotated_PTX_short.csv'

foo <- read.csv(input_file); colnames(foo); # figure out which columns to use

data_input <- import_pulldown(
  file = input_file,                # input file
  protein_column = 'protein_ID',    # which is the protein name
  position_column = 'Peptide.start',     # which is the starting location
  read_indicator = c(21:24,27:30,35:38,46:47),             # column range to be visualized
  Cleaved_Type_Indicators = c('_cleaved', '_uncleaved') )

ggplot(data_input, aes(x=uncleaved, y=cleaved)) +
  geom_point() +
  facet_wrap( ~ Group )