This tutorial is still under development and may change slightly as it is being edited.
This tutorial is relevant to the published paper RNA exploits an exposed regulatory site to inhibit the enzymatic activity of PRC2. However, the workflow can be used for data processed by MaxQuant. The data for this tutorial is available here.The relevant files are located within the .zip files for each repeat as MaxQuant_txt_folder_output/peptides.txt. For simplicity, all "peptides.txt" files have been re-named according to their experiment numbers as part of the external data available as part of the R package.
In this tutorial, we w
require(knitr) #get the doi from the manuscript #make the directory that has everything in it with the right naming conventions knitr::opts_knit$set(root.dir = '/Users/emmagail/Downloads/NSMB_RBDmap')
First, load the libraries needed for the package and tutorial.
library(ggplot2) library(bio3d) library(Biostrings) library(seqinr) library(RColorBrewer) library(openxlsx) library(viridis) library(crisscrosslinker) library(stringr) library(svglite) #source('~/Documents/monash/project.functions/R/pdb_functions.R')
library(ggplot2) library(bio3d) library(Biostrings) library(seqinr) library(RColorBrewer) library(openxlsx) library(viridis) library(stringr) library(svglite) library(crisscrosslinker)
When loading the data, it is important to make sure that the files for input are named well so the functions will be able to correctly identify and organize the data.
sys.path <- system.file("extdata/NSMB_RBDmap",package = 'crisscrosslinker',mustWork = TRUE) sys.files <- list.files(sys.path) sys.files
Each of the file names contains 3 pieces of information: the name of the experiment, if it is an input/eluate file, and which protease was used for the experiment.
head(read.table(paste0(sys.path,'/',sys.files)[1], sep='\t',header = TRUE))
If comparing between multiple experiments, indicate the experiment names in the code.
#experiment_names <- c('experiment1','experiment2') experiment_names <- c('Repeat02_NoXL','Repeat02_UV')
We will also need to load the FASTA file that was used during the initial analysis to get the tryptic peptides. For more information on how this FASTA file was generated, please refer to the paper.
fasta_file <- seqinr::read.fasta(system.file("extdata/NSMB_FASTA",'PRC2_5m.fasta',package = 'crisscrosslinker',mustWork = TRUE))
If the file format is from MaxQuant with an Andromeda score, you will need to set it. The default setting is 20 and we will explicitly set it in this demonstration.
cutoff_score <- 20
We are now ready to get a list of the relevant hits for our data.
sequence_hit_list <- rbd.makeSeqHitList(fasta_file = fasta_file, file_format = 'txt', cutoff_score = cutoff_score, experiment_directory = sys.path) sequence_hit_list[[2]]
Here we have found all of the tryptic peptides that:
Since we want to compare the input and eluate data from the experiments, let's make a table comparing the two.
input_eluate_table <- rbd.makeIETable(sequence_hit_list, experiment_names)
We can see how the results look here:
head(input_eluate_table)
We can now move on to making a plot.
As you can see, many of the hits that were found in the input were not found in the eluate. We will filter those later. For now, let's plot everything.
input_eluate_graph <- rbd.makeIEPlot(input_eluate_table, experiment_names) print(input_eluate_graph)
We can change the default settings of the plot, including colors and the title by inputting some more variables:
name_of_experiment <- 'RBDmap Experiments: Input vs. Eluate' input_eluate_graph <- rbd.makeIEPlot(input_eluate_table, experiment_names, palette = 'Pastel1', fill = 'white', colour = 'black', size = 1, title = name_of_experiment) print(input_eluate_graph)
Now that we've compared our control, let's remove all results that do not show up in the eluate files.
filtered_input_eluate_table <- input_eluate_table[input_eluate_table$eluate > 0,] head(filtered_input_eluate_table)
We'll use these results for finding binding sequences. We'll first get a data.frame of the binding sequences aligned to the FASTA file.
bs_output_fasta <- rbd.getBSfromIET(filtered_input_eluate_table, fasta_file = fasta_file, align_to = 'FASTA') head(bs_output_fasta[,1:6])
Let's make a heatmap and sort the sequences by their names.
bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta, name_by = 'pro_name', heatmap = TRUE, db_selection = 'FASTA')
We can also adjust the colors of the heatmap if we so wish. You can put in a standard RColorBrewer
or viridis
palette or a list of hexcodes and/or standard R Colors.
bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta, name_by = 'pro_name', heatmap = TRUE, db_selection = 'FASTA', colors = "Blues", save_plot = FALSE)
#do a grid of different color options? bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta, name_by = 'pro_name', heatmap = TRUE, db_selection = 'FASTA', colors = c('#b3ecec','#43e8d8','#48d1cc'), save_plot = FALSE)
bs_freqVector <- rbd.freqVector(bs_output = bs_output_fasta, name_by = 'pro_name', heatmap = TRUE, db_selection = 'FASTA', colors = c("#465ca8","#7767b8","#a272c4", "#cc7ecc",'#f38bd1'), #palette from color-hex.com save_plot = FALSE)
If you require further customization, we recommend using the freqVector directly within a heatmap and customizing from there.
We will now align the binding sequences to a PDB file. For simplicity, we'll only focus on one protein: EZH2.
ezh2_iet <- filtered_input_eluate_table[filtered_input_eluate_table$protein == 'EZH2_Q15910-2',]
We already know the PDB file and chain that we want to align the proteins to. In this case, we can create and add a new table:
protein_dict <- read.csv(system.file("extdata/NSMB_misc",'PRC2_protein_dict.csv',package = 'crisscrosslinker',mustWork = TRUE)) head(protein_dict)
Now we don't have to go through menus to be able to define the PDB ID/chain that we need to use. This will save us time when making our table of binding sequences.
You can always leave the dictionary blank in order to access an interactive menu to be able to use BLAST in order to align your sequence to a known PDB structure.
bs_ezh2 <- rbd.getBSfromIET(ezh2_iet, fasta_file = fasta_file, align_to = 'PDB', protein_dict = protein_dict)
Let's see how many matches we had to the chain we chose:
bs_ezh2$db
2 matched our chain! Let's visualize it.
Now that we have the binding sequences aligned to a PDB file, we can create a PyMOL file.
rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = NULL)
{width=100%}
There is an additional legend with the colors corresponding to the binding sequences. The colors will be assigned based on the order the binding sequences are in. If you would like the sequences in a different order, please order them before executing rbd.pymol().
If we do not choose any colors, it will default to the PyMOL tints palette. However, we can input any RColorBrewer or viridis palette as well as standard PyMOL/R colors or hexcodes.
par(mfrow=c(2,2)) rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = 'Dark2', file.name = 'pymol_Dark2.pml') #RColorBrewer palette rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = 'magma', file.name = 'pymol_magma.pml') #viridis palette rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = c('red','blue'), file.name = 'pymol_rgb.pml') #standard colors rbd.pymol(bs_ezh2, color_by = 'binding_sequence', colors = c("#f8a1be","#ffb6b1"), file.name = 'pymol_hexcodes.pml') #hexcodes
![](/Users/emmagail/Downloads/NSMB_RBDmap/Asset 1prc2_rbd_all.png){width=100%}
We have now looked at one experiment, but now we want to look at multiple experiments we have run and compare between them. First we will do what we have done before, but combine the different runs together to examine how often these binding sites appear.
Each of these experiments have slighly different prefixes, so we will adjust them accordingly with a list. In this instance, we could probably get away with just using "UV" for all 3 repeats. However, for the purposes of this tutorial, we will explicitly define the prefixes.
We're now ready to proceed to get the binding sequences for each of these repeats.
bs_output_diff_analysis <- data.frame() num_repeats <- 1:4 repeat_names <- paste0('Repeat0',num_repeats) for(rname in repeat_names){ shl <- sequence_hit_list[startsWith(names(sequence_hit_list),rname)] enames <- paste0(rname,'_',c('UV','NoXL')) input_eluate_table <- rbd.makeIETable(shl, enames) filtered_input_eluate_table <- input_eluate_table[input_eluate_table$eluate > 0,] bs_output_pdb <- rbd.getBSfromIET(filtered_input_eluate_table, fasta_file = fasta_file, align_to = 'PDB', protein_dict = protein_dict) bs_output_diff_analysis <- rbind(bs_output_diff_analysis,data.frame(bs_output_pdb)) }
Now that we have loaded the data into our R environment, let's make a PyMOL file:
rbd.pymol(bs_output_diff_analysis, color_by = 'freq', colors = 'magma', file.name = 'pymol_magma_diff_analysis.pml')
It will also create a heatmap with the same color scheme as the PyMOL file if heatmap
is kept as TRUE
. A legend will also be generated to go with the PyMOL file, though it will be similar to the one generated by the heatmap.
{width=50%}
Since slightly different but overlapping peptides may show up due to the use of different proteases, the frequency is measured by the amino acid rather than the full peptide.
Castello, A. et al. Comprehensive Identification of RNA-Binding Domains in Human Cells. Molecular Cell 63, 696-710, doi:10.1016/j.molcel.2016.06.029 (2016).
Tyanova, S., Temu, T. & Cox, J. The MaxQuant computational platform for mass spectrometry-based shotgun proteomics. Nature Protocols 11, 2301-2319, doi:10.1038/nprot.2016.136 (2016).
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.