knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  echo = TRUE,
  warning = FALSE,
  message = FALSE,
  prompt = FALSE,
  fig.pos = 'h',
  highlight = FALSE
)
knitr::knit_theme$set("edit-matlab")
options(width = 120)

What is ragp?

Hydroxyproline-rich glycoproteins (HRGPs) are one of the most complex families of macromolecules found in plants, owing to the diversity of the glycans decorating the protein backbone as well as the heterogeneity of the protein backbones themselves. While this diversity is responsible for the wide array of physiological functions associated with HRGPs, it hinders attempts at homology-based identification. Current approaches, based on identifying sequences with characteristic motifs and biased amino acid composition, are limited to prototypical sequences.

ragp is an R package for mining and analysis of HRGPs, with emphasis on arabinogalactan proteins (AGPs). The ragp filtering pipeline exploits one of the key features of HRGPs: the presence of hydroxyprolines, which represent glycosylation sites. The main package features include prediction of proline hydroxylation sites, amino acid motif and bias analyses, efficient communication with web servers for prediction of N-terminal signal peptides and glycosylphosphatidylinositol modification sites, and the ability to annotate sequences through CDD or hmmscan, followed by GO enrichment based on predicted Pfam domains.

The ragp workflow is illustrated in the following diagram (the ragp functions used for each task are shown in grey boxes):

knitr::include_graphics("ragp_flow_chart.svg")

The filtering layer:

The analysis layer:

Additionally, ragp provides tools for visualizing the mentioned attributes via plot_prot().

Installation

There are several ways to install R packages hosted on GitHub; the simplest is to use devtools::install_github(), which performs all the required steps automatically.

To install ragp run:

#install.packages("devtools") #if it is not installed on your system
devtools::install_github("missuse/ragp")

Alternatively, to also build the vignettes, run:

# install.packages("devtools")
devtools::install_github("missuse/ragp",
                         build_vignettes = TRUE)

The vignettes can then be viewed with:

browseVignettes("ragp")

Data import

Inputs

Most ragp functions require single-letter protein sequences and the corresponding identifiers as input. These can be provided as basic R data types such as vectors or data frames. Additionally, ragp imports the seqinr package for the manipulation of FASTA files, so the input can also be a list of SeqFastaAA objects as returned by seqinr::read.fasta(). The path to a FASTA file can be supplied as input as well. As of ragp version 0.3.5, objects of class AAStringSet are also supported.

Input options will be illustrated using the scan_ag() function:

library(ragp)
data(at_nsp) #a data frame of 2700 Arabidopsis sequences

#sequences and identifiers supplied as vectors
input1 <- scan_ag(sequence = at_nsp$sequence,
                  id = at_nsp$Transcript.id)

#a data frame plus the names of the relevant columns
input2 <- scan_ag(data = at_nsp,
                  sequence = "sequence",
                  id = "Transcript.id")

Quoting column names is not necessary:

input3 <- scan_ag(data = at_nsp,
                  sequence = sequence,
                  id = Transcript.id) 
library(seqinr) #to create a fasta file with protein sequences

#write a FASTA file
seqinr::write.fasta(sequence = strsplit(at_nsp$sequence, ""),
                    name = at_nsp$Transcript.id, file = "at_nsp.fasta")

#read a FASTA file to a list of SeqFastaAA objects
At_seq_fas <- read.fasta("at_nsp.fasta",
                         seqtype =  "AA", 
                         as.string = TRUE) 

#a list of SeqFastaAA objects as input
input4 <- scan_ag(data = At_seq_fas)

#path to a FASTA file as input (at_nsp.fasta is in the working directory)
input5 <- scan_ag(data = "at_nsp.fasta")

#an AAStringSet object as input (read with the Biostrings package)
dat <- Biostrings::readAAStringSet("at_nsp.fasta")
input6 <- scan_ag(data = dat)

All of the outputs are equal:

all.equal(input1,
          input2)

all.equal(input1,
          input3)

all.equal(input1,
          input4)

all.equal(input1,
          input5)

all.equal(input1,
          input6)
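
The pairwise comparisons above can also be folded into a single check. The following is a minimal sketch and not part of ragp: all_equal_to() is a hypothetical helper, and the small data frames (with illustrative column names) stand in for the scan_ag() outputs so that the snippet runs on its own.

```r
# Hypothetical helper (not a ragp function): returns TRUE only if every
# object passed in `...` is all.equal() to `reference`.
all_equal_to <- function(reference, ...) {
  others <- list(...)
  all(vapply(others,
             function(x) isTRUE(all.equal(reference, x)),
             logical(1)))
}

# Toy stand-ins for the scan_ag() outputs; column names are illustrative
ref   <- data.frame(id = c("a", "b"), n = c(1, 2))
same  <- data.frame(id = c("a", "b"), n = c(1, 2))
other <- data.frame(id = c("a", "b"), n = c(1, 3))

all_equal_to(ref, same)         # TRUE
all_equal_to(ref, same, other)  # FALSE
```

With the objects created in this section, the corresponding call would be all_equal_to(input1, input2, input3, input4, input5, input6).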

The only exceptions to this design are the plotting function plot_prot(), which requires protein sequences to be supplied as character vectors (as for input1 in the example above), and pfam2go(), which does not take sequences as input.

Further reading

All ragp functions return basic R data structures such as data frames, lists of vectors and lists of data frames, which makes them easy to manipulate for anyone familiar with R. A particularly effective way to work with these objects is the tidyverse collection of packages, notably dplyr and ggplot2. Several dplyr functions that are especially handy for data wrangling are:

Examples of using these functions on objects returned by ragp functions are provided in the HRGP filtering and HRGP analysis tutorials. Many further examples of these functions are available online.
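
As a rough illustration of this kind of data wrangling, the sketch below applies a few commonly used dplyr verbs to a toy data frame. The data frame and its column names (id, motif_count, has_signal) are hypothetical and chosen only for illustration; they are not the actual output of any ragp function.

```r
library(dplyr)

# Toy data frame standing in for a ragp result; the column names
# (id, motif_count, has_signal) are hypothetical.
hits <- data.frame(
  id = c("seq1", "seq2", "seq3", "seq4"),
  motif_count = c(0, 3, 7, 2),
  has_signal = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Keep sequences with at least one motif and sort by decreasing motif count
candidates <- hits %>%
  filter(motif_count > 0) %>%
  arrange(desc(motif_count))

# Summarise per group: number of sequences and mean motif count
by_signal <- hits %>%
  group_by(has_signal) %>%
  summarise(n_seq = n(), mean_motifs = mean(motif_count))
```

Because ragp functions return ordinary data frames, the same pattern should carry over to real results once the actual column names are substituted.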

Clear visualizations are usually the goal of the data manipulations mentioned above. The gold standard of R graphics at present is the ggplot2 package, and we recommend it for graphically summarizing the data. Additionally, ragp contains the plot_prot() function, which is a wrapper around ggplot2. While plot_prot() can be used without knowing ggplot2 syntax, tweaking the plot style requires at least a basic knowledge of ggplot2. Examples are provided in the protein sequence visualization tutorial.
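
For readers new to ggplot2, the following minimal sketch shows the general shape of a plot specification. It uses a toy data frame with hypothetical column names rather than real ragp output, and it does not call plot_prot().

```r
library(ggplot2)

# Toy data; the column names are hypothetical, not taken from ragp output
hits <- data.frame(
  id = c("seq1", "seq2", "seq3"),
  motif_count = c(3, 7, 2)
)

# Bar chart of motif counts per sequence, with axis labels and a plain theme
p <- ggplot(hits, aes(x = id, y = motif_count)) +
  geom_col() +
  labs(x = "Sequence", y = "Motif count") +
  theme_bw()

print(p)
```

Layers such as labs() and theme_bw() are added with +, which is the general ggplot2 mechanism for adjusting the appearance of a plot.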

Acknowledgements

This software was developed with funding from the Ministry of Education, Science and Technological Development of the Republic of Serbia (Projects TR31019 and OI173024).


missuse/ragp documentation built on Jan. 4, 2022, 10:49 a.m.