
The motifFUN package is an R package designed to search for motifs of interest using user-defined regular expressions and hidden Markov models (HMMs), along with prediction of functional features such as signal peptides, subcellular location in eukaryotes, and transmembrane helices.

Several command-line tools and web interfaces exist to predict individual motifs and domains (SignalP, TargetP, TMHMM); however, an interface that combines their outputs in a single flexible workflow is lacking. We developed the motifFUN package to fill that gap.

1.OUTLINE

The motifFUN package provides a platform to search for motifs and build automated multi-step secretome prediction pipelines that can be applied to large protein data sets. The features of this package are described below.

ORF extraction

In this package, the get_orf function extracts the six-frame translation of all ORFs (open reading frames) in a genome sequence. This step relies on EMBOSS getorf, which we recommend installing beforehand.

Pattern Search

This package searches for a user-defined motif of interest using a regular expression, in both genome and protein FASTA sequences.

HMMER

Functional domains

Secretome prediction often involves multiple steps.

Figure: Work flow diagram

Summary of Functionality:

| S/No | Function Name | Function | Input parameters | Dependencies |
|------|---------------|----------|------------------|--------------|
| 1 | get_orf() | Extraction of ORFs from a genome sequence | Genome FASTA file | EMBOSS getorf |
| 2 | orf_discard() | Discards sequences by length (lower limit to upper limit) | Protein FASTA file, upper limit, lower limit | No dependencies |
| 3 | parse_file() | Converts a single FASTA file into multiple FASTA files | Input file, output file name, number of proteins to be parsed | No dependencies |
| 4 | pattern.search() | Searches for a user-defined motif | FASTA sequence file, regular expression pattern | No dependencies |
| 5 | hmm.search() | Searches for sequence motifs in a sequence database | Original input FASTA file, pattern search candidates, MAFFT path, HMMER path | MAFFT, HMMER |
| 6 | get_signalp() | Predicts secretory proteins | Version, complete path of signalp, input file, organism type | signalp-3.0/signalp-5.0 |
| 7 | get_targetp() | Predicts the subcellular location of eukaryotic proteins | Path of targetp, organism group, input file | TargetP 1.1 |
| 8 | get_tmhmm() | Predicts transmembrane helices | Complete path of tmhmm, input file | TMHMM |
| 9 | hmm.plot() | Plots the bits (amino acid scores) of each amino acid and its position in the HMM profile | Motif candidate data frame | No dependencies |
| 10 | summary_motifs() | Extracts all non-redundant sequences and a summary table with information about the motifs | Motif candidates, motif pattern without range | No dependencies |

Due to limitations imposed by the external dependencies, some of the motifFUN wrapper functions (get_signalp, get_targetp, get_tmhmm) will not work on Windows or Mac; they are, however, fully functional on Linux.

2.REQUIREMENTS

R packages:

| ID | Name | Function |
|----|------|----------|
| 1 | seqinr | Reading FASTA files |
| 2 | ggplot2 | Producing plots |

External Tools:

| ID | Name | Function |
|----|------|----------|
| 3 | signalp 3.0, signalp 5.0 | Prediction of secretory proteins |
| 4 | targetp 1.1 | Prediction of the subcellular location of eukaryotic proteins |
| 5 | tmhmm | Prediction of transmembrane helices |
| 6 | mafft | Alignment of sequences |
| 7 | HMMER | Searching for motifs |
| 8 | EMBOSS | Extraction of the six-frame translation of ORFs |

3.EXTERNAL SOFTWARE

The motifFUN package uses signalp, targetp, and tmhmm to predict extracellular proteins that are secreted via classical pathways, and getorf to extract ORFs.

MAFFT and HMMER3 are used to perform the hidden Markov model search across the results from the REGEX step.

These tools should be installed before running any of the motifFUN functions.

$\color{Maroon}{\text{3.1.Downloading EMBOSS}}$

Read instructions and install

$\color{Maroon}{\text{3.2.Downloading signalP}}$

3.2.1.signalp-3.0

tar -zxvf signalp-3.0.Linux.tar.Z

$\color{blue}{\text{cd}}$ signalp-3.0

Edit "General settings" at the top of the signalp file. Set the value of the 'SIGNALP' variable to the path of your signalp-3.0 directory. Other variables usually do not require changes. For more details, please check signalp-3.0.readme.

3.2.2.signalp 5.0

tar -zxvf signalp-5.0.Linux.tar.gz

$\color{blue}{\text{cd}}$ signalp-5.0

$\color{Maroon}{\text{3.3.Downloading targetp-1.1}}$

3.3.targetp-1.1

tar -zxvf targetp-1.1b.Linux.tar.Z

$\color{blue}{\text{cd}}$ targetp-1.1

Edit the paragraph labeled "GENERAL SETTINGS, customize" at the top of the targetp file. Set values for the 'TARGETP' and 'TMP' variables. Ensure that the path to targetp does not exceed 60 characters; otherwise targetp-1.1 might fail.
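The 60-character limit can be checked from R before running anything. The following is a minimal sketch, assuming the limit applies to the length of the directory path string; `check_targetp_path` and the example paths are hypothetical names used only for illustration and are not part of motifFUN.

```r
# Hypothetical helper: TRUE when the path length is within the limit
# mentioned above (60 characters).
check_targetp_path <- function(path, max_chars = 60) {
  nchar(path) <= max_chars
}

# Made-up example paths
short_path <- "/opt/targetp-1.1"
long_path  <- paste0("/home/user/", strrep("very_long_dir/", 5), "targetp-1.1")

check_targetp_path(short_path)  # TRUE
check_targetp_path(long_path)   # FALSE
```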

$\color{Maroon}{\text{3.4.Downloading tmhmm,v.2.0}}$

3.4.tmhmm,v.2.0

tar -zxvf tmhmm-2.0c.Linux.tar.gz

$\color{blue}{\text{cd}}$ tmhmm-2.0c

$\color{Maroon}{\text{3.5.Downloading and installing MAFFT}}$

3.5.MAFFT

MAFFT is a multiple sequence alignment program that uses Fourier-transform algorithms to align multiple sequences [@Katoh2002]. We recommend downloading and installing MAFFT by following the instructions on the MAFFT installation web site.

Linux/OS X Users

On Ubuntu, run the following command in a terminal to download the MAFFT package:

$ wget https://mafft.cbrc.jp/alignment/software/mafft_7.427-1_amd64.deb

To install the downloaded package:

$ sudo dpkg -i mafft_7.427-1_amd64.deb
[sudo] password for username: (type your password)

Remember the directory in which MAFFT is installed. If the installation is successful, you can obtain the path from the shell (bash/tcsh):

which mafft
/usr/local/bin/mafft

For more information about MAFFT go to the MAFFT website: http://mafft.cbrc.jp/

Windows Users

MAFFT comes in two main distributions for Windows.

Please download and install the all-in-one version.

$\color{Maroon}{\text{3.6.Downloading and installing HMMER}}$

3.6.HMMER

HMMER is used for searching sequence databases for sequence homologs. It uses profile hidden Markov models (profile HMMs) [@Finn2011] to search for sequences that match the patterns captured in the profile. We use three main HMMER tools: hmmbuild, hmmpress, and hmmsearch.

The motifFUN package requires all of these tools. A correct HMMER installation will install all three programs.

Linux/OS X users

We recommend downloading and installing HMMER by following the instructions on the HMMER installation web site. Remember the directory in which HMMER is installed. If the installation is successful, you can obtain the paths from the shell (bash/tcsh):

which hmmbuild
which hmmpress
which hmmsearch

/usr/local/bin/hmmbuild
/usr/local/bin/hmmpress
/usr/local/bin/hmmsearch

For more information about HMMER go to the HMMER website: http://hmmer.org/

Windows users

To use the motifFUN package in Windows, the user must download the Windows binaries of HMMER. motifFUN will not work with any other version of HMMER.

4.WORK FLOW

4.1.Input Data

The motifFUN package is designed to predict sequence motifs in both nucleotide and amino acid sequences using a user-defined regular expression. It supports both gene FASTA and protein FASTA files as input.

The get_orf function can be used to translate a gene FASTA file into a protein FASTA file.
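For readers unfamiliar with the FASTA format, the sketch below reads a small FASTA file into a named character vector using base R only. It is an illustration of the input format, not how motifFUN reads files (the package lists seqinr for that purpose); `read_fasta_simple` is a hypothetical helper, and it assumes every header line is followed by at least one sequence line.

```r
# Minimal base-R FASTA reader (illustration only; assumes every header
# line is followed by at least one sequence line).
read_fasta_simple <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")
  record <- cumsum(is_header)            # record index of each line
  # Concatenate the sequence lines that belong to the same record
  seqs <- tapply(lines[!is_header], record[!is_header],
                 paste, collapse = "")
  # Use the first word of each header as the sequence identifier
  ids <- sub("\\s.*$", "", sub("^>", "", lines[is_header]))
  stats::setNames(as.character(seqs), ids)
}

# Toy input written to a temporary file
tmp <- tempfile(fileext = ".fasta")
writeLines(c(">seq1 example record", "ATGC", "GGTA", ">seq2", "TTAA"), tmp)

fasta <- read_fasta_simple(tmp)
fasta         # named vector: seq1 = "ATGCGGTA", seq2 = "TTAA"
nchar(fasta)  # sequence lengths: 8 and 4
```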

INPUT

library(motifFUN)
library(seqinr)
library(ggplot2)
orf_fasta <- system.file("tests", "testfile.fasta", package = "motifFUN")

4.2.ORF extraction

EMBOSS getorf is a software tool that finds and outputs the sequences of open reading frames (ORFs) in one or more nucleotide sequences. An ORF is a part of a reading frame that has the ability to be translated: a continuous stretch of codons that begins with a start codon (AUG) and ends with a stop codon (UAA, UAG, or UGA).
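The ORF definition above can be illustrated with a small base-R sketch. This is not the getorf algorithm and not what get_orf runs; it scans DNA (so ATG, TAA, TAG, TGA instead of the RNA codons), returns nucleotide ORFs rather than translations, and the helper names `revcomp` and `find_orfs` are hypothetical.

```r
# Reverse complement of a DNA string
revcomp <- function(dna) {
  comp <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

# ORFs (ATG ... stop codon) in the three forward frames of a DNA string.
# min_codons: minimum number of codons before the stop codon,
# counting the start codon.
find_orfs <- function(dna, min_codons = 2) {
  dna <- toupper(dna)
  stops <- c("TAA", "TAG", "TGA")
  orfs <- character(0)
  for (frame in 0:2) {
    n_codons <- (nchar(dna) - frame) %/% 3
    if (n_codons < 1) next
    starts <- frame + 3 * (seq_len(n_codons) - 1) + 1
    codons <- substring(dna, starts, starts + 2)
    open <- NA  # codon index at which the current ORF started
    for (i in seq_along(codons)) {
      if (is.na(open) && codons[i] == "ATG") {
        open <- i
      } else if (!is.na(open) && codons[i] %in% stops) {
        if (i - open >= min_codons) {
          orfs <- c(orfs, paste(codons[open:i], collapse = ""))
        }
        open <- NA
      }
    }
  }
  orfs
}

# Six frames = three forward frames + three frames of the reverse complement
dna <- "CCATGAAATTTTAGGG"
six_frame <- c(find_orfs(dna), find_orfs(revcomp(dna)))
six_frame  # "ATGAAATTTTAG"
```

ORFs that are still open at the end of the sequence are not reported in this sketch.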

ORF_filename <- get_orf(getorf.path = NULL, input.file = orf_fasta, output.file = NULL)

4.3.ORF discard

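Bhattacharjee et al. [@Bhattacharjee2006] noted that P. infestans candidate effectors contain at least 100 residues after the predicted signal sequence (SS) cleavage site, highlighting the conservation of the RxLR motif. The orf_discard function therefore filters ORFs by length; the example below uses a lower limit of 100 and an upper limit of 1800.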
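The idea of a length filter can be sketched in base R as follows. This is only an illustration on toy sequences: `filter_by_length` is a hypothetical helper, and treating both limits as inclusive is our assumption, not a documented property of orf_discard.

```r
# Hypothetical helper: keep sequences whose length lies within the limits
# (both limits treated as inclusive; this is an assumption).
filter_by_length <- function(seqs, lower.limit = 100, upper.limit = 1800) {
  len <- nchar(seqs)
  seqs[len >= lower.limit & len <= upper.limit]
}

# Toy sequences of 50, 250 and 2000 residues
seqs <- c(
  too_short = strrep("M", 50),
  kept      = strrep("M", 250),
  too_long  = strrep("M", 2000)
)

kept <- filter_by_length(seqs)
names(kept)  # "kept"
```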

ORF_disacrd_filename <- orf_discard(orf_file = ORF_filename, upper.limit = 1800, lower.limit = 100)

4.4.Motif Pattern Search

The motifFUN package provides the pattern.search function to search for the motif of interest.

Example with sample data:

Here we show an example that searches for sequences with RxLR-EER motifs in the 63-ORF subset of testfile.fasta. This example ORF data set contains 45 sequences with RxLR-EER motifs.

pattern <- "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}eer"
PATT_REG <- pattern.search(fasta.file = ORF_disacrd_filename, reg.pat = pattern)
head(PATT_REG, n = 2)

This function generates a FASTA file containing the motif sequences. We observe that the PATT_REG object has 24 sequences with the RxLR motif. These sequences will be aligned using MAFFT and used to build an HMM profile to search for similar sequences.

This FASTA file becomes the input to the signalp, targetp, and tmhmm functional programs.
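To show what the regular expression used above selects, the following base-R sketch applies it to two toy sequences. It does not call pattern.search; matching case-insensitively (so that the lowercase "eer" matches "EER") is our assumption about how the pattern is meant to be applied, and the toy sequences are made up.

```r
pattern <- "^\\w{10,40}\\w{1,96}R\\wLR\\w{1,40}eer"

# Toy protein sequences: one with an RxLR-EER arrangement, one without
seqs <- c(
  hit  = paste0(strrep("A", 20), "RSLR", strrep("A", 10), "EER", "AAAA"),
  miss = strrep("A", 60)
)

# Case-insensitive match (assumption, see above)
is_hit <- grepl(pattern, seqs, ignore.case = TRUE, perl = TRUE)
names(seqs)[is_hit]  # "hit"

# Position of the RxLR motif inside the matching sequence
rxlr_pos <- as.integer(regexpr("R\\wLR", seqs[["hit"]], perl = TRUE))
rxlr_pos  # 21
```

The leading `^\\w{10,40}\\w{1,96}` part requires at least 11 residues before the RxLR motif, and `\\w{1,40}` allows 1 to 40 residues between RxLR and EER.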

4.5.MAFFT & HMMER Search

To perform the HMM search and obtain all possible motifs from a proteome, motifFUN uses the PATT_REG results as a template to create an HMM profile and searches across the proteome of interest. The motifFUN package provides the hmm.search function to perform this search. The hmm.search function requires a local installation of MAFFT and HMMER.

The absolute paths of the binaries must be specified in the mafft.path and hmm.path options of the hmm.search function.

Note for Windows users: Please use the ABSOLUTE PATH for HMMER and MAFFT or motifFUN will not work (e.g. mafft.path = "C:/User/Banana/Desktop/mafft/")
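Before calling hmm.search, it can help to check from R whether the required binaries are visible on the PATH. The sketch below uses only base R (`Sys.which`); whether the path arguments expect the directory or the binary itself is not stated here, so passing the directory (as in the Windows example above) is an assumption.

```r
# Tools needed for the MAFFT and HMMER step
tools <- c("mafft", "hmmbuild", "hmmpress", "hmmsearch")

# Named character vector; an empty string means "not found on PATH"
tool_paths <- Sys.which(tools)

missing_tools <- names(tool_paths)[!nzchar(tool_paths)]
if (length(missing_tools) > 0) {
  message("Not found on PATH: ", paste(missing_tools, collapse = ", "))
}

# Directory containing the mafft binary, or NULL if it was not found
# (assumption: the path argument takes a directory)
mafft_dir <- if (nzchar(tool_paths[["mafft"]])) {
  dirname(tool_paths[["mafft"]])
} else {
  NULL
}
```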

In addition, the hmm.search function requires the path of the original FASTA file containing the translated ORFs in its original.seq parameter. hmm.search uses this file as the query for the hmmsearch program from HMMER and searches for all sequences with hits against the HMM profile created from the PATT_REG results.

We now perform an hmmsearch on our example data set. This function requires the location of the original FASTA file (stored in the ORF_disacrd_filename object), the location of the MAFFT binary, and the location of the HMMER binaries.

If the user does not provide the MAFFT and HMMER paths, the function falls back to default MAFFT and HMMER paths.

motif_candidates <- hmm.search(original.seq = ORF_disacrd_filename, regex.seq = PATT_REG, mafft.path = NULL, hmm.path = NULL)

The hmm.search function returned 41 motif candidates. As a reminder, we used the PATT_REG results of an RxLR motif search, so we can consider these hmm.search results to be RxLR candidate effectors.

The hmm.search function returns a list of 3 elements:

This function also combines the PATT_REG and HMM search results and writes them to a FASTA file. The user can use this file as input to the functional-domain predictions, or use the FASTA file produced by pattern.search instead.
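Combining two candidate sets without duplicates can be sketched in base R as below. This is an illustration on made-up sequences, not the hmm.search implementation; defining redundancy by identical sequence names is our assumption.

```r
# Toy candidate sets (made-up names and sequences)
regex_hits <- c(seqA = "MKRSLR", seqB = "MARALR")
hmm_hits   <- c(seqB = "MARALR", seqC = "MTRGLR")

# Concatenate, then keep the first occurrence of each sequence name
combined      <- c(regex_hits, hmm_hits)
non_redundant <- combined[!duplicated(names(combined))]

names(non_redundant)  # "seqA" "seqB" "seqC"
```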

4.6.signalp

The signal peptide is a short peptide, usually 16-30 amino acids long [@Hemminger1998], present at the N-terminus of the majority of newly synthesized proteins that are destined for the secretory pathway. These proteins include those that reside inside certain organelles, are secreted from the cell, or are inserted into most cellular membranes.

SignalP is a software tool that predicts the presence and location of signal peptide cleavage sites [@Nielsen1997] in amino acid sequences from different organisms: Gram-positive prokaryotes, Gram-negative prokaryotes, and eukaryotes.

The get_signalp function from the motifFUN package provides an interface to two versions of SignalP: signalp-3.0 and signalp-5.0.

get_signalp() requires the version, signalp path, input file, and organism type arguments:

For signalp-3.0

For signalp-5.0

If the input FASTA file contains more than 300 sequences, get_signalp automatically switches to parallel mode: it splits the input into smaller chunks and runs the prediction as a parallel process using a specified number of CPUs. If the user does not provide a signalp path, the function defaults to the path of the latest version, signalp-5. For signalp-3, the user needs to provide the signalp-3 path.
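The chunking idea behind the parallel mode can be sketched in base R as follows. This is not the get_signalp implementation: the chunk size of 300, the `run_chunk` stand-in, and the toy identifiers are all hypothetical, and the chunks are processed sequentially with `lapply` for portability.

```r
# 650 toy sequence identifiers standing in for a large FASTA input
seq_ids <- sprintf("seq%04d", 1:650)

# Split into chunks of at most 300 entries (hypothetical chunk size)
chunk_size <- 300
chunks <- split(seq_ids, ceiling(seq_along(seq_ids) / chunk_size))
lengths(chunks)  # 300 300 50

# Stand-in for running a predictor on one chunk
run_chunk <- function(ids) {
  data.frame(id = ids, n_char = nchar(ids), stringsAsFactors = FALSE)
}

# Process every chunk and bind the per-chunk results together
results <- do.call(rbind, lapply(chunks, run_chunk))
nrow(results)  # 650
```

On Linux, `lapply` could be swapped for `parallel::mclapply` with an `mc.cores` argument to process the chunks on several CPUs.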

input_file <- "testfile_orf_descard_REGEX.fasta"
signalp5 <- get_signalp(signalp.path = NULL, signalp.version = 5, input_file = input_file, org.type = "-org euk")
head(signalp5)

This function generates a text file with the signalp results and returns a summary data frame.

4.7.targetp

TargetP 1.1 is a software tool that predicts the subcellular location of eukaryotic proteins. TargetP provides a potential cleavage site for sequences predicted to contain a cTP, mTP or SP. The get_targetp function requires a targetp path, an organism group, and an input file.

If the user does not provide a targetp path, the function falls back to a default targetp path.

targetp <- get_targetp(targetp.path = NULL, organismgroup = "-P", file = input_file)
head(targetp)

LOCALIZATION column:

4.8.tmhmm

TMHMM predicts transmembrane α-helices and identifies integral membrane proteins based on HMMs [@Krogh2001a].

Figure: Transmembrane helical domain

The get_tmhmm function requires a tmhmm path and an input FASTA file. If the user does not provide a tmhmm path, the function falls back to a default tmhmm path.

tmhmm <- get_tmhmm(tmhmm.path = NULL, file = "testfile_orf_descard_REGEX.fasta")
head(tmhmm)

4.9.Motif summaries and motif sequences

The user can extract all of the non-redundant sequences and a summary table with information about the motifs using the summary_motifs function. This function uses the results from either the hmm.search or the pattern.search function to generate a table that includes the name of each candidate motif sequence, the number of motifs of interest per sequence, and their locations within the sequence.
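The kind of table described above (motif count and location per sequence) can be sketched in base R with `gregexpr`. This is an illustration on made-up sequences, not the summary_motifs implementation; the column names are hypothetical, and `gregexpr` reports non-overlapping matches only.

```r
# Toy sequences: seq1 contains two RxLR motifs, seq2 contains none
seqs <- c(seq1 = "AARSLRAAEERAARALRA", seq2 = "AAAA")

# Start positions of every (non-overlapping) RxLR match; -1 means no match
m <- gregexpr("R\\wLR", seqs, perl = TRUE)

motif_table <- data.frame(
  sequence  = names(seqs),
  n_motifs  = vapply(m, function(x) sum(x > 0), integer(1)),
  positions = vapply(
    m,
    function(x) if (x[1] > 0) paste(x, collapse = ",") else NA_character_,
    character(1)
  ),
  stringsAsFactors = FALSE
)

motif_table
# seq1 has 2 motifs starting at positions 3 and 14; seq2 has none
```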

motif_summary <- summary_motifs(hmm.result = motif_candidates, reg.pat= pattern, signalp_version = 5, input_file = input_file)
head(motif_summary$consensus.sequences, n = 2)
head(motif_summary$motif.table, n = 5)

4.10.Visualizing HMM profile

To determine whether the HMM profile includes the motifs of interest, motifFUN provides the hmm.plot function, which reads the HMM profile (obtained from the hmm.search step) and uses ggplot2 to create a point plot. The plot illustrates the bits (amino acid scores) of each amino acid used to construct the HMM profile according to its position in the HMM profile.

hmm.plot(hmm_data = motif_candidates$HMM_Table)
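For readers who want to see what such a point plot looks like without running the full pipeline, the sketch below builds one from a toy table with ggplot2. It is not the hmm.plot implementation: the data are made up, and the column names `position`, `amino_acid`, and `bits` are hypothetical and may differ from those in HMM_Table.

```r
library(ggplot2)

# Toy stand-in for an HMM profile table (made-up scores, hypothetical columns)
hmm_toy <- data.frame(
  position   = rep(1:4, each = 3),
  amino_acid = rep(c("R", "L", "E"), times = 4),
  bits       = c(2.1, 0.3, 0.2,
                 0.4, 0.5, 0.3,
                 0.2, 2.4, 0.1,
                 2.0, 0.2, 0.6)
)

# One point per amino acid and profile position, coloured by amino acid
p <- ggplot(hmm_toy, aes(x = position, y = bits, colour = amino_acid)) +
  geom_point(size = 3) +
  labs(x = "Position in HMM profile", y = "Bits", colour = "Amino acid")

print(p)
```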

sessioninfo

sessionInfo()

References



computational-genomics-lab/motifFUN documentation built on June 4, 2019, 7:52 a.m.