computeMotifs: Counting the Number of Motifs in RNA or Protein Sequences

View source: R/Motifs.R

computeMotifs    R Documentation

Counting the Number of Motifs in RNA or Protein Sequences

Description

Counts the number of motifs occurring in RNA/protein sequences. Motifs employed by tool "rpiCOOL" can be selected. New motifs can also be defined.

Usage

computeMotifs(
  seqs,
  seqType = c("RNA", "Pro"),
  motifRNA = c("rpiCOOL", "Fox1", "Nova", "Slm2", "Fusip1", "PTB", "ARE", "hnRNPA1",
    "PUM", "U1A", "HuD", "QKI", "U2B", "SF1", "HuR", "YB1", "AU", "UG", "selected5"),
  motifPro = c("rpiCOOL", "E", "H", "K", "R", "H_R", "EE", "KK", "HR_RH", "RS_SR", "RGG",
    "YGG"),
  newMotif = NULL,
  newMotifOnly = FALSE,
  parallel.cores = 2,
  cl = NULL
)

Arguments

seqs

sequences loaded by function read.fasta from the seqinr package, or a list of RNA/protein sequences. RNA sequences will be converted to lower case letters, and protein sequences will be converted to upper case letters. Each sequence should be a vector of single characters.
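A minimal sketch of preparing such a list by hand in base R, assuming the sequences start out as plain character strings. The names rawSeqs and seqs and the sequences themselves are illustrative only and are not part of the package.

```r
# Hypothetical raw sequences (illustrative data only).
rawSeqs <- c(rna1 = "AUGCAUGUUAUUUAUU", rna2 = "UGUAAAUAGG")

# Split each string into a vector of single characters,
# which is the per-sequence format described for "seqs".
seqs <- lapply(rawSeqs, function(s) strsplit(s, split = "")[[1]])

# Inspect the result: a named list of character vectors.
str(seqs)
```

A list built this way keeps the names of rawSeqs, which is convenient when the names are later needed to identify each sequence.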

seqType

a string that specifies the nature of the sequence: "RNA" or "Pro" (protein). If the input is DNA sequence and seqType = "RNA", the DNA sequence will be converted to RNA sequence automatically. Default: "RNA".
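The DNA-to-RNA conversion can be pictured with the base R sketch below. It is only an illustration of replacing T with U and lower-casing the result; the package's internal conversion may be implemented differently, and dnaSeq and rnaSeq are hypothetical names.

```r
# Hypothetical DNA sequence as a vector of single characters.
dnaSeq <- c("A", "T", "G", "C", "A", "T", "G", "T")

# Replace T/t with U/u, then convert to lower case
# (RNA sequences are handled in lower case by this function).
rnaSeq <- tolower(chartr("Tt", "Uu", dnaSeq))

paste(rnaSeq, collapse = "")
```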

motifRNA

strings specifying the motifs that are counted in RNA sequences. Ignored if seqType = "Pro". Options: "rpiCOOL", "selected5", "Fox1", "Nova", "Slm2", "Fusip1", "PTB", "ARE", "hnRNPA1", "PUM", "U1A", "HuD", "QKI", "U2B", "SF1", "HuR", "YB1", "AU", and "UG". Multiple elements can be selected at the same time. If "rpiCOOL", all default motifs will be counted. "selected5" indicates the total number of the occurrences of: PUM, Fox-1, U1A, Nova, and ARE, which are regarded as the five most over-represented binding motifs. See details below.

motifPro

strings specifying the motifs that are counted in protein sequences. Ignored if seqType = "RNA". Options: "rpiCOOL", "E", "H", "K", "R", "H_R", "EE", "KK", "HR_RH", "RS_SR", "RGG", and "YGG". Multiple elements can be selected at the same time. "H_R" indicates the total number of the occurrences of: H and R. "HR_RH" indicates the total number of the occurrences of: HR and RH. "RS_SR" indicates the total number of the occurrences of: RS and SR. If "rpiCOOL", the default motifs of rpiCOOL ("E", "K", "H_R", "EE", "KK", "RS_SR", "RGG", and "YGG") will be counted. See details below.

newMotif

list defining new motifs not listed above. New motifs are counted in RNA or protein sequences. For example, newMotif = list(hnRNPA1 = c("UAGGGU", "UAGGGA"), SF1 = "UACUAAC"). This parameter can be used together with parameter motifRNA or motifPro. Default: NULL.

newMotifOnly

logical. If TRUE, only the new motifs defined in newMotif will be counted. Default: FALSE.

parallel.cores

an integer specifying the number of cores for parallel computation. Default: 2. Set parallel.cores = -1 to use all available cores. The value must be -1 or >= 1.
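The rule for parallel.cores can be sketched as a small validation helper. resolveCores is a hypothetical function written for illustration and is not part of the package; it assumes that "all the cores" corresponds to the value reported by parallel::detectCores().

```r
# Hypothetical helper: turn a parallel.cores value into a core count.
resolveCores <- function(parallel.cores = 2) {
  if (parallel.cores == -1) {
    # Assumption: "all cores" means what detectCores() reports.
    n <- parallel::detectCores()
    if (is.na(n)) n <- 1L  # detectCores() can return NA on some systems
    return(as.integer(n))
  }
  if (parallel.cores < 1) {
    stop("parallel.cores should be == -1 or >= 1")
  }
  as.integer(parallel.cores)
}

resolveCores(2)
resolveCores(-1)
```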

cl

parallel cores to be passed to this function.

Details

This function can count the motifs in RNA or protein sequences.

The default motifs are selected or derived from tool "rpiCOOL" (Ref: [2]).

  • Motifs of RNA

    1. Fox1: UGCAUGU;

    2. Nova: UCAUUUCAC, UCAUUUCAU, CCAUUUCAC, CCAUUUCAU;

    3. Slm2: UAAAC, UAAAA, UAAUC, UAAUA;

    4. Fusip1: AAAGA, AAAGG, AGAGA, AGAGG, CAAGA, CAAGG, CGAGA, CGAGG;

    5. PTB: UUUUU, UUUCU, UCUUU, UCUCU;

    6. ARE: UAUUUAUU;

    7. hnRNPA1: UAGGGU, UAGGGA;

    8. PUM: UGUAAAUA, UGUAGAUA, UGUAUAUA, UGUACAUA;

    9. U1A: AUUGCAC;

    10. HuD: UUAUUU;

    11. QKI: AUUAAU, AUUAAC, ACUAAU, ACUAAC;

    12. U2B: AUUGCAG;

    13. SF1: UACUAAC;

    14. HuR: UUUAUUU, UUUGUUU, UUUCUUU, UUUUUUU;

    15. YB1: CCUGCG, UCUGCG;

    16. AU: AU;

    17. UG: UG.

      If "rpiCOOL", all default motifs will be counted, and there is no need to input other default motifs. "selected5" indicates the total number of the occurrences of: PUM, Fox-1, U1A, Nova, and ARE, which are regarded as the five most over-represented binding motifs.

  • Motifs of protein

    1. E: E;

    2. H: H;

    3. K: K;

    4. R: R;

    5. EE: EE;

    6. KK: KK;

    7. HR ("H_R"): H, R;

    8. HR ("HR_RH"): HR, RH;

    9. RS ("RS_SR"): RS, SR;

    10. RGG: RGG;

    11. YGG: YGG.

      If "rpiCOOL", default motifs of rpiCOOL ("E", "K", "H_R", "EE", "KK", "RS_SR", "RGG", and "YGG") will be counted.

There are some minor differences between this function and the extraction scheme of rpiCOOL. In this function, motifs are scanned directly along the whole sequence. In the extraction scheme of rpiCOOL, by contrast, some motifs ("UG", "AU", and "H_R") are scanned in a 10 nt/aa sliding window.
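Direct scanning can be illustrated with the base R sketch below. countPattern is a hypothetical helper, not the package's implementation, and it assumes that overlapping occurrences are each counted; the package may treat overlaps differently.

```r
# Hypothetical helper: count literal occurrences of one pattern
# in a sequence given as a vector of single characters.
countPattern <- function(seqChars, pattern) {
  s <- paste(seqChars, collapse = "")
  n <- nchar(s)
  k <- nchar(pattern)
  if (n < k) return(0L)
  # Compare every length-k substring with the pattern
  # (overlapping matches are counted separately).
  starts <- seq_len(n - k + 1)
  sum(substring(s, starts, starts + k - 1) == pattern)
}

rnaSeq <- strsplit("AUGUGA", split = "")[[1]]
countPattern(rnaSeq, "UG")
```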

New motif patterns are also supported. Users can pass new patterns to argument "newMotif" as a list. Format:

newMotif = list(*motif_name* = c("*motif_pattern_1*", "*motif_pattern_2*")).

For example: newMotif = list(HR_RH = c("HR", "RH"), RGG = "RGG"). "HR_RH" is the name of this motif which contains two patterns: "HR" and "RH".
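How a named list of patterns maps to one count per motif name can be sketched in base R as follows. countLiteral and proSeq are hypothetical names used for illustration, and summing the counts of all patterns under one motif name is an assumption based on how "HR_RH" is described above; this is not the package's own code.

```r
# The list format described above: one name per motif,
# one or more patterns per name.
newMotif <- list(HR_RH = c("HR", "RH"), RGG = "RGG")

# Hypothetical helper: count literal (possibly overlapping)
# occurrences of a pattern in a string.
countLiteral <- function(s, pattern) {
  n <- nchar(s)
  k <- nchar(pattern)
  if (n < k) return(0L)
  starts <- seq_len(n - k + 1)
  sum(substring(s, starts, starts + k - 1) == pattern)
}

# Illustrative protein sequence (not real data).
proSeq <- "MHRHRGGK"

# One total per motif name: sum over all patterns of that motif.
counts <- vapply(newMotif, function(patterns) {
  sum(vapply(patterns, function(p) countLiteral(proSeq, p), numeric(1)))
}, numeric(1))

counts
```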

Value

This function returns a data frame. Row names are sequence names, and column names are motif names.

References

[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420

[2] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8

[3] Pancaldi V, Bahler J. In silico characterization and prediction of global protein-mRNA interactions in yeast. Nucleic Acids Res. 2011; 39:5826-36

[4] Castello A, Fischer B, Eichelbaum K, et al. Insights into RNA Biology from an Atlas of Mammalian mRNA-Binding Proteins. Cell 2012; 149:1393-1406

[5] Ray D, Kazan H, Cook KB, et al. A compendium of RNA-binding motifs for decoding gene regulation. Nature 2013; 499:172-177

[6] Jiang P, Singh M, Coller HA. Computational assessment of the cooperativity between RNA binding proteins and MicroRNAs in Transcript Decay. PLoS Comput. Biol. 2013; 9:e1003075

See Also

featureMotifs

Examples

data(demoPositiveSeq)
seqsRNA <- demoPositiveSeq$RNA.positive
seqsPro <- demoPositiveSeq$Pro.positive

motifRNA1 <- computeMotifs(seqsRNA, seqType = "RNA", motifRNA = "rpiCOOL",
                           parallel.cores = 2)

motifRNA2 <- computeMotifs(seqsRNA, seqType = "RNA",
                           motifRNA = c("Fox1", "HuR", "ARE"), parallel.cores = 2)

motifPro1 <- computeMotifs(seqsPro, seqType = "Pro",
                           motifPro = c("rpiCOOL", "HR_RH"), parallel.cores = 2)

# Customized motifs are also supported and can be extracted with default motifs.
# Pass new motif patterns to "newMotif" argument as a list:

motifPro2 <- computeMotifs(seqsPro, seqType = "Pro", motifPro = c("E", "K", "KK"),
                           newMotif = list(HR_RH = c("HR", "RH"), RGG = "RGG"),
                           parallel.cores = 2)

motifPro3 <- computeMotifs(seqsPro, seqType = "Pro", motifPro = c("rpiCOOL"),
                           newMotif = list(HR_RH = c("HR", "RH"), RGG = "RGG"),
                           parallel.cores = 2)

# Set "newMotifOnly = TRUE" to count customized motifs only:

motifPro4 <- computeMotifs(seqsPro, seqType = "Pro",
                           newMotif = list(HR_RH = c("HR", "RH"), RGG = "RGG"),
                           newMotifOnly = TRUE, parallel.cores = 2)

HAN-Siyu/ncProR documentation built on Nov. 3, 2023, 12:08 a.m.