
featureFreq

R Documentation

Extraction of the K-Mer Features of RNA and Protein Sequences

Description

A wrapper for the computeFreq function. It calculates the k-mer frequencies of RNA and protein sequences in a single call and formats the results as a dataset that can be used to build a classifier.
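To make the term "k-mer frequency" concrete, the following is a minimal base R sketch of counting overlapping k-mers in a single sequence. It is not the implementation used by computeFreq: the helper name `kmerFreq`, its `alphabet` argument, and the choice to divide counts by the number of windows are assumptions made for illustration only.

```r
# Hypothetical helper, not part of the package: relative frequencies of
# overlapping k-mers in one sequence given as a vector of single characters.
kmerFreq <- function(chars, k, alphabet) {
  chars <- tolower(chars)  # RNA sequences are handled in lower case
  # Enumerate every possible k-mer over the alphabet in a fixed order
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  allKmers <- sort(do.call(paste0, grid))
  n <- length(chars) - k + 1  # number of windows of width k
  if (n < 1) {
    return(setNames(numeric(length(allKmers)), allKmers))
  }
  # Slide a window of width k along the sequence, one position at a time
  observed <- vapply(
    seq_len(n),
    function(i) paste(chars[i:(i + k - 1)], collapse = ""),
    character(1)
  )
  counts <- table(factor(observed, levels = allKmers))
  setNames(as.numeric(counts) / n, allKmers)
}

rna <- c("a", "c", "g", "u", "a", "c")
freq <- kmerFreq(rna, k = 2, alphabet = c("a", "c", "g", "u"))
freq[freq > 0]
```

With a four-letter RNA alphabet and k = 2 this yields 16 features per sequence, most of them zero for a short input.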

Usage

featureFreq(
  seqRNA,
  seqPro,
  label = NULL,
  featureMode = c("concatenate", "combine"),
  computePro = c("RPISeq", "DeNovo", "rpiCOOL"),
  k.Pro = 3,
  k.RNA = 4,
  EDP = FALSE,
  normalize = c("none", "row", "column"),
  normData = NULL,
  parallel.cores = 2,
  cl = NULL
)

Arguments

`seqRNA`	RNA sequences loaded by the function `read.fasta` from the `seqinr` package, or a list of RNA sequences. RNA sequences will be converted into lower case letters. Each sequence should be a vector of single characters.
`seqPro`	protein sequences loaded by the function `read.fasta` from the `seqinr` package, or a list of protein sequences. Protein sequences will be converted into upper case letters. Each sequence should be a vector of single characters.
`label`	optional. A string or a vector of strings or `NULL`. Indicates the class of the samples such as "Interact", "Non.Interact". Default: `NULL`.
`featureMode`	a string that can be `"concatenate"` or `"combine"`. If `"concatenate"`, the k-mer features of RNA and proteins will be simply concatenated. If `"combine"`, the returned dataset will be formed by combining the k-mer features of RNA and proteins. See details below. Default: `"concatenate"`.
`computePro`	a string that specifies the computation mode of protein sequences: `"RPISeq"`, `"DeNovo"`, or `"rpiCOOL"`. The three modes use three different amino acid residue classifications, corresponding to the methods "RPISeq" [3], "De Novo prediction" [4], and "rpiCOOL" [5]. See details below. Default: `"RPISeq"`.
`k.Pro`	an integer that indicates the sliding window step (the k of the k-mers) for protein sequences. Default: `3`.
`k.RNA`	an integer that indicates the sliding window step (the k of the k-mers) for RNA sequences. Default: `4`.
`EDP`	logical. If `TRUE`, entropy density profile (EDP) will be computed. Default: `FALSE`.
`normalize`	can be `"none"`, `"row"` or `"column"`. Indicates whether the frequencies should be normalized and, if so, whether the features are normalized by row (each sequence) or by column (each feature). See details below. Default: `"none"`.
`normData`	the normalization data generated by this function. If the input dataset is a training set, or the normalization strategy is `"none"` or `"row"`, leave `normData = NULL`. To build a test set with the `"column"` strategy, pass the normalization data of the corresponding training set (generated by this function) to this argument. See examples. Default: `NULL`.
`parallel.cores`	an integer specifying the number of cores for parallel computation. Must be `-1` or `>= 1`; set `parallel.cores = -1` to run with all the cores. Default: `2`.
`cl`	optional. Parallel cores to be passed to this function. Default: `NULL`.
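The interplay between `normalize = "column"` and `normData` can be sketched in base R. The snippet below assumes min-max scaling per column, which this page does not confirm. The helper `scaleColumns` and the toy data frames are hypothetical. Only the shape of the return value (a list with `feature` and `normData`) follows the Value section. The point is that a test set is scaled with the statistics of the training set, not with its own.

```r
# Hypothetical helper: min-max scale each column (an assumed normalization).
# If normData is supplied, its stored ranges are reused instead of recomputed.
scaleColumns <- function(x, normData = NULL) {
  if (is.null(normData)) {
    normData <- list(
      min = vapply(x, min, numeric(1)),
      max = vapply(x, max, numeric(1))
    )
  }
  span <- normData$max - normData$min
  span[span == 0] <- 1  # avoid division by zero for constant columns
  centered <- sweep(as.matrix(x), 2, normData$min, "-")
  scaled <- sweep(centered, 2, span, "/")
  list(feature = as.data.frame(scaled), normData = normData)
}

# Toy frequency tables (made-up values, two features each)
train <- data.frame(aa = c(0.1, 0.3, 0.5), ac = c(0.2, 0.2, 0.6))
test  <- data.frame(aa = c(0.2, 0.4),      ac = c(0.4, 0.8))

trainOut <- scaleColumns(train)                              # training set
testOut  <- scaleColumns(test, normData = trainOut$normData) # test set
```

Because the test set reuses the training ranges, its scaled values can fall outside [0, 1], as the second `ac` value does here.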

Details

See computeFreq.

Value

If normalize = "none" or normalize = "row", this function returns a data frame. Row names are sequence names, and column names are polymer names. The names of the RNA and protein sequences are separated with ".", i.e. the row name format is "RNASequenceName.proteinSequenceName" (e.g. "YDL227C.YOR198C"). If featureMode = "combine", the polymers of the RNA and protein sequences are also separated with ".", i.e. the column name format is "RNAPolymerName.proteinPolymerName" (e.g. "aa.CCA").

If normalize = "column", the function will return a list containing features (a data frame named "feature") and normalization values (a list named "normData") for extracting features for test sets.
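The naming scheme described above can be reproduced with a few lines of base R. This sketch only illustrates how such names are composed. The protein polymer label "CCB" is made up, and the column order the function actually produces is not stated here, so the order below is an assumption.

```r
# Row name: RNA and protein sequence names joined with "."
rnaName <- "YDL227C"
proName <- "YOR198C"
rowName <- paste(rnaName, proName, sep = ".")

# Column names for featureMode = "combine": every RNA polymer paired with
# every protein polymer, again joined with "."
rnaKmers <- c("aa", "ac")    # two example RNA 2-mers
proKmers <- c("CCA", "CCB")  # "CCB" is a hypothetical label
combinedCols <- as.vector(outer(rnaKmers, proKmers, paste, sep = "."))

rowName
combinedCols
```

Under this pairing the number of combined columns is the product of the RNA and protein feature counts, which grows quickly as k.RNA and k.Pro increase.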

References

[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420

[2] Shen J, Zhang J, Luo X, et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. U. S. A. 2007; 104:4337-41

[3] Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489

[4] Wang Y, Chen X, Liu Z-P, et al. De novo prediction of RNA-protein interactions from sequence information. Mol. BioSyst. 2013; 9:133-142

[5] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8

Examples

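library(ncProR)  # assumed package name, taken from the repository HAN-Siyu/ncProR

# Load the demo sequences:
data(demoPositiveSeq)
seqsRNA <- demoPositiveSeq$RNA.positive
seqsPro <- demoPositiveSeq$Pro.positive

# Labelled dataset with combined features and normalization on row:

dataset1 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, label = "Interact",
                        featureMode = "comb", computePro = "DeNovo", k.Pro = 3,
                        k.RNA = 2, normalize = "row", parallel.cores = 2)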

# Training set with normalization on column:

dataset2 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, featureMode = "conc",
                        computePro = "rpiCOOL", k.Pro = 3, k.RNA = 4,
                        normalize = "column", parallel.cores = 2)

# To build a test set with normalization on column, the "normData" of the
# corresponding training set (generated by this function) is required:

dataset3 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, featureMode = "conc",
                        computePro = "rpiCOOL", k.Pro = 3, k.RNA = 4,
                        normalize = "column", normData = dataset2$normData,
                        parallel.cores = 2)

HAN-Siyu/ncProR documentation built on Nov. 3, 2023, 12:08 a.m.