featureFreq: Extraction of the _K_-Mer Features of RNA and Protein...

View source: R/Sequence.R

featureFreq    R Documentation

Extraction of the K-Mer Features of RNA and Protein Sequences

Description

A wrapper for the computeFreq function. It calculates the k-mer frequencies of RNA and protein sequences in a single call and formats the results as a dataset that can be used to build a classifier.

Usage

featureFreq(
  seqRNA,
  seqPro,
  label = NULL,
  featureMode = c("concatenate", "combine"),
  computePro = c("RPISeq", "DeNovo", "rpiCOOL"),
  k.Pro = 3,
  k.RNA = 4,
  EDP = FALSE,
  normalize = c("none", "row", "column"),
  normData = NULL,
  parallel.cores = 2,
  cl = NULL
)

Arguments

seqRNA

RNA sequences loaded by the function read.fasta from the seqinr package, or a list of RNA sequences. Each sequence should be a vector of single characters. RNA sequences will be converted to lower case.

seqPro

protein sequences loaded by the function read.fasta from the seqinr package, or a list of protein sequences. Each sequence should be a vector of single characters. Protein sequences will be converted to upper case.

label

optional. A string, a vector of strings, or NULL. Indicates the class of the samples, such as "Interact" or "Non.Interact". Default: NULL.

featureMode

a string that can be "concatenate" or "combine". If "concatenate", the k-mer features of RNA and proteins will be simply concatenated. If "combine", the returned dataset will be formed by combining the k-mer features of RNA and proteins. See details below. Default: "concatenate".
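The difference between the two modes can be illustrated with a small base R sketch. It does not call the package: the frequency values and the protein k-mer name "CCB" are made up, and treating the combined value as the product of the two frequencies is an assumption, since this page only documents the "RNAPolymerName.proteinPolymerName" naming.

```r
# Toy k-mer frequencies for one RNA-protein pair (values are made up).
rnaFreq <- c(aa = 0.5, ac = 0.3, ca = 0.2)
proFreq <- c(CCA = 0.6, CCB = 0.4)

# "concatenate": RNA features followed by protein features.
concatenated <- c(rnaFreq, proFreq)

# "combine": one feature per (RNA k-mer, protein k-mer) pair.
# ASSUMPTION: the pair value is the product of the two frequencies.
combined <- as.vector(outer(rnaFreq, proFreq))
names(combined) <- as.vector(outer(names(rnaFreq), names(proFreq),
                                   paste, sep = "."))

length(concatenated)  # number of RNA features + number of protein features
length(combined)      # number of RNA features * number of protein features
```

Under this reading, "combine" grows the feature count multiplicatively, so large values of k.RNA and k.Pro may produce very wide datasets.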

computePro

a string that specifies the computation mode of protein sequences: "RPISeq", "DeNovo", or "rpiCOOL". The three modes correspond to three different amino acid residue classifications, used by the methods "RPISeq", "De Novo prediction", and "rpiCOOL" respectively. See details below. Default: "RPISeq".
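As an illustration of what an amino acid residue classification looks like, the sketch below encodes a protein with the seven-class grouping of Shen et al. (reference [2]), on which RPISeq (reference [3]) builds. It is written in base R and is not the package's code: the class labels "1" to "7" and the helper name encodeProtein are choices made for this sketch, and the groupings used by the "DeNovo" and "rpiCOOL" modes are not reproduced here.

```r
# Seven-class grouping of the 20 amino acids (Shen et al., reference [2]).
# The labels "1".."7" are chosen for this sketch only.
aaClasses <- list(
  "1" = c("A", "G", "V"),
  "2" = c("I", "L", "F", "P"),
  "3" = c("Y", "M", "T", "S"),
  "4" = c("H", "N", "Q", "W"),
  "5" = c("R", "K"),
  "6" = c("D", "E"),
  "7" = "C"
)

# Lookup table: amino acid letter -> class label.
aaToClass <- setNames(rep(names(aaClasses), lengths(aaClasses)),
                      unlist(aaClasses, use.names = FALSE))

# Hypothetical helper: map a vector of single residues to class labels.
# Residues outside the 20 standard letters become NA.
encodeProtein <- function(seqPro) {
  unname(aaToClass[toupper(seqPro)])
}

encoded <- encodeProtein(c("M", "k", "V", "C"))

# With 7 classes, k.Pro = 3 yields 7^3 possible class triads.
nTriads <- length(aaClasses)^3
```

Grouping residues before counting keeps the protein feature space small: 343 class triads instead of 8000 raw amino acid triads.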

k.Pro

an integer that indicates the sliding window step of protein sequences. Default: 3.

k.RNA

an integer that indicates the sliding window step of RNA sequences. Default: 4.
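The k-mer counting that k.Pro and k.RNA control can be sketched in base R as follows. This is a minimal illustration rather than the package's implementation: countKmers is a hypothetical helper, k is taken to be the k-mer length, and the window is assumed to advance one residue at a time.

```r
# Hypothetical helper: count overlapping k-mers in a sequence given as a
# vector of single characters.
# ASSUMPTION: the window moves one position at a time.
countKmers <- function(seqChars, k) {
  n <- length(seqChars)
  if (n < k) return(table(character(0)))  # sequence too short for any k-mer
  starts <- seq_len(n - k + 1)
  kmers <- vapply(starts,
                  function(i) paste(seqChars[i:(i + k - 1)], collapse = ""),
                  character(1))
  table(kmers)
}

rna <- c("a", "c", "g", "a", "c", "g", "u")
counts <- countKmers(rna, k = 2)

# Relative frequencies over the k-mers observed in this sequence.
freqs <- counts / sum(counts)
```

The sketch only reports k-mers that occur in the input; a feature table for a classifier would normally also include the k-mers that never occur, with a count of zero.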

EDP

logical. If TRUE, entropy density profile (EDP) will be computed. Default: FALSE.

normalize

a string that can be "none", "row", or "column". Indicates whether the frequencies should be normalized and, if so, whether by row (each sequence) or by column (each feature). See details below. Default: "none".
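The difference between row-wise and column-wise normalization can be shown on a toy feature matrix. The sketch assumes min-max scaling, which this page does not confirm (the actual formula is described in the computeFreq details), and minMax is a hypothetical helper.

```r
# Hypothetical helper: min-max scaling of a numeric vector.
# ASSUMPTION: the package's normalization may use a different formula.
minMax <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant vector: avoid 0/0
  (x - rng[1]) / (rng[2] - rng[1])
}

# Two sequence pairs (rows) and three features (columns); values are made up.
feat <- matrix(c(1, 2, 3,
                 2, 6, 4),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("pair1", "pair2"), c("f1", "f2", "f3")))

# "row": each sequence is scaled using only its own values.
byRow <- t(apply(feat, 1, minMax))

# "column": each feature is scaled using the values of all sequences.
byCol <- apply(feat, 2, minMax)
```

Row-wise scaling needs nothing beyond the sequence itself, whereas column-wise scaling depends on the whole dataset, which is why only the "column" strategy produces normalization data to be reused later.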

normData

the normalization data generated by this function. If the input dataset is a training set, or the normalization strategy is "none" or "row", leave normData = NULL. To build a test set with normalize = "column", pass the normalization data of the corresponding training set (generated by this function) to this argument. See examples. Default: NULL.
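Why the training set's normalization data must be reused for a test set can be sketched in base R. The structure below (per-column minimum and maximum) is a hypothetical stand-in for normData, whose real contents are not described on this page, and scaleWith is a helper invented for the illustration.

```r
# Toy training and test features (values are made up).
train <- data.frame(f1 = c(1, 3, 5), f2 = c(10, 20, 40))
test  <- data.frame(f1 = c(2, 6),    f2 = c(25, 10))

# ASSUMPTION: a stand-in for normData holding per-column min and max
# computed on the training set only.
normSketch <- list(min = sapply(train, min), max = sapply(train, max))

# Hypothetical helper: scale every column with the stored statistics.
scaleWith <- function(df, norm) {
  as.data.frame(Map(function(col, lo, hi) (col - lo) / (hi - lo),
                    df, norm$min, norm$max))
}

trainScaled <- scaleWith(train, normSketch)
testScaled  <- scaleWith(test, normSketch)   # same statistics as training
```

Scaling the test set with its own statistics would place the two datasets on different scales. With the training statistics, test values can fall outside [0, 1] (here f1 = 6 maps to 1.25), which is expected.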

parallel.cores

an integer specifying the number of cores for parallel computation. Default: 2. Set parallel.cores = -1 to run with all the cores. parallel.cores should be == -1 or >= 1.
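The rule for parallel.cores can be expressed as a small validation sketch. resolveCores is a hypothetical helper, not part of the package; it uses detectCores from the base parallel package and falls back to one core if the count cannot be determined.

```r
# Hypothetical helper mirroring the documented rule:
# -1 means all cores, otherwise the value must be >= 1.
resolveCores <- function(parallel.cores = 2) {
  if (parallel.cores == -1) {
    n <- parallel::detectCores()
    if (is.na(n)) n <- 1L   # detectCores() can return NA on some systems
    return(as.integer(n))
  }
  if (parallel.cores < 1) {
    stop("parallel.cores should be == -1 or >= 1")
  }
  as.integer(parallel.cores)
}

resolveCores(2)    # use two cores
resolveCores(-1)   # use every core the machine reports
```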

cl

parallel cores to be passed to this function.

Details

See computeFreq.

Value

If normalize = "none" or normalize = "row", this function returns a data frame. Row names are sequence names, and column names are polymer names. The names of the RNA and protein sequences are separated with ".", i.e. the row name format is "RNASequenceName.proteinSequenceName" (e.g. "YDL227C.YOR198C"). If featureMode = "combine", the polymers of RNA and protein sequences are also separated with ".", i.e. the column name format is "RNAPolymerName.proteinPolymerName" (e.g. "aa.CCA").

If normalize = "column", the function returns a list containing the features (a data frame named "feature") and the normalization values (a list named "normData"), which are needed when extracting features for test sets.

References

[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420

[2] Shen J, Zhang J, Luo X, et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. U. S. A. 2007; 104:4337-41

[3] Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489

[4] Wang Y, Chen X, Liu Z-P, et al. De novo prediction of RNA-protein interactions from sequence information. Mol. BioSyst. 2013; 9:133-142

[5] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8

See Also

computeFreq

Examples

data(demoPositiveSeq)
seqsRNA <- demoPositiveSeq$RNA.positive
seqsPro <- demoPositiveSeq$Pro.positive

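# Labeled dataset with combined features and normalization on row
# (argument values such as "comb" are matched partially):

dataset1 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, label = "Interact",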
                        featureMode = "comb", computePro = "DeNovo", k.Pro = 3,
                        k.RNA = 2, normalize = "row", parallel.cores = 2)

# Training set with normalization on column
# (returns a list with "feature" and "normData"):

dataset2 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, featureMode = "conc",
                        computePro = "rpiCOOL", k.Pro = 3, k.RNA = 4,
                        normalize = "column", parallel.cores = 2)

# To build a test set with normalization on column, pass the "normData" of the
# corresponding training set (generated by this function):

dataset3 <- featureFreq(seqRNA = seqsRNA, seqPro = seqsPro, featureMode = "conc",
                        computePro = "rpiCOOL", k.Pro = 3, k.RNA = 4,
                        normalize = "column", normData = dataset2$normData,
                        parallel.cores = 2)


HAN-Siyu/ncProR documentation built on Nov. 3, 2023, 12:08 a.m.