computeFreq: Computation of the K-Mer Frequencies of RNA or Protein...

View source: R/Sequence.R

computeFreq    R Documentation

Computation of the K-Mer Frequencies of RNA or Protein Sequences

Description

This function calculates the k-mer frequencies of RNA or protein sequences. For protein sequences, three amino acid classification schemes are available.

Usage

computeFreq(
  seqs,
  seqType = c("RNA", "Pro"),
  computePro = c("RPISeq", "DeNovo", "rpiCOOL"),
  k = 3,
  EDP = FALSE,
  normalize = c("none", "row", "column"),
  normData = NULL,
  parallel.cores = 2,
  cl = NULL
)

Arguments

seqs

sequences loaded by the function read.fasta from the seqinr package, or a list of RNA/protein sequences. RNA sequences will be converted to lower case and protein sequences to upper case. Each sequence should be a vector of single characters.

seqType

a string that specifies the type of the sequences: "RNA" or "Pro" (protein). If the input is a DNA sequence and seqType = "RNA", the DNA sequence will be converted to an RNA sequence automatically. Default: "RNA".

computePro

a string that specifies the computation mode for protein sequences: "RPISeq", "DeNovo", or "rpiCOOL". Ignored when seqType = "RNA". The three modes use three different amino acid residue classifications, which correspond to the methods "RPISeq", "De Novo prediction", and "rpiCOOL". See details below. Default: "RPISeq".

k

an integer that indicates the sliding window step. Default: 3.

EDP

logical. If TRUE, entropy density profile (EDP) will be computed. Default: FALSE.

normalize

can be "none", "row" or "column". Indicates whether the frequencies should be normalized and, if so, whether the features are normalized by row (each sequence) or by column (each feature). See details below. Default: "none".

normData

the normalization data generated by this function. If the input dataset is a training set, or the normalization strategy is "none" or "row", leave normData = NULL. To build a test set with the normalization strategy "column", pass the normalization data of the corresponding training set (generated by this function) to this argument. See examples.

parallel.cores

an integer specifying the number of cores for parallel computation. Default: 2. Set parallel.cores = -1 to run with all available cores. parallel.cores must be either -1 or >= 1.

cl

parallel cores to be passed to this function.

Details

The function computeFreq calculates the k-mer frequencies of RNA/protein sequences. The three computation modes for protein frequencies are:

RPISeq: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, {C} (Ref: [3]);

DeNovo: {D, E}, {H, R, K}, {C, G, N, Q, S, T, Y}, {A, F, I, L, M, P, V, W} (Ref: [4]);

rpiCOOL: {A, E}, {I, L, F, M, V}, {N, D, T, S}, {G}, {P}, {R, K, Q, H}, {Y, W}, {C} (Ref: [5]).
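The reduced-alphabet idea behind these modes can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names (rpiseq_groups, residue_to_class, count_kmers) and the numeric class labels are hypothetical, residues outside the listed groups are simply dropped, and the result is reported as relative frequencies of overlapping k-mers. Only the RPISeq grouping is taken from the list above.

```r
# Hypothetical sketch: map residues to the 7 RPISeq classes listed above,
# then count overlapping k-mers over the reduced alphabet.
rpiseq_groups <- list(
  "1" = c("A", "G", "V"),
  "2" = c("I", "L", "F", "P"),
  "3" = c("Y", "M", "T", "S"),
  "4" = c("H", "N", "Q", "W"),
  "5" = c("R", "K"),
  "6" = c("D", "E"),
  "7" = c("C")
)

# Lookup vector: names are residues, values are class labels (labels are arbitrary).
residue_to_class <- unlist(lapply(names(rpiseq_groups), function(label) {
  setNames(rep(label, length(rpiseq_groups[[label]])), rpiseq_groups[[label]])
}))

count_kmers <- function(residues, k = 3, lookup = residue_to_class) {
  classes <- unname(lookup[toupper(residues)])
  classes <- classes[!is.na(classes)]   # assumption: unknown residues are dropped

  # Enumerate every possible k-mer so that all sequences share the same columns.
  alphabet <- sort(unique(unname(lookup)))
  all_kmers <- sort(do.call(paste0, expand.grid(rep(list(alphabet), k),
                                                stringsAsFactors = FALSE)))
  freq <- setNames(numeric(length(all_kmers)), all_kmers)

  n <- length(classes) - k + 1          # number of overlapping windows
  if (n < 1) return(freq)

  kmers <- vapply(seq_len(n),
                  function(i) paste(classes[i:(i + k - 1)], collapse = ""),
                  character(1))
  counts <- table(factor(kmers, levels = all_kmers))
  freq[] <- as.numeric(counts) / n      # assumption: relative frequencies
  freq
}

protein <- c("A", "G", "V", "D", "E")   # a vector of single characters
freq <- count_kmers(protein, k = 3)
freq[freq > 0]
```

With 7 classes and k = 3, the feature vector has 7^3 = 343 entries; the DeNovo (4 classes) and rpiCOOL (8 classes) groupings would give 64 and 512 entries under the same scheme.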

If EDP = TRUE, the entropy density profile (EDP) will be computed with the equation: s_i = -1/H * c_i * log2(c_i), where H = -sum(c_j * log2(c_j)). Here c denotes the k-mer frequencies, and i and j are the indices of the k-mer frequencies. (Ref: [6])
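As a worked illustration of the EDP equation, the following base R sketch applies it to a single frequency vector. The function name edp is hypothetical, and two conventions are assumptions rather than documented behaviour: zero frequencies contribute 0 (treating 0 * log2(0) as 0), and a vector with zero entropy returns all zeros.

```r
# Hypothetical sketch of s_i = -1/H * c_i * log2(c_i), H = -sum(c_j * log2(c_j)).
edp <- function(freq) {
  s <- setNames(numeric(length(freq)), names(freq))
  positive <- freq > 0                   # assumption: 0 * log2(0) is treated as 0
  p <- freq[positive]
  H <- -sum(p * log2(p))                 # Shannon entropy of the profile
  if (H == 0) return(s)                  # assumption: avoid 0/0 for degenerate input
  s[positive] <- -(1 / H) * p * log2(p)
  s
}

freq <- c(aaa = 0.5, aac = 0.25, aag = 0.25, aat = 0)
s <- edp(freq)
s
```

For this input H = 1.5, so each of the three non-zero k-mers receives 1/3 and the profile sums to 1.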

The function also provides two normalization strategies: by row (each sequence) or by column (each feature). If by row, the dataset will be processed with equation (Ref: [2]): d_i = (f_i - min{f_1, f_2, ...}) / max{f_1, f_2, ...}. f_1, f_2, ..., f_i are the original values of each row.

If by column, the dataset will be processed with: d_i = (f_i - min{f_1, f_2, ...}) / (max{f_1, f_2, ...} - min{f_1, f_2, ...}).

In [2], normalization is computed by row (each sequence).
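The two strategies, and the reason a test set needs the training set's normalization data in the "column" case, can be sketched in base R as follows. The function names and the min/max list layout are hypothetical; the actual structure of normData returned by computeFreq is not documented here and may differ.

```r
# Hypothetical sketch of the two normalization formulas above.

# By row: d_i = (f_i - min) / max, computed within each sequence.
normalize_row <- function(x) {
  x <- as.matrix(x)
  row_min <- apply(x, 1, min)
  row_max <- apply(x, 1, max)
  (x - row_min) / row_max                # vectors recycle down columns, i.e. per row
}

# By column: learn min and max of each feature from the training set...
fit_column_norm <- function(x) {
  x <- as.matrix(x)
  list(min = apply(x, 2, min), max = apply(x, 2, max))
}

# ...and apply d_i = (f_i - min) / (max - min) with those stored values.
apply_column_norm <- function(x, norm) {
  x <- as.matrix(x)
  centered <- sweep(x, 2, norm$min, "-")
  sweep(centered, 2, norm$max - norm$min, "/")
}

train <- matrix(c(1, 2, 3, 10, 20, 40), nrow = 3,
                dimnames = list(c("s1", "s2", "s3"), c("f1", "f2")))
test  <- matrix(c(2, 25), nrow = 1, dimnames = list("t1", c("f1", "f2")))

row_scaled   <- normalize_row(train)
norm         <- fit_column_norm(train)       # plays the role of normData
train_scaled <- apply_column_norm(train, norm)
test_scaled  <- apply_column_norm(test, norm) # test set reuses training min/max
```

Row normalization depends only on the sequence itself, so no stored data is needed. Column normalization depends on the whole dataset, which is why the training set's values must be reused for the test set. Note that a row whose maximum is 0, or a column whose maximum equals its minimum, would produce a division by zero in this sketch.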

Value

If normalize = "none" or normalize = "row", this function will return a data frame. Row names are sequence names, and column names are polymer names.

If normalize = "column", the function will return a list containing features (a data frame named "feature") and normalization values (a list named "normData") for extracting features for test sets.

References

[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420

[2] Shen J, Zhang J, Luo X, et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. U. S. A. 2007; 104:4337-41

[3] Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489

[4] Wang Y, Chen X, Liu Z-P, et al. De novo prediction of RNA-protein interactions from sequence information. Mol. BioSyst. 2013; 9:133-142

[5] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8

[6] Yang C, Yang L, Zhou M, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018; 34(22):3825-3834.

See Also

featureFreq

Examples


# Use "read.fasta" function of package "seqinr" to read a FASTA file:

seqs1 <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")
seqFreq1 <- computeFreq(seqs1, seqType = "RNA", k = 4, normalize = "row",
                        parallel.cores = 2)

data(demoPositiveSeq)
seqs2 <- demoPositiveSeq$Pro.positive

# Training set with normalization on column:

seqFreq2 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
                        normalize = "column", parallel.cores = 2)

# To build a test set with normalization on column,
# "normData" of the corresponding training set (generated by this function) is required:

seqFreq3 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
                        normalize = "column", normData = seqFreq2$normData,
                        parallel.cores = 2)

# Without normalization:

seqFreq4 <- computeFreq(seqs2, seqType = "Pro", computePro = "DeNovo", k = 3,
                        normalize = "none", parallel.cores = 2)


HAN-Siyu/ncProR documentation built on Nov. 3, 2023, 12:08 a.m.