In HAN-Siyu/ncProR: Predicting Long non-coding RNA-Protein Interaction

computeFreq

R Documentation

Computation of the K-Mer Frequencies of RNA or Protein Sequences

Description

This function calculates the k-mer frequencies of RNA or protein sequences. For protein sequences, three amino acid residue classifications are available.

Usage

computeFreq(
  seqs,
  seqType = c("RNA", "Pro"),
  computePro = c("RPISeq", "DeNovo", "rpiCOOL"),
  k = 3,
  EDP = FALSE,
  normalize = c("none", "row", "column"),
  normData = NULL,
  parallel.cores = 2,
  cl = NULL
)

Arguments

`seqs`	sequences loaded by the function `read.fasta` from the `seqinr` package, or a list of RNA/protein sequences. RNA sequences are converted to lower case letters, and protein sequences are converted to upper case letters. Each sequence should be a vector of single characters.
`seqType`	a string that specifies the nature of the sequence: `"RNA"` or `"Pro"` (protein). If the input is a DNA sequence and `seqType = "RNA"`, the DNA sequence is converted to an RNA sequence automatically. Default: `"RNA"`.
`computePro`	a string that specifies the computation mode for protein sequences: `"RPISeq"`, `"DeNovo"`, or `"rpiCOOL"`. Ignored when `seqType = "RNA"`. The three modes use three different classifications of amino acid residues, corresponding to the methods "RPISeq", "De Novo prediction", and "rpiCOOL". See details below. Default: `"RPISeq"`.
`k`	an integer that specifies the length of the k-mers, i.e. the size of the sliding window. Default: `3`.
`EDP`	logical. If `TRUE`, entropy density profile (EDP) will be computed. Default: `FALSE`.
`normalize`	a string that specifies whether and how the frequencies are normalized: `"none"` (no normalization), `"row"` (normalize within each sequence), or `"column"` (normalize within each feature). See details below. Default: `"none"`.
`normData`	the normalization data generated by this function. If the input dataset is a training set, or the normalization strategy is `"none"` or `"row"`, leave `normData = NULL`. To build a test set with the `"column"` strategy, pass the normalization data that this function generated for the corresponding training set. See examples.
`parallel.cores`	an integer specifying the number of cores for parallel computation. Must be `-1` or an integer `>= 1`; set `parallel.cores = -1` to run with all available cores. Default: `2`.
`cl`	an existing set of parallel cores (cluster) to be passed to this function. Default: `NULL`.

Details

The function computeFreq calculates the k-mer frequencies of RNA/protein sequences. The three computation modes for protein frequencies use the following residue classifications:

RPISeq: {A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, {C} (Ref: [3]).

DeNovo: {D, E}, {H, R, K}, {C, G, N, Q, S, T, Y}, {A, F, I, L, M, P, V, W} (Ref: [4]).

rpiCOOL: {A, E}, {I, L, F, M, V}, {N, D, T, S}, {G}, {P}, {R, K, Q, H}, {Y, W}, {C} (Ref: [5]).
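As a rough illustration of how such a classification can be turned into k-mer frequencies, the sketch below maps each residue to its "DeNovo" group and counts overlapping k-mers over the reduced alphabet. The names `deNovoGroups`, `groupLookup` and `kmerFreqSketch` are hypothetical, the group labels "1" to "4" are arbitrary, and dividing counts by the number of windows is an assumption; the package's own implementation may differ.

```r
# Hypothetical helper names; none of these are part of the package.
# Residue groups for the "DeNovo" mode, as listed above (labels are arbitrary).
deNovoGroups <- list(
  "1" = c("D", "E"),
  "2" = c("H", "R", "K"),
  "3" = c("C", "G", "N", "Q", "S", "T", "Y"),
  "4" = c("A", "F", "I", "L", "M", "P", "V", "W")
)

# Lookup vector: names are residues, values are group labels.
groupLookup <- unlist(lapply(names(deNovoGroups), function(g) {
  setNames(rep(g, length(deNovoGroups[[g]])), deNovoGroups[[g]])
}))

# Count overlapping k-mers (the window moves one position at a time)
# over the reduced alphabet and convert the counts to frequencies.
kmerFreqSketch <- function(residues, lookup, k = 3) {
  reduced <- unname(lookup[toupper(residues)])
  reduced <- reduced[!is.na(reduced)]  # assumption: drop residues outside the groups
  alphabet <- sort(unique(unname(lookup)))
  # All possible k-mers over the reduced alphabet, in a fixed order.
  grid <- expand.grid(rep(list(alphabet), k), stringsAsFactors = FALSE)
  allKmers <- apply(grid, 1, paste, collapse = "")
  n <- length(reduced) - k + 1
  if (n < 1) return(setNames(numeric(length(allKmers)), allKmers))
  kmers <- vapply(seq_len(n), function(i) {
    paste(reduced[i:(i + k - 1)], collapse = "")
  }, character(1))
  counts <- table(factor(kmers, levels = allKmers))
  setNames(as.numeric(counts) / n, allKmers)
}

protein <- c("M", "K", "D", "E", "A", "L", "R")
freq <- kmerFreqSketch(protein, groupLookup, k = 3)
```

With four groups and k = 3 the result has 4^3 = 64 entries, one per possible reduced 3-mer, so every sequence yields a feature vector of the same length.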

If EDP = TRUE, the entropy density profile (EDP) is computed with the equation: s_i = -1/H * c_i * log2(c_i), where H = -sum(c_j * log2(c_j)). Here c denotes the k-mer frequencies, and i and j are the indices of the k-mers. (Ref: [6])
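The EDP equation can be sketched in a few lines of base R. The function name `edpSketch` is hypothetical, and two conventions used here are assumptions rather than documented behaviour: a zero frequency contributes 0 (treating 0 * log2(0) as 0), and a profile with zero entropy is mapped to all zeros.

```r
# Hypothetical helper; a sketch of s_i = -1/H * c_i * log2(c_i).
edpSketch <- function(freq) {
  # Assumption: treat 0 * log2(0) as 0 so absent k-mers contribute nothing.
  term <- ifelse(freq > 0, freq * log2(freq), 0)
  H <- -sum(term)  # H = -sum(c_j * log2(c_j))
  if (H == 0) {
    # Assumption: a degenerate profile (single k-mer) maps to zeros.
    out <- rep(0, length(freq))
  } else {
    out <- -term / H
  }
  names(out) <- names(freq)
  out
}

freq <- c(AA = 0.5, AC = 0.25, CA = 0.25, CC = 0)
edp <- edpSketch(freq)
```

Because every s_i is divided by H, the values of a non-degenerate profile sum to 1.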

The function also provides two normalization strategies: by row (each sequence) or by column (each feature). If by row, each row of the dataset is processed with the equation (Ref: [2]): d_i = (f_i - min{f_1, f_2, ...}) / max{f_1, f_2, ...}, where f_1, f_2, ..., f_i are the original values of that row.

If by column, each column of the dataset is processed with: d_i = (f_i - min{f_1, f_2, ...}) / (max{f_1, f_2, ...} - min{f_1, f_2, ...}), where f_1, f_2, ..., f_i are the original values of that column.

In [2], normalization is computed by row (each sequence).
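The two strategies can be sketched as follows. The helper names `normalizeRow` and `normalizeColumn` are hypothetical, and storing the per-column minimum and maximum in a list with elements `min` and `max` is an assumption about what the normalization data might contain; the internal layout of the package's `normData` is not described here. The sketch also shows why a test set reuses the training set's values: the test features are scaled with the training minima and maxima, so they can fall outside [0, 1].

```r
# Hypothetical helpers; a sketch of the two equations above.

# Row strategy: d_i = (f_i - min(f)) / max(f), applied to one sequence.
normalizeRow <- function(f) (f - min(f)) / max(f)

# Column strategy: min-max scaling per feature. If normData is supplied
# (e.g. from a training set), its min/max are reused instead of recomputed.
normalizeColumn <- function(x, normData = NULL) {
  if (is.null(normData)) {
    normData <- list(min = apply(x, 2, min), max = apply(x, 2, max))
  }
  colRange <- normData$max - normData$min
  colRange[colRange == 0] <- 1  # assumption: constant columns map to 0, not NaN
  scaled <- sweep(sweep(x, 2, normData$min, "-"), 2, colRange, "/")
  list(feature = scaled, normData = normData)
}

train <- matrix(c(1, 2, 3, 10, 20, 30), nrow = 3,
                dimnames = list(paste0("seq", 1:3), c("AAA", "AAC")))
test <- matrix(c(2, 40), nrow = 1,
               dimnames = list("seq4", c("AAA", "AAC")))

trainNorm <- normalizeColumn(train)
testNorm <- normalizeColumn(test, normData = trainNorm$normData)
rowNorm <- t(apply(train, 1, normalizeRow))
```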

Value

If normalize = "none" or normalize = "row", this function returns a data frame. Row names are the sequence names, and column names are the k-mer names.

If normalize = "column", the function will return a list containing features (a data frame named "feature") and normalization values (a list named "normData") for extracting features for test sets.

References

[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420

[2] Shen J, Zhang J, Luo X, et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. U. S. A. 2007; 104:4337-41

[3] Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489

[4] Wang Y, Chen X, Liu Z-P, et al. De novo prediction of RNA-protein interactions from sequence information. Mol. BioSyst. 2013; 9:133-142

[5] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8

[6] Yang C, Yang L, Zhou M, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018; 34(22):3825-3834.

Examples


# Use "read.fasta" function of package "seqinr" to read a FASTA file:

seqs1 <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")
seqFreq1 <- computeFreq(seqs1, seqType = "RNA", k = 4, normalize = "row",
                        parallel.cores = 2)

data(demoPositiveSeq)
seqs2 <- demoPositiveSeq$Pro.positive

# Training set with normalization on column:

seqFreq2 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
                        normalize = "column", parallel.cores = 2)

# To build a test set with normalization on column,
# "normData" of the corresponding training set (generated by this function) is required:

seqFreq3 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
                        normalize = "column", normData = seqFreq2$normData,
                        parallel.cores = 2)

# If no normalization is used:

seqFreq4 <- computeFreq(seqs2, seqType = "Pro", computePro = "DeNovo", k = 3,
                        normalize = "none", parallel.cores = 2)

HAN-Siyu/ncProR documentation built on Nov. 3, 2023, 12:08 a.m.