computeFreq | R Documentation |
This function calculates the k-mer frequencies of RNA or protein sequences. Three kinds of protein representations are available.
computeFreq(
seqs,
seqType = c("RNA", "Pro"),
computePro = c("RPISeq", "DeNovo", "rpiCOOL"),
k = 3,
EDP = FALSE,
normalize = c("none", "row", "column"),
normData = NULL,
parallel.cores = 2,
cl = NULL
)
seqs |
sequences loaded by function |
seqType |
a string that specifies the nature of the sequence: "RNA" or "Pro" |
computePro |
a string that specifies the computation mode of protein sequences: "RPISeq", "DeNovo" or "rpiCOOL" |
k |
an integer that indicates the sliding window step. Default: 3 |
EDP |
logical. If TRUE, entropy density profile (EDP) will be computed. Default: FALSE |
normalize |
can be "none", "row" or "column" |
normData |
is the normalization data generated by this function.
If the input dataset is training set, or normalize strategy is |
parallel.cores |
an integer specifying the number of cores for parallel computation. Default: 2 |
cl |
parallel cores to be passed to this function. |
Function computeFreq
calculates the k-mer frequencies of RNA/protein sequences. The three computation modes
for protein frequencies are:
RPISeq
:
{A, G, V}, {I, L, F, P}, {Y, M, T, S}, {H, N, Q, W}, {R, K}, {D, E}, {C}
(Ref: [3]);
DeNovo
:
{D, E}, {H, R, K}, {C, G, N, Q, S, T, Y}, {A, F, I, L, M, P, V, W}
(Ref: [4]).
rpiCOOL
:
{A, E}, {I, L, F, M, V}, {N, D, T, S}, {G}, {P}, {R, K, Q, H}, {Y, W}, {C}
(Ref: [5]).
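The grouping idea behind these modes can be sketched in base R. This is only an illustration, not the package's code: the helper `reducedKmerFreq`, the digit labels "1" to "7", and the decision to skip residues outside the listed groups are all assumptions made for this example. It maps each residue to its RPISeq group and counts overlapping k-mers over the reduced alphabet.

```r
# Hypothetical illustration, not LION code: RPISeq-style 7-group alphabet.
# The digit labels are an arbitrary choice for this sketch.
groups <- list(
  "1" = c("A", "G", "V"),
  "2" = c("I", "L", "F", "P"),
  "3" = c("Y", "M", "T", "S"),
  "4" = c("H", "N", "Q", "W"),
  "5" = c("R", "K"),
  "6" = c("D", "E"),
  "7" = c("C")
)

# Lookup vector: names are residues, values are group labels.
lookup <- unlist(lapply(names(groups), function(g) {
  setNames(rep(g, length(groups[[g]])), groups[[g]])
}))

reducedKmerFreq <- function(residues, k = 3) {
  labels <- unname(lookup[toupper(residues)])
  labels <- labels[!is.na(labels)]  # assumption: skip residues outside the groups
  # Enumerate every possible k-mer so all sequences share the same columns.
  grid <- expand.grid(rep(list(names(groups)), k), stringsAsFactors = FALSE)
  allKmers <- do.call(paste0, unname(as.list(grid[rev(seq_len(k))])))
  n <- length(labels) - k + 1
  if (n < 1) return(setNames(numeric(length(allKmers)), allKmers))
  # Overlapping windows of length k, moved one residue at a time.
  kmers <- vapply(seq_len(n), function(i) {
    paste(labels[i:(i + k - 1)], collapse = "")
  }, character(1))
  counts <- table(factor(kmers, levels = allKmers))
  setNames(as.numeric(counts) / n, allKmers)
}

freq <- reducedKmerFreq(c("A", "G", "V", "R", "K"), k = 3)
freq[freq > 0]
```

With 7 groups and k = 3 the sketch yields 7^3 = 343 features per sequence; the DeNovo and rpiCOOL groupings would give 4^k and 8^k features under the same approach.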
If EDP = TRUE
, entropy density profile (EDP) will be computed with equation:
s_i = -1/H * c_i * log2(c_i), H = -sum(c_j * log2(c_j)).
c denotes the k-mer frequencies; i and j are the indices of the k-mer frequencies. (Ref: [6])
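The EDP equation can be tried on a small frequency vector. The following is a minimal sketch, assuming that a zero frequency contributes 0 to the sum (the usual convention for 0 * log2(0)) and that a profile with H = 0 maps to all zeros; neither convention is stated on this page, and the helper name `edp` is hypothetical.

```r
# Hypothetical helper, not the package's implementation.
edp <- function(freq) {
  # Assumed convention: 0 * log2(0) is treated as 0.
  term <- ifelse(freq > 0, freq * log2(freq), 0)
  H <- -sum(term)                            # H = -sum(c_j * log2(c_j))
  if (H == 0) return(rep(0, length(freq)))   # assumed handling of a degenerate profile
  -term / H                                  # s_i = -1/H * c_i * log2(c_i)
}

s <- edp(c(0.5, 0.25, 0.25, 0))
s
```

Because every s_i is the share of H contributed by one k-mer, the values of a non-degenerate profile sum to 1.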
The function also provides two normalization strategies: by row (each sequence) or by column (each feature). If by row, the dataset will be processed with equation (Ref: [2]): d_i = (f_i - min{f_1, f_2, ...}) / max{f_1, f_2, ...}. f_1, f_2, ..., f_i are the original values of each row.
If by column, the dataset will be processed with: d_i = (f_i - min{f_1, f_2, ...}) / (max{f_1, f_2, ...} - min{f_1, f_2, ...}).
In [2], normalization is computed by row (each sequence).
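The two strategies can be compared on a toy matrix. This is a sketch under stated assumptions: the helpers `normRow`, `fitColumn` and `applyColumn` are hypothetical, and the list of per-column minima and maxima only stands in for the idea of reusing training-set values on a test set; the actual structure of "normData" is not described on this page.

```r
# Hypothetical helpers, not the package's implementation.

# By row: d_i = (f_i - min) / max, computed within each sequence.
# A row whose maximum is 0 would give NaN; this sketch does not handle that case.
normRow <- function(mat) {
  t(apply(mat, 1, function(f) (f - min(f)) / max(f)))
}

# By column: store min and max of each feature from the training set ...
fitColumn <- function(train) {
  list(min = apply(train, 2, min), max = apply(train, 2, max))
}

# ... and reuse them for any dataset: d_i = (f_i - min) / (max - min).
applyColumn <- function(mat, stats) {
  span <- stats$max - stats$min
  sweep(sweep(mat, 2, stats$min, "-"), 2, span, "/")
}

train <- matrix(c(1, 3, 5,
                  2, 4, 8), nrow = 3)   # 3 sequences, 2 features
test  <- matrix(c(3, 5), nrow = 1)      # 1 sequence, same 2 features

rowNormed   <- normRow(matrix(c(1, 2, 4), nrow = 1))
stats       <- fitColumn(train)
trainNormed <- applyColumn(train, stats)
testNormed  <- applyColumn(test, stats)
```

Row normalization needs no stored values because each sequence is scaled on its own, which is why only the column strategy has to carry values from the training set to the test set.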
If normalize = "none"
or normalize = "row"
, this function will return a data frame.
Row names are sequence names, and column names are polymer names.
If normalize = "column"
, the function will return a list containing features (a data frame named "feature")
and normalization values (a list named "normData") for extracting features for test sets.
[1] Han S, Yang X, Sun H, et al. LION: an integrated R package for effective prediction of ncRNA–protein interaction. Briefings in Bioinformatics. 2022; 23(6):bbac420
[2] Shen J, Zhang J, Luo X, et al. Predicting protein-protein interactions based only on sequences information. Proc. Natl. Acad. Sci. U. S. A. 2007; 104:4337-41
[3] Muppirala UK, Honavar VG, Dobbs D. Predicting RNA-protein interactions using only sequence information. BMC Bioinformatics 2011; 12:489
[4] Wang Y, Chen X, Liu Z-P, et al. De novo prediction of RNA-protein interactions from sequence information. Mol. BioSyst. 2013; 9:133-142
[5] Akbaripour-Elahabad M, Zahiri J, Rafeh R, et al. rpiCOOL: A tool for In Silico RNA-protein interaction detection using random forest. J. Theor. Biol. 2016; 402:1-8
[6] Yang C, Yang L, Zhou M, et al. LncADeep: an ab initio lncRNA identification and functional annotation tool based on deep learning. Bioinformatics. 2018; 34(22):3825-3834.
featureFreq
# Use "read.fasta" function of package "seqinr" to read a FASTA file:
seqs1 <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")
seqFreq1 <- computeFreq(seqs1, seqType = "RNA", k = 4, normalize = "row",
parallel.cores = 2)
data(demoPositiveSeq)
seqs2 <- demoPositiveSeq$Pro.positive
# Training set with normalization on column:
seqFreq2 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
normalize = "column", parallel.cores = 2)
# To build a test set with normalization on column,
# "normData" of the corresponding training set (generated by this function) is required:
seqFreq3 <- computeFreq(seqs2, seqType = "Pro", computePro = "RPISeq", k = 3,
normalize = "column", normData = seqFreq2$normData,
parallel.cores = 2)
# If no normalization is used:
seqFreq4 <- computeFreq(seqs2, seqType = "Pro", computePro = "DeNovo", k = 3,
normalize = "none", parallel.cores = 2)