make_referFreq: Make Frequencies File for Log.Dist, Euc.Dist, and hexamer...

View source: R/LncFinder.R

Description

This function calculates the k-mer frequencies of lncRNA sequences and coding sequences (CDs). The resulting frequencies file can be used to compute the Logarithm-Distance (compute_LogDistance), the Euclidean-Distance (compute_EucDistance), and the hexamer score (compute_hexamerScore).

NOTE: To make a frequencies file for building a new LncFinder classifier with the function extract_features, use the function make_frequencies instead.

Usage

make_referFreq(
  cds.seq,
  lncRNA.seq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.orf = TRUE,
  ignore.illegal = TRUE
)

Arguments

cds.seq

Coding sequences (mRNA without UTRs). Can be a FASTA file loaded by seqinr-package.

lncRNA.seq

Long non-coding RNA sequences. Can be a FASTA file loaded by seqinr-package.

k

An integer specifying the sliding window size, i.e. the length of each k-mer. (Default: 6)

step

An integer specifying the step of the sliding window. (Default: 1)

alphabet

A vector of single characters specifying the distinct characters that can occur in the sequences. (Default: alphabet = c("a", "c", "g", "t"))

on.orf

Logical. Incomplete CDs can cause a false frame shift and therefore inaccurate hexamer frequencies. When on.orf = TRUE, the frequencies are calculated on the longest ORF of each sequence. Setting this parameter to TRUE is strongly recommended when mRNAs are used as CDs. Only available when alphabet = c("a", "c", "g", "t"). (Default: TRUE)

ignore.illegal

Logical. If TRUE, sequences containing non-nucleotide characters (any character other than "a", "c", "g", "t") are ignored when calculating the frequencies. Only available when alphabet = c("a", "c", "g", "t"). (Default: TRUE)
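
The roles of k and step can be illustrated with a minimal base-R sketch of sliding-window k-mer counting. The helper count_kmers is hypothetical and is not part of LncFinder; the package's internal counting may differ in detail.

```r
# Hypothetical helper (not an LncFinder function): count the k-mers of one
# sequence using a window of width k that advances by `step` positions.
count_kmers <- function(x, k = 6, step = 1) {
  x <- tolower(x)
  n <- nchar(x)
  if (n < k) return(table(character(0)))  # sequence shorter than the window
  starts <- seq(1, n - k + 1, by = step)  # window start positions
  kmers <- substring(x, starts, starts + k - 1)
  table(kmers)
}

toy <- "atgaaatgaaa"
overlapping <- count_kmers(toy, k = 3, step = 1)  # every position is a start
codon_wise  <- count_kmers(toy, k = 3, step = 3)  # windows do not overlap
print(overlapping)
print(codon_wise)
```

With step = 1 every position starts a window, so adjacent k-mers overlap; with step equal to k the windows tile the sequence without overlap.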

Details

The frequencies file made by this function serves as the reference for computing the Logarithm-Distance (compute_LogDistance), the Euclidean-Distance (compute_EucDistance), and the hexamer score (compute_hexamerScore).

For the highest accuracy, mRNAs should not be treated as CDs and passed to the parameter cds.seq. However, for some species the available CDs may be too few to calculate reliable frequencies. In that case, mRNAs can be used as CDs with on.orf = TRUE, so that the frequencies are calculated on the ORF region only. When on.orf = TRUE, users can also set step = 3 to simulate the translation process.
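
The effect of on.orf = TRUE combined with step = 3 can be sketched in base R as follows. The helper longest_orf is hypothetical and assumes an ORF runs from an "atg" to the first in-frame stop codon; LncFinder's own ORF detection may use different rules.

```r
# Hypothetical helper (not an LncFinder function): return the longest
# atg...stop ORF in a DNA string, or "" if no complete ORF is found.
longest_orf <- function(x) {
  x <- tolower(x)
  stops <- c("taa", "tag", "tga")
  best <- ""
  atg_pos <- gregexpr("atg", x, fixed = TRUE)[[1]]
  if (atg_pos[1] == -1) return(best)
  for (s in atg_pos) {
    codon_starts <- seq(s, nchar(x) - 2, by = 3)   # in-frame codon positions
    codons <- substring(x, codon_starts, codon_starts + 2)
    stop_idx <- which(codons %in% stops)
    if (length(stop_idx) > 0) {
      orf <- substr(x, s, codon_starts[stop_idx[1]] + 2)
      if (nchar(orf) > nchar(best)) best <- orf
    }
  }
  best
}

# Extract hexamers with a window that advances one codon (3 nt) at a time.
hexamers_step3 <- function(x) {
  starts <- seq(1, nchar(x) - 5, by = 3)
  substring(x, starts, starts + 5)
}

mrna <- "ccatgaaacccgggtaatt"           # toy mRNA with short flanking UTRs
orf <- longest_orf(mrna)
on_orf  <- hexamers_step3(orf)          # in-frame hexamers
on_mrna <- hexamers_step3(mrna)         # frame is shifted by the 5' UTR
print(on_orf)
print(on_mrna)
```

In this toy case the two 5' UTR bases shift the reading frame, so the hexamers counted on the whole mRNA share nothing with the in-frame hexamers counted on the ORF.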

Value

Returns a list containing the frequencies of protein-coding sequences and non-coding sequences.
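
As a rough illustration of how two reference frequency profiles can be compared with a query sequence, the sketch below uses toy values. The list element names cds and lncRNA, the frequency values, and both formulas are assumptions made for illustration; the actual structure of the returned list and the exact definitions used by compute_EucDistance and compute_hexamerScore may differ.

```r
# Toy reference frequencies over three k-mers (made-up values).
# The element names `cds` and `lncRNA` are hypothetical.
refer <- list(
  cds    = c(at = 0.40, tg = 0.35, ga = 0.25),
  lncRNA = c(at = 0.20, tg = 0.30, ga = 0.50)
)

# k-mer frequencies observed in one query sequence (made-up values).
query <- c(at = 0.40, tg = 0.30, ga = 0.30)

# Euclidean distance from the query profile to each reference profile.
euc_dist <- sapply(refer, function(ref) {
  sqrt(sum((query - ref[names(query)])^2))
})

# A common log-ratio formulation: k-mers enriched in coding sequences
# contribute positively, k-mers enriched in lncRNAs negatively.
log_ratio <- log(refer$cds / refer$lncRNA)
score <- sum(query * log_ratio[names(query)])

print(euc_dist)
print(score)
```

Here the query profile lies closer to the coding reference than to the lncRNA reference, and the log-ratio score is positive, both pointing toward the coding class for this toy input.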

References

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

See Also

make_frequencies, compute_LogDistance, compute_EucDistance, compute_hexamerScore.

Examples

## Not run: 
# Load sample nucleotide sequences with seqinr
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

# For demonstration only, the same sequences are passed as both CDs and lncRNAs
referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

## End(Not run)

LncFinder documentation built on Dec. 11, 2021, 9:39 a.m.