make_frequencies: Make the frequencies file for new classifier construction


View source: R/LncFinder.R

Description

This function is used to calculate the frequencies of lncRNAs, CDs, and secondary structure sequences. The frequencies file can be used to build the classifier using function extract_features. Functions make_frequencies and extract_features are useful when users are trying to build their own model.

NOTE: Function make_frequencies makes the frequencies file for building the classifiers of the LncFinder method. If users need to calculate Logarithm-Distance, Euclidean-Distance, and hexamer score, the frequencies file needs to be computed using function make_referFreq.

Usage

make_frequencies(
  cds.seq,
  mRNA.seq,
  lncRNA.seq,
  SS.features = FALSE,
  cds.format = "DNA",
  lnc.format = "DNA",
  check.cds = TRUE,
  ignore.illegal = TRUE
)

Arguments

cds.seq

Coding sequences (mRNA without UTRs). Can be a FASTA file loaded by seqinr-package or secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold. CDs are used only to calculate hexamer frequencies of nucleotide sequences, so secondary structure is not needed. Parameter cds.format should be "SS" when the input is secondary structure sequences. (See details for more information.)

mRNA.seq

mRNA sequences with Dot-Bracket Notation. The secondary structure sequences can be obtained from function run_RNAfold. mRNA sequences are used to calculate the frequencies of acgu-ACGU and acguD (see details); thus, mRNA sequences are required only when SS.features = TRUE.

lncRNA.seq

Long non-coding RNA sequences. Can be a FASTA file loaded by seqinr-package or secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold. If SS.features = TRUE, lncRNA.seq must be RNA sequences with secondary structure sequences and parameter lnc.format should be defined as "SS".

SS.features

Logical. If SS.features = TRUE, frequencies of secondary structure will also be calculated and the model can be built with secondary structure features. In this case, mRNA.seq and lncRNA.seq should be secondary structure sequences.

cds.format

String. Define the format of the sequences of cds.seq. Can be "DNA" or "SS". "DNA" for DNA sequences and "SS" for secondary structure sequences.

lnc.format

String. Define the format of lncRNAs (lncRNA.seq). Can be "DNA" or "SS". "DNA" for DNA sequences and "SS" for secondary structure sequences. This parameter must be defined as "SS" when SS.features = TRUE.

check.cds

Logical. Incomplete CDs can lead to a false shift and inaccurate hexamer frequencies. When check.cds = TRUE, hexamer frequencies will be calculated on the longest ORF. Setting this parameter to TRUE is strongly recommended when mRNA is used as CDs.

ignore.illegal

Logical. If TRUE, the sequences with non-nucleotide characters (nucleotide characters: "a", "c", "g", "t") will be ignored when calculating hexamer frequencies.

Details

This function is used to make the frequencies file for the LncFinder method. This file is needed when users are trying to build their own model.

To achieve high accuracy, mRNA should not be regarded as CDs and assigned to parameter cds.seq. However, the CDs of some species may be insufficient for calculating frequencies; in that case mRNAs can be used as CDs with parameter check.cds = TRUE, and hexamer frequencies will be calculated on the ORF region.
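The idea of restricting hexamer counting to the ORF region can be illustrated with a minimal base-R sketch. The helper longest_orf below is hypothetical and is not part of LncFinder; it assumes an ORF runs from "atg" to the first in-frame stop codon on the forward strand, which may differ from how the package actually locates the ORF.

```r
# Hypothetical helper (not a LncFinder function): return the longest
# forward-strand ORF, defined here as "atg" ... first in-frame stop codon.
longest_orf <- function(dna) {
  dna <- tolower(dna)
  n <- nchar(dna)
  best <- ""
  for (frame in 1:3) {
    if (n - frame + 1 < 3) next              # no complete codon in this frame
    starts <- seq(frame, n - 2, by = 3)
    codons <- substring(dna, starts, starts + 2)
    open <- NA                               # index of the current start codon
    for (i in seq_along(codons)) {
      if (is.na(open) && codons[i] == "atg") {
        open <- i
      } else if (!is.na(open) && codons[i] %in% c("taa", "tag", "tga")) {
        orf <- paste(codons[open:i], collapse = "")
        if (nchar(orf) > nchar(best)) best <- orf
        open <- NA
      }
    }
  }
  best
}

# An mRNA-like toy sequence with flanking bases around two ORFs
orf <- longest_orf("ccatggcctaaatgaaacccgggtagtt")
print(orf)
```

Counting hexamers only inside the returned region keeps the 3-nt window aligned with the reading frame even when the input carries UTRs.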

Because obtaining secondary structure sequences is time-consuming, users can provide only nucleotide sequences and build a model without secondary structure features (SS.features = FALSE). If users want to build a model with secondary structure features, parameter SS.features should be set as TRUE. In that case, the sequences of mRNA.seq and lncRNA.seq should be secondary structure sequences (Dot-Bracket Notation). Secondary structure sequences can be obtained by function run_RNAfold.

Please note that:

SS.features can improve the performance when the species of unevaluated sequences is identical to the species of the sequences used to build the model.

However, if users are trying to predict sequences with the model trained on other species, SS.features may lead to low accuracy.

The frequencies file consists of three groups: Hexamer Frequencies, acgu-ACGU Frequencies, and acguD Frequencies.

Hexamer Frequencies are calculated on the original nucleotide sequences by employing k-mer scheme (k = 6), and the sliding window will slide 3 nt each step.
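A minimal base-R sketch of this counting scheme follows. The helper count_kmers is hypothetical and not part of LncFinder; starting the window at position 1 and normalizing by the total count are assumptions made for illustration.

```r
# Hypothetical helper (not a LncFinder function): count k-mers with a
# window of width k that advances `step` nucleotides at a time.
count_kmers <- function(dna, k = 6, step = 3) {
  dna <- tolower(dna)
  n <- nchar(dna)
  if (n < k) return(table(character(0)))     # sequence too short for one k-mer
  starts <- seq(1, n - k + 1, by = step)
  kmers <- substring(dna, starts, starts + k - 1)
  table(kmers)
}

counts <- count_kmers("atggcctacagttatgtaa")  # windows start at 1, 4, 7, 10, 13
freqs <- counts / sum(counts)                 # assumed normalization
print(freqs)
```

With a step of 3, each window covers two consecutive codons when the sequence starts in frame.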

For any secondary structure sequence (Dot-Bracket Notation), if a position is a dot, the corresponding nucleotide of the RNA sequence will be replaced with the character "D". acguD Frequencies are the k-mer frequencies (k = 4) calculated on this new sequence.

Similarly, for any secondary structure sequences (Dot-Bracket Notation), if one position is "(" or ")", the corresponding nucleotide of the RNA sequence will be replaced with upper case ("A", "C", "G", "U").

A brief example,

DNA Sequence: 5'- t a c a g t t a t g -3'

RNA Sequence: 5'- u a c a g u u a u g -3'

Dot-Bracket Sequence: 5'- . . . . ( ( ( ( ( ( -3'

acguD Sequence: { D, D, D, D, g, u, u, a, u, g }

acgu-ACGU Sequence: { u, a, c, a, G, U, U, A, U, G }
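The two encodings in the brief example can be reproduced with a short base-R sketch. The helpers to_acguD and to_acguACGU are hypothetical names used for illustration only; they are not LncFinder functions.

```r
# Hypothetical helpers (not LncFinder functions) that encode an RNA sequence
# with its Dot-Bracket string, following the rules described above.
to_acguD <- function(rna, db) {
  nt <- strsplit(tolower(rna), "")[[1]]
  ss <- strsplit(db, "")[[1]]
  stopifnot(length(nt) == length(ss))
  nt[ss == "."] <- "D"                       # unpaired positions become "D"
  paste(nt, collapse = "")
}

to_acguACGU <- function(rna, db) {
  nt <- strsplit(tolower(rna), "")[[1]]
  ss <- strsplit(db, "")[[1]]
  stopifnot(length(nt) == length(ss))
  paired <- ss %in% c("(", ")")
  nt[paired] <- toupper(nt[paired])          # paired positions become upper case
  paste(nt, collapse = "")
}

rna <- "uacaguuaug"
db  <- "....(((((("
acguD    <- to_acguD(rna, db)      # "DDDDguuaug"
acguACGU <- to_acguACGU(rna, db)   # "uacaGUUAUG"
print(c(acguD, acguACGU))
```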

Value

Returns a list that consists of the frequencies of protein-coding sequences and non-coding sequences.

References

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

See Also

run_RNAfold, read_SS, build_model, extract_features, make_referFreq.

Examples

### Only for examples:
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

## Not run: 
### Obtain the secondary structure sequences (Windows OS):
RNAfold.path <- '"E:/Program Files/ViennaRNA/RNAfold.exe"'
SS.seq <- run_RNAfold(Seqs, RNAfold.path = RNAfold.path, parallel.cores = 2)

### Make frequencies file with secondary structure features,
my_file_1 <- make_frequencies(cds.seq = SS.seq, mRNA.seq = SS.seq,
                              lncRNA.seq = SS.seq, SS.features = TRUE,
                              cds.format = "SS", lnc.format = "SS",
                              check.cds = TRUE, ignore.illegal = FALSE)

## End(Not run)

### Make frequencies file without secondary structure features,
my_file_2 <- make_frequencies(cds.seq = Seqs, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "DNA",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

### The input of cds.seq and lncRNA.seq can also be secondary structure
### sequences when SS.features = FALSE, such as,
data(demo_SS.seq)
SS.seq <- demo_SS.seq
my_file_3 <- make_frequencies(cds.seq = SS.seq, lncRNA.seq = Seqs,
                              SS.features = FALSE, cds.format = "SS",
                              lnc.format = "DNA", check.cds = TRUE,
                              ignore.illegal = FALSE)

LncFinder documentation built on Dec. 11, 2021, 9:39 a.m.