extract_features: Extract the Features


View source: R/LncFinder.R

Description

This function extracts features from the input sequences and constructs the dataset. It only extracts features; to build new models, use function build_model.

Usage

extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)

Arguments

Sequences

mRNA sequences or long non-coding RNA sequences. Can be a FASTA file loaded with the seqinr package, or secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold. If Sequences contains secondary structure sequences, parameter format should be set to "SS".

label

Optional. String. Indicates the label of the sequences, such as "NonCoding" or "Coding".

SS.features

Logical. If SS.features = TRUE, secondary structure features will be extracted. In this case, Sequences should be secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold and parameter format should be set as "SS".

format

String. Can be "DNA" or "SS". Defines the format of Sequences: "DNA" for DNA sequences and "SS" for secondary structure sequences. This parameter must be set to "SS" when SS.features = TRUE.

frequencies.file

String, or a list obtained from function make_frequencies. Enter a species name ("human", "mouse" or "wheat") to use the pre-built frequencies files, or assign the user's own frequencies list (see function make_frequencies).

parallel.cores

Integer. The number of cores for parallel computation. The default is 2. Set to -1 to run this function with all cores.
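
As a rough illustration of the parallel.cores convention, the sketch below shows one way a value of -1 could be mapped to all available cores. resolve_cores is a hypothetical helper written for this page and is not part of LncFinder. Capping the request at the number of detected cores is also an assumption, so the package's internal handling may differ.

```r
library(parallel)

# Hypothetical helper (not part of LncFinder): turn a parallel.cores value
# into a usable core count, treating -1 as "use every detected core".
resolve_cores <- function(parallel.cores = 2) {
  available <- detectCores()
  if (is.na(available)) available <- 1L   # detectCores() can return NA

  if (parallel.cores == -1) {
    return(as.integer(available))
  }

  # Assumption: never go below 1 core or above what the machine reports.
  as.integer(max(1, min(parallel.cores, available)))
}

resolve_cores(2)    # at most 2 cores
resolve_cores(-1)   # all detected cores
```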

Details

This function extracts the features and constructs the dataset.

Because obtaining secondary structure sequences is time consuming, users can build the model with only the sequence and EIIP features (SS.features = FALSE). When SS.features = TRUE, Sequences should be secondary structure sequences (Dot-Bracket Notation) obtained from function run_RNAfold, and parameter format should be set to "SS".

Please note that:

Secondary structure features (SS.features) can improve performance when the species of the unevaluated sequences is identical to the species of the sequences that were used to build the model.

However, if users predict sequences with a model trained on another species, setting SS.features to TRUE may lead to low accuracy.
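
The dependency between SS.features and format described above can be checked before a long-running call. The following is a minimal sketch; check_feature_args is a hypothetical helper written for this page, not a LncFinder function, and the package may validate its arguments differently.

```r
# Hypothetical pre-flight check (not part of LncFinder): secondary
# structure features require the input to be declared as format = "SS".
check_feature_args <- function(SS.features = FALSE, format = "DNA") {
  format <- match.arg(format, c("DNA", "SS"))

  if (isTRUE(SS.features) && format != "SS") {
    stop('format must be "SS" when SS.features = TRUE')
  }
  invisible(TRUE)
}

check_feature_args(SS.features = FALSE, format = "DNA")  # passes silently
check_feature_args(SS.features = TRUE,  format = "SS")   # passes silently

# The inconsistent combination raises an error:
res <- try(check_feature_args(SS.features = TRUE, format = "DNA"),
           silent = TRUE)
inherits(res, "try-error")
```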

Value

Returns a data.frame with 11 features when SS.features is FALSE, and 19 features when SS.features is TRUE.

Features

1. Features based on sequence:

The length and coverage of the longest ORF (ORF.Max.Len and ORF.Max.Cov);

Log-Distance.lncRNA (Seq.lnc.Dist);

Log-Distance.protein-coding transcripts (Seq.pct.Dist);

Distance-Ratio.sequence (Seq.Dist.Ratio).

2. Features based on EIIP (electron-ion interaction pseudopotential) value:

Signal at 1/3 position (Signal.Peak);

Signal to noise ratio (SNR);

The minimum value of the top 10% power spectrum (Signal.Min);

The quantiles Q1 and Q2 of the top 10% power spectrum (Signal.Q1 and Signal.Q2);

The maximum value of the top 10% power spectrum (Signal.Max).

3. Features based on secondary structure sequence:

Log-Distance.acguD.lncRNA (Dot_lnc.dist);

Log-Distance.acguD.protein-coding transcripts (Dot_pct.dist);

Distance-Ratio.acguD (Dot_Dist.Ratio);

Log-Distance.acgu-ACGU.lncRNA (SS.lnc.dist);

Log-Distance.acgu-ACGU.protein-coding transcripts (SS.pct.dist);

Distance-Ratio.acgu-ACGU (SS.Dist.Ratio);

Minimum free energy (MFE);

Percentage of Unpair-Pair (UP.PCT).
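
To make the EIIP-based features in group 2 more concrete, the sketch below maps a DNA string to EIIP values, takes the power spectrum with base R's fft, and derives six summary values named after the features listed above. It is an illustration only: eiip_spectrum_features is a hypothetical helper, and the choices made here (dropping the zero-frequency term, using the first half of the spectrum, defining SNR as the peak divided by the mean power) are assumptions that may not match LncFinder's implementation.

```r
# EIIP values commonly quoted for the four DNA bases.
eiip <- c(A = 0.1260, C = 0.1340, G = 0.0806, T = 0.1335)

# Hypothetical helper (not part of LncFinder).
eiip_spectrum_features <- function(dna) {
  bases  <- strsplit(toupper(dna), "")[[1]]
  signal <- unname(eiip[bases])
  stopifnot(!anyNA(signal))        # this sketch only handles A/C/G/T
  n <- length(signal)
  stopifnot(n >= 6)

  # Power spectrum for frequencies 1..floor(n/2); the DC term is dropped.
  power <- (Mod(fft(signal))^2)[2:(floor(n / 2) + 1)]

  # Power at the 1/3 position (the period-3 component).
  peak <- power[round(n / 3)]

  # The strongest 10% of the spectrum.
  top <- sort(power, decreasing = TRUE)[seq_len(ceiling(0.1 * length(power)))]

  c(Signal.Peak = peak,
    SNR         = peak / mean(power),
    Signal.Min  = min(top),
    Signal.Q1   = unname(quantile(top, 0.25)),
    Signal.Q2   = unname(quantile(top, 0.50)),
    Signal.Max  = max(top))
}

# A toy sequence with strict period-3 structure.
feats <- eiip_spectrum_features(paste(rep("ATG", 30), collapse = ""))
print(feats)
```

A sequence with strong period-3 structure, as in the toy input, concentrates its power at the 1/3 position, which is the property the Signal.Peak and SNR features are meant to capture.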

References

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

See Also

svm_tune, build_model, make_frequencies, run_RNAfold, read_SS.

Examples

## Not run: 
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Extract features with the pre-built frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)

### Use your own frequencies file by assigning the frequencies list to
### parameter "frequencies.file".

## End(Not run)

LncFinder documentation built on Dec. 11, 2021, 9:39 a.m.