compute_hexamerScore: Compute Hexamer Score
In LncFinder: LncRNA Identification and Analysis Using Heterologous Features


Description

Computes the hexamer score proposed by CPAT (Wang et al. 2013). The score can be calculated on the full sequence or on the longest ORF region, and both the window size (k) and the step of the sliding window can be customized.

Usage

compute_hexamerScore(
  Sequences,
  label = NULL,
  referFreq,
  k = 6,
  step = 1,
  alphabet = c("a", "c", "g", "t"),
  on.ORF = FALSE,
  auto.full = FALSE,
  parallel.cores = 2
)

Arguments

`Sequences`	A FASTA file loaded by function `read.fasta` of `seqinr-package`.
`label`	Optional. String. Indicate the label of the sequences such as "NonCoding", "Coding".
`referFreq`	A list obtained from function `make_referFreq`.
`k`	An integer that indicates the sliding window size. (Default: `6`)
`step`	An integer that indicates the step of the sliding window. (Default: `1`)
`alphabet`	A vector of single characters that specifies the alphabet of the sequences. (Default: `alphabet = c("a", "c", "g", "t")`)
`on.ORF`	Logical. If `TRUE`, hexamer score will be calculated on the longest ORF region. NOTE: If `TRUE`, the input has to be DNA sequences. (Default: `FALSE`)
`auto.full`	Logical. When `on.ORF = TRUE` but no ORF can be found, if `auto.full = TRUE`, hexamer score will be calculated on full sequences automatically; if `auto.full` is `FALSE`, the sequences that have no ORF will be discarded. Ignored when `on.ORF = FALSE`. (Default: `FALSE`)
`parallel.cores`	Integer. The number of cores for parallel computation. Set to `-1` to run this function with all available cores. (Default: `2`)
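
To make the roles of `k` and `step` concrete, the following is a minimal base R sketch of how a sliding window walks along a sequence. The helper `sliding_kmers` is hypothetical and is not part of LncFinder; it assumes the sequence is given as a single lowercase string rather than the character vector returned by `read.fasta`.

```r
# Hypothetical helper (not part of LncFinder): return the k-mers visited by
# a sliding window of size `k` that advances `step` bases at a time.
sliding_kmers <- function(s, k = 6, step = 1) {
  n <- nchar(s)
  if (n < k) return(character(0))          # sequence shorter than one window
  starts <- seq(1, n - k + 1, by = step)   # 1-based start of each window
  substring(s, starts, starts + k - 1)     # vectorised k-mer extraction
}

dna <- "atggcgtacgctag"                    # toy sequence, 14 bases

overlapping <- sliding_kmers(dna, k = 6, step = 1)  # windows shifted by one base
in_frame    <- sliding_kmers(dna, k = 6, step = 3)  # windows shifted by one codon

length(overlapping)
in_frame
```

With `step = 1` every position starts a window, whereas `step = 3` keeps the windows aligned to codon boundaries, which is the setting the Details section attributes to CPAT.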

Details

This function computes the hexamer score proposed by CPAT (Wang et al. 2013). In CPAT, the hexamer score is calculated on the longest ORF region with a sliding-window step of 3 (i.e. step = 3). A hexamer is six adjoining bases, hence k = 6. In compute_hexamerScore, however, step, k, and the calculated region (full sequence or ORF) can all be customized, so the function can be applied more broadly.
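
The idea behind the score can be sketched as a mean log-ratio of k-mer frequencies in coding versus non-coding reference sets. The code below is only an illustration under stated assumptions, not LncFinder's implementation: `kmer_score`, `coding_freq` and `noncoding_freq` are hypothetical names, the frequency tables are toy values, and skipping k-mers that are missing from either table is an assumption made here for simplicity.

```r
# Hypothetical sketch of a log-ratio k-mer score (not LncFinder's code).
# `coding_freq` and `noncoding_freq` are named numeric vectors of k-mer
# frequencies; they stand in for whatever the reference list provides.
kmer_score <- function(s, coding_freq, noncoding_freq, k = 6, step = 1) {
  n <- nchar(s)
  if (n < k) return(NA_real_)
  starts <- seq(1, n - k + 1, by = step)
  kmers  <- substring(s, starts, starts + k - 1)
  fc <- coding_freq[kmers]                 # NA when a k-mer is not in the table
  fn <- noncoding_freq[kmers]
  # Assumption: ignore k-mers that are absent or have zero frequency.
  ok <- !is.na(fc) & !is.na(fn) & fc > 0 & fn > 0
  if (!any(ok)) return(NA_real_)
  mean(log(fc[ok] / fn[ok]))               # > 0 leans coding, < 0 leans non-coding
}

# Toy reference tables with k = 2 so the example stays readable.
coding_freq    <- c(at = 0.4, tg = 0.4, ga = 0.2)
noncoding_freq <- c(at = 0.2, tg = 0.2, ga = 0.6)

score_coding_like    <- kmer_score("atg", coding_freq, noncoding_freq, k = 2)
score_noncoding_like <- kmer_score("gat", coding_freq, noncoding_freq, k = 2)
```

In this toy setup a positive score means the sequence's k-mers are more typical of the coding reference, and a negative score means they are more typical of the non-coding reference.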

Value

A data frame.

References

Liguo Wang, Hyun Jung Park, Surendra Dasari, Shengqin Wang, Jean-Pierre Kocher & Wei Li. CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Research, 2013, 41(6):e74.

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li. LncFinder: an integrated platform for long non-coding RNA identification utilizing sequence intrinsic composition, structural information, and physicochemical property. Briefings in Bioinformatics, 2019, 20(6):2009-2027.

Author(s)

HAN Siyu

Examples

## Not run: 
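# Load example sequences with seqinr
Seqs <- seqinr::read.fasta(file =
"http://www.ncbi.nlm.nih.gov/WebSub/html/help/sample_files/nucleotide-sample.txt")

# Build the reference frequencies. The same sequences are passed as both
# coding and lncRNA input here for demonstration purposes only.
referFreq <- make_referFreq(cds.seq = Seqs, lncRNA.seq = Seqs, k = 6, step = 1,
                            alphabet = c("a", "c", "g", "t"), on.orf = TRUE,
                            ignore.illegal = TRUE)

# Load the demo sequences to be scored
data(demo_DNA.seq)
Sequences <- demo_DNA.seq

# Score the longest ORF of each sequence; fall back to the full sequence
# when no ORF is found (auto.full = TRUE)
hexamerScore <- compute_hexamerScore(Sequences, label = "NonCoding", referFreq = referFreq,
                                     k = 6, step = 1, alphabet = c("a", "c", "g", "t"),
                                     on.ORF = TRUE, auto.full = TRUE, parallel.cores = 2)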

## End(Not run)

LncFinder documentation built on Sept. 29, 2024, 1:06 a.m.