encodeSeqShape: Encode k-mer DNA sequence and n-th order DNA Shape features


View source: R/encodeSeqShape.R

Description

DNAshapeR can generate feature vectors for a user-defined model. These models can be based on DNA sequence (1-mer, 2-mer, 3-mer) features, DNA shape (MGW, Roll, ProT, HelT) features, or any combination thereof. For 1-mers, sequence is encoded as four binary features at each nucleotide position (i.e., 0001 for adenine, 0010 for cytosine, 0100 for guanine, and 1000 for thymine) (Zhou et al., 2015). Encoding of 2-mers and 3-mers (16 and 64 binary features at each position, respectively) is also supported. Shape features include first order, second order, and higher order values for the four structural parameters MGW, Roll, ProT, and HelT. Second order shape features are product terms of values for the same category of shape feature at adjacent positions. The function can generate any subset of these features, e.g., a single shape category or only first order shape features, as well as any desired combination of shape and sequence features. Feature encoding returns a feature matrix for a dataset of multiple sequences, in which each sequence yields one concatenated feature vector. The output of this function can be used directly as input to statistical machine learning methods.
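The encoding described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: the helper names encode_one_mer and second_order are hypothetical, the bit patterns are copied from the description, and the numbers passed to second_order are arbitrary placeholders rather than real shape predictions.

```r
# Bit patterns exactly as listed in the description (A = 0001, ..., T = 1000).
one_mer_codes <- list(
  A = c(0, 0, 0, 1),
  C = c(0, 0, 1, 0),
  G = c(0, 1, 0, 0),
  T = c(1, 0, 0, 0)
)

# Hypothetical helper: concatenate the four binary features of every position.
encode_one_mer <- function(sequence) {
  bases <- strsplit(toupper(sequence), "")[[1]]
  stopifnot(all(bases %in% names(one_mer_codes)))
  unlist(one_mer_codes[bases], use.names = FALSE)
}

# Hypothetical helper: second order terms as products of values at
# adjacent positions within one shape category.
second_order <- function(values) {
  stopifnot(length(values) >= 2)
  values[-length(values)] * values[-1]
}

one_mer_features <- encode_one_mer("ACGT")          # 4 positions x 4 features
product_terms <- second_order(c(5.0, 4.0, 6.0))     # placeholder values
```

A sequence of length L gives 4 * L binary 1-mer features in this sketch, and a vector of L first order values gives L - 1 second order product terms. The real function may order, trim, or normalize columns differently.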

Usage

encodeSeqShape(fastaFileName, shapeMatrix, featureNames, normalize)

Arguments

fastaFileName

A character string giving the name of the input FASTA file, including the full path if the file is located outside the current working directory.

shapeMatrix

A matrix containing the DNAshape prediction results.

featureNames

A vector containing a combination of user-defined sequence and shape parameters. The parameters can be any combination of "k-mer", "n-shape", "n-MGW", "n-ProT", "n-Roll", and "n-HelT", where k and n are integers (for example, "1-mer" or "1-shape").
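To make the naming convention concrete, the following base R sketch splits a feature name into its integer order and its feature type. The helper parse_feature_name is hypothetical and is not part of DNAshapeR; how the package itself interprets these names may differ.

```r
# Hypothetical helper: split names such as "2-mer" or "1-MGW" into the
# integer prefix (k or n) and the feature type that follows the hyphen.
parse_feature_name <- function(featureName) {
  pattern <- "^([0-9]+)-(mer|shape|MGW|ProT|Roll|HelT)$"
  if (!grepl(pattern, featureName)) {
    stop("Unrecognized feature name: ", featureName)
  }
  list(
    order = as.integer(sub(pattern, "\\1", featureName)),
    type = sub(pattern, "\\2", featureName)
  )
}

# A vector combining sequence and shape features, in the documented format.
featureNames <- c("1-mer", "2-shape", "1-MGW")
parsed <- lapply(featureNames, parse_feature_name)
```

Here "1-mer" requests 1-mer sequence features, "2-shape" requests second order features for the shape parameters, and "1-MGW" requests first order MGW features only.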

normalize

A logical indicating whether to perform normalization. Defaults to TRUE.

Value

featureVector A matrix containing encoded features. Sequence features are represented as binary numbers, while shape features are represented as real numbers.

Author(s)

Tsu-Pei Chiu

Examples

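library(DNAshapeR)
# Locate the sample FASTA file shipped with the package
fn <- system.file("extdata", "CGRsample_short.fa", package = "DNAshapeR")
# Predict DNA shape features for the sequences in the file
pred <- getShape(fn)
# Encode first order shape features for every sequence
featureNames <- c("1-shape")
featureVector <- encodeSeqShape(fn, pred, featureNames)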

DNAshapeR documentation built on Nov. 8, 2020, 8:04 p.m.