buildFeatureVector: build Feature Vector_2

View source: R/03.buildFeatureVector.R

buildFeatureVectorR Documentation

build Feature Vector_2

Description

This function creates a data frame. Fields include peak name, upstream sequence, downstream sequence, and features to be used in classifying the putative polyadenylation site.

Usage

buildFeatureVector(
  peaks,
  genome = Drerio,
  upstream = 40L,
  downstream = 30L,
  wordSize = 6L,
  alphabet = "ACGT",
  sampleType = c("TP", "TN", "unknown"),
  replaceNAdistance = 30L,
  method = c("NaiveBayes", "SVM"),
  fetchSeq = FALSE,
  return_sequences = FALSE
)

Arguments

peaks

An object of GRanges that may contain the upstream and downstream sequence information. This item is created by the function BED6WithSeq2GRangesSeq.

genome

Name of the genome to get sequences from. To find out a list of available genomes, please type BSgenome::available.genomes() in R.

upstream

An integer(1) vector, length of upstream sequence to retrieve.

downstream

An integer(1) vector, length of downstream sequence to retrieve.

wordSize

An integer(1) vector, size of the kmer feature for the upstream sequence. wordSize = 6 should always be used.

alphabet

A character(1) vector, a string containing DNA bases. By default, "ACTG".

sampleType

A character(1) vector, indicating type of sequences for building feature vectors. Options are TP (true positive) and TN (true negative) for training data, or unknown for test data.

replaceNAdistance

An integer(1) vector, specifying an number for avg.distanceA2PeakEnd, the average distance of As to the putative pA site, when there is no A in the downstream sequence.

method

A character(1) vector, specifying a machine learning method to to use. Currently, only "NaiveBayes" is implemented.

fetchSeq

A logical (1), indicating whether upstream and downstream sequences should be retrieved from the BSgenome object at this step or not.

return_sequences

A logical(1) vector, indicating whether upstream and downstream sequences should be included in the output

Value

An object of "featureVector"

Author(s)

Sarah Sheppard, Haibo Liu, Jianhong Ou, Nathan Lawson, Lihua J. Zhu

Examples

library(BSgenome.Drerio.UCSC.danRer7)
testFile <- system.file("extdata", "test.bed",
                        package = "cleanUpdTSeq")
peaks <- BED6WithSeq2GRangesSeq(file = testFile, 
                               skip = 1L, withSeq = TRUE)
## build the feature vector for the test set with sequence information 
testSet.NaiveBayes = buildFeatureVector(peaks,
                                        genome = Drerio, 
                                        upstream = 40L,
                                        downstream = 30L, 
                                        wordSize = 6L, 
                                        alphabet = "ACGT",
                                        sampleType = "unknown",
                                        replaceNAdistance = 30, 
                                        method = "NaiveBayes", 
                                        fetchSeq = FALSE,
                                        return_sequences = TRUE)

## convert the test set to GRanges without upstream and downstream 
## sequence information
peaks <- BED6WithSeq2GRangesSeq(file = testFile, 
                               skip = 1L, withSeq = FALSE)
#build the feature vector for the test set without sequence information
testSet.NaiveBayes = buildFeatureVector(peaks,
                                        genome = Drerio, 
                                        upstream = 40L,
                                        downstream = 30L, 
                                        wordSize = 6L,
                                        alphabet = "ACGT",
                                        sampleType = "unknown",
                                        replaceNAdistance = 30,
                                        method = "NaiveBayes", 
                                        fetchSeq = TRUE,
                                        return_sequences = TRUE)


jianhong/cleanUpdTSeq documentation built on May 8, 2024, 12:13 a.m.