Generating GC/repeat matched randomly selected genomic sequences for the negative set

Share:

Description

Generates null sequences (negative set) with matching repeat and GC content as the input bed file for positive set regions.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
genNullSeqs(
  inputBedFN, 
  genomeVersion='hg19', 
  outputBedFN = 'negSet.bed', 
  outputPosFastaFN = 'posSet.fa',
  outputNegFastaFN = 'negSet.fa', 
  xfold = 1, 
  repeat_match_tol = 0.02, 
  GC_match_tol = 0.02, 
  length_match_tol = 0.02, 
  batchsize = 5000, 
  nMaxTrials = 20, 
  genome = NULL)

Arguments

inputBedFN

positive set regions

genomeVersion

genome version: 'hg19' and 'hg18' are supported. Default='hg19'. For other genomes, provide the BSgenome object using parameter 'genome'

outputBedFN

output file name for the null sequences genomic regions. Default='negSet.bed'

outputPosFastaFN

output file name for the positive set sequences. Default='posSet.fa'

outputNegFastaFN

output file name for the negative set sequences. Default='negSet.fa'

xfold

controls the desired number of sequences in the negative set. Default=1 (same number as in positive set)

repeat_match_tol

tolerance for difference in repeat ratio. Default=0.02 (repeat content difference of 0.02 or less is acceptable)

GC_match_tol

tolerance for difference in GC content. Default=0.02

length_match_tol

tolerance for difference in relative sequence length. Default=0.02

batchsize

number of candidate random sequences tested in each trial. Default=5000

nMaxTrials

maximum number of trials. Default=20.

genome

BSgenome object. Default=NULL. If this parameter is used, parameter genomeVersion is ignored.

Value

Writes the null sequences to files with the provided filenames. Outputs the filename for the output negative sequences file.

Author(s)

Mahmoud Ghandi

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
# Example 1: 
# genNullSeqs('ctcfpos.bed' );

#Example 2:
# genNullSeqs('ctcfpos.bed', nMaxTrials=3, xfold=2, genomeVersion = 'hg18' );

#Example 3:
# genNullSeqs('ctcfpos.bed', xfold=2, genomeVersion = 'hg18', outputBedFN = 'ctcf_negSet.bed',
# outputPosFastaFN = 'ctcf_posSet.fa',outputNegFastaFN = 'ctcf_negSet.fa' );

#Example 4:
# Input file names:  
  
  posBedFN = 'test_positives.bed' # positive set genomic ranges (bed format)
  genomeVer = 'hg19' #genome version 
  testfn= 'test_testset.fa'    #test set (FASTA format)
  
# output file names:  
  posfn= 'test_positives.fa'   #positive set (FASTA format)
  negfn= 'test_negatives.fa'   #negative set (FASTA format)
  kernelfn= 'test_kernel.txt' #kernel matrix
  svmfnprfx= 'test_svmtrain'  #SVM files 
  outfn =   'output.txt'      #output scores for sequences in the test set       

#  genNullSeqs(posBedFN, genomeVersion = genomeVer, 
#    outputPosFastaFN = posfn, outputNegFastaFN = negfn );

#  gkmsvm_kernel(posfn, negfn, kernelfn);                #computes kernel 
#  gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx);       #trains SVM
#  gkmsvm_classify(testfn, svmfnprfx, outfn);            #scores test sequences 

#  using L=18, K=7, maxnmm=4

#  gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4);     #computes kernel 
#  gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx);                 #trains SVM
#  gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences