genNullSeqs | R Documentation |
Generates null sequences (negative set) with matching repeat and GC content as the input bed file for positive set regions.
genNullSeqs(
inputBedFN,
genomeVersion='hg19',
outputBedFN = 'negSet.bed',
outputPosFastaFN = 'posSet.fa',
outputNegFastaFN = 'negSet.fa',
xfold = 1,
repeat_match_tol = 0.02,
GC_match_tol = 0.02,
length_match_tol = 0.02,
batchsize = 5000,
nMaxTrials = 20,
genome = NULL)
inputBedFN |
positive set regions |
genomeVersion |
genome version: 'hg19' and 'hg18' are supported. Default='hg19'. For other genomes, provide the BSgenome object using parameter 'genome' |
outputBedFN |
output file name for the null sequences genomic regions. Default='negSet.bed' |
outputPosFastaFN |
output file name for the positive set sequences. Default='posSet.fa' |
outputNegFastaFN |
output file name for the negative set sequences. Default='negSet.fa' |
xfold |
controls the desired number of sequences in the negative set. Default=1 (same number as in positive set) |
repeat_match_tol |
tolerance for difference in repeat ratio. Default=0.02 (repeat content difference of 0.02 or less is acceptable) |
GC_match_tol |
tolerance for difference in GC content. Default=0.02 |
length_match_tol |
tolerance for difference in relative sequence length. Default=0.02 |
batchsize |
number of candidate random sequences tested in each trial. Default=5000 |
nMaxTrials |
maximum number of trials. Default=20. |
genome |
BSgenome object. Default=NULL. If this parameter is used, parameter genomeVersion is ignored. |
Writes the null sequences to files with the provided filenames. Outputs the filename for the output negative sequences file.
Mahmoud Ghandi
# Example 1:
# genNullSeqs('ctcfpos.bed' );
#Example 2:
# genNullSeqs('ctcfpos.bed', nMaxTrials=3, xfold=2, genomeVersion = 'hg18' );
#Example 3:
# genNullSeqs('ctcfpos.bed', xfold=2, genomeVersion = 'hg18', outputBedFN = 'ctcf_negSet.bed',
# outputPosFastaFN = 'ctcf_posSet.fa',outputNegFastaFN = 'ctcf_negSet.fa' );
#Example 4:
# Input file names:
posBedFN = 'test_positives.bed' # positive set genomic ranges (bed format)
genomeVer = 'hg19' #genome version
testfn= 'test_testset.fa' #test set (FASTA format)
# output file names:
posfn= 'test_positives.fa' #positive set (FASTA format)
negfn= 'test_negatives.fa' #negative set (FASTA format)
kernelfn= 'test_kernel.txt' #kernel matrix
svmfnprfx= 'test_svmtrain' #SVM files
outfn = 'output.txt' #output scores for sequences in the test set
# genNullSeqs(posBedFN, genomeVersion = genomeVer,
# outputPosFastaFN = posfn, outputNegFastaFN = negfn );
# gkmsvm_kernel(posfn, negfn, kernelfn); #computes kernel
# gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM
# gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
# using L=18, K=7, maxnmm=4
# gkmsvm_kernel(posfn, negfn, kernelfn, L=18, K=7, maxnmm=4); #computes kernel
# gkmsvm_train(kernelfn, posfn, negfn, svmfnprfx); #trains SVM
# gkmsvm_classify(testfn, svmfnprfx, outfn, L=18, K=7, maxnmm=4); #scores test sequences
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.