Classifying (scoring) new sequences using the gkmSVM model

Description

Given the support vectors (SVs), their corresponding coefficients (alphas), and a set of sequences, calculates the SVM score for each sequence.

Usage

gkmsvm_classify(seqfile, svmfnprfx, outfile, L=10, K=6, maxnmm=3, 
maxseqlen=10000, maxnumseq=1000000, useTgkm=1, alg=0, addRC=TRUE, usePseudocnt=FALSE, 
batchSize=100000, wildcardLambda=1.0, wildcardMismatchM=2, alphabetFN="NULL", 
svseqfile=NA, alphafile=NA)

Arguments

seqfile

input sequences file name (FASTA format)

svmfnprfx

SVM model file name prefix

outfile

output file name

L

word length, default=10

K

number of informative columns, default=6

maxnmm

maximum number of mismatches to consider, default=3

maxseqlen

maximum sequence length in the sequence files, default=10000

maxnumseq

maximum number of sequences in the sequence files, default=1000000

useTgkm

filter type: 0 (use full filter), 1 (use truncated filter: this guarantees non-negative counts for all L-mers), 2 (use h[m], gkm count vector), 3 (wildcard), 4 (mismatch), default=1

alg

algorithm type: 0(auto), 1(XOR Hashtable), 2(tree), default=0

addRC

adds reverse complement sequences, default=TRUE

usePseudocnt

adds a constant to count estimates, default=FALSE

batchSize

number of sequences to compute scores for in batch, default=100000

wildcardLambda

lambda for wildcard kernel, default=1.0

wildcardMismatchM

max mismatch for Mismatch kernel or wildcard kernel, default=2

alphabetFN

alphabet file name; if not specified, the inputs are assumed to be DNA sequences

svseqfile

SVM support vectors sequence file name (not needed if svmfnprfx is provided)

alphafile

SVM support vectors weights file name (not needed if svmfnprfx is provided)
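
The last two arguments indicate that the model can be located either through svmfnprfx or through two explicit files. The following is a minimal sketch of a pre-flight check that the model files exist before scoring. It assumes the training step wrote files named `<prefix>_svseq.fa` and `<prefix>_svalpha.out`; that naming convention is not stated on this page, and the helper names `sv_files_from_prefix` and `missing_model_files` are hypothetical.

```r
# Hypothetical helper: derive the support-vector file names from a model prefix.
# ASSUMPTION: the "<prefix>_svseq.fa" / "<prefix>_svalpha.out" naming is not
# stated on this page; adjust it to match the files your training step wrote.
sv_files_from_prefix <- function(svmfnprfx) {
  list(
    svseqfile = paste0(svmfnprfx, "_svseq.fa"),
    alphafile = paste0(svmfnprfx, "_svalpha.out")
  )
}

# Return the names of any model files that are missing on disk.
missing_model_files <- function(svmfnprfx) {
  files <- unlist(sv_files_from_prefix(svmfnprfx))
  files[!file.exists(files)]
}

missing <- missing_model_files("test_svmtrain")
if (length(missing) > 0) {
  message("Model files not found: ", paste(missing, collapse = ", "))
}
```

Running a check like this first may give a clearer error than letting the scoring call fail on a missing file.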

Details

Classification using SVM: gkmsvm_classify can be used to score any set of sequences. Note that the same parameters (L, K, maxnmm) that were passed to gkmsvm_kernel should be specified here for optimal classification.

gkmsvm_classify(testfn, svmfnprfx, outfn); #scores test sequences
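
One way to keep L, K, and maxnmm in step between the kernel and classification calls is to store them once and pass the same list to both functions. The sketch below uses the file names from the Examples section and the default parameter values from Usage; the guard around the calls is an addition for illustration so that the snippet does nothing when the package or the input files are unavailable.

```r
# Keep the kernel parameters in one place so both calls see identical values.
kernel_params <- list(L = 10, K = 6, maxnmm = 3)

# Build full argument lists by appending the shared parameters.
kernel_args <- c(
  list("test_positives.fa", "test_negatives.fa", "test_kernel.txt"),
  kernel_params
)
classify_args <- c(
  list("test_testset.fa", "test_svmtrain", "output.txt"),
  kernel_params
)

# Only run when the package and the input files are actually available.
if (requireNamespace("gkmSVM", quietly = TRUE) &&
    all(file.exists("test_positives.fa", "test_negatives.fa", "test_testset.fa"))) {
  do.call(gkmSVM::gkmsvm_kernel, kernel_args)
  gkmSVM::gkmsvm_train("test_kernel.txt", "test_positives.fa",
                       "test_negatives.fa", "test_svmtrain")
  do.call(gkmSVM::gkmsvm_classify, classify_args)
}
```

Changing a value in `kernel_params` then updates both calls at once, which avoids scoring with parameters that differ from those used to build the kernel.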

Author(s)

Mahmoud Ghandi

Examples

  #Input file names:  
  posfn= 'test_positives.fa'   #positive set (FASTA format)
  negfn= 'test_negatives.fa'   #negative set (FASTA format)
  testfn= 'test_testset.fa'    #test set (FASTA format)
  
  #Output file names:  
  kernelfn= 'test_kernel.txt' #kernel matrix
  svmfnprfx= 'test_svmtrain'  #SVM files 
  outfn =   'output.txt'      #output scores for sequences in the test set       

#  gkmsvm_kernel(posfn, negfn, kernelfn);                #computes kernel 
#  gkmsvm_train(kernelfn,posfn, negfn, svmfnprfx);       #trains SVM
#  gkmsvm_classify(testfn, svmfnprfx, outfn);            #scores test sequences
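
Once scoring has finished, the output file can be loaded for ranking or thresholding. The sketch below assumes the score file has two tab-separated columns (sequence ID, then score) with no header; that layout is an assumption and should be checked against your own output. The rows written to the temporary file are made-up values used only to keep the snippet self-contained.

```r
# ASSUMPTION: the score file has two tab-separated columns (sequence ID, score)
# and no header. The rows written here are made-up values for illustration only.
outfn <- tempfile(fileext = ".txt")
writeLines(c("seq1\t1.25", "seq2\t-0.40", "seq3\t0.10"), outfn)

scores <- read.table(outfn, sep = "\t", header = FALSE,
                     col.names = c("seq_id", "score"),
                     stringsAsFactors = FALSE)

# Rank sequences from highest to lowest SVM score.
ranked <- scores[order(scores$score, decreasing = TRUE), ]

# Label predictions by the sign of the score (a threshold of 0 is a common
# choice, not something this page prescribes).
ranked$predicted_positive <- ranked$score > 0
print(ranked)
```

To use this with real results, replace the temporary file with the path passed as outfile.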