View source: R/featureEvaluate.R
Feature sets produced by different feature coding schemes are used as input to classification models, and the performance of each model is reported in the result.
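As a rough illustration of the idea (encode sequences as a feature matrix, then score a classifier on it), the following base R sketch uses amino acid composition as a toy feature coding and a leave-one-out 1-nearest-neighbour test. The peptides, the `composition` helper and the classifier are invented for illustration; this is not the code that featureEvaluate or any BioSeqClass feature function runs.

```r
## Toy feature coding: fraction of each of the 20 standard amino acids.
## Hypothetical helper, not part of BioSeqClass.
aa <- strsplit("ACDEFGHIKLMNPQRSTVWY", "")[[1]]
composition <- function(s) {
  residues <- factor(strsplit(s, "")[[1]], levels = aa)
  as.numeric(table(residues)) / nchar(s)
}

## Invented peptides and class labels.
toySeq   <- c("ACKKA", "KKKAC", "WYWYW", "YYWWA")
toyClass <- c("+1", "+1", "-1", "-1")

## One row per sequence, one column per amino acid.
features <- t(sapply(toySeq, composition))
colnames(features) <- aa

## Leave-one-out test with a 1-nearest-neighbour rule:
## each sequence is predicted from its closest other sequence.
d <- as.matrix(dist(features))
diag(d) <- Inf
pred <- toyClass[apply(d, 1, which.min)]
accuracy <- mean(pred == toyClass)
accuracy
```

featureEvaluate automates this kind of loop over many feature codings and parameter settings, using the classifier chosen by classifyMethod.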
featureEvaluate(seq, classLable, fileName, ele.type, featureMethod,
cv=10, classifyMethod="libsvm",
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k, g,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC"),
aaindex.name, n, d, w=0.05, start.pos, stop.pos, psiblast.path,
database.path, hmmpfam.path, pfam.path, Evalue=10^-5,
na.type="all", na.strand="all", diprodb.method="all", diprodb.type="all",
svm.kernel="linear", svm.scale=FALSE, svm.path, svm.options="-t 0",
knn.k=1, nnet.size=2, nnet.rang=0.7, nnet.decay=0, nnet.maxit=100)
seq |
a string vector for the protein, DNA, or RNA sequences. |
classLable |
a factor or vector for the class labels of the sequences in seq. |
fileName |
a string for the output file name. |
ele.type |
a string for the type of biological sequence. This must be one of the strings "rnaBase", "dnaBase", "aminoacid" or "aminoacid2". |
featureMethod |
a string vector naming the feature coding methods to test. The available names are "Binary", "CTD", "FragmentComposition", "GapPairComposition", "CKSAAP", "Hydro", "ACH", "AAindex", "ACI", "ACF", "PseudoAAComp", "PSSM", "DOMAIN", "BDNAVIDEO", and "DIPRODB". |
classifyMethod |
a string for the classification method. This must be one of the strings "libsvm", "svmlight", "NaiveBayes", "randomForest", "knn", "tree", "nnet", "rpart", "ctree", "ctreelibsvm", "bagging". |
cv |
an integer for the number of cross-validation folds, or the string "leave_one_out" for the jackknife test. |
group |
a string vector for the groups of amino acids. The available groups are: "aaH", "aaV", "aaZ", "aaP", "aaF", "aaS" or "aaE". |
k |
an integer indicating the length of sequence fragment (k>=1). |
g |
an integer indicating the distance between two amino acids/bases (g>=0). |
hydro.methods |
a string vector for the methods of coding the protein hydrophobic effect. The available methods are: "kpm" or "SARAH1". |
hydro.indexs |
a string vector for the hydrophobic indices used in coding the protein hydrophobic effect. The available indices are: "hydroE", "hydroF" or "hydroC". |
aaindex.name |
a string for the name of a physicochemical or biochemical property in AAindex. |
n |
an integer used as parameter of
d |
an integer used as parameter of
w |
a numeric value for the weight factor of sequence order effect in
|
start.pos |
an integer vector denoting the start position of the fragment window. If it is missing, it is 1 by default. |
stop.pos |
an integer vector denoting the stop position of the fragment window. If it is missing, it is the length of the sequence by default. |
psiblast.path |
a string for the path of PSI-BLAST program blastpgp. blastpgp will be employed to iteratively search database and generate position-specific scores for each position in the alignment. |
database.path |
a string for the path of formatted protein database. Database can be formatted by formatdb program. |
hmmpfam.path |
a string for the path of the hmmpfam program in HMMER. hmmpfam will be employed to predict domains using models in the Pfam database. |
pfam.path |
a string for the path of pfam domain database. |
Evalue |
a numeric value for the E-value cutoff of predicted Pfam domains. |
na.type |
a string for nucleic acid type. It must be "DNA", "DNA/RNA", "RNA", or "all". |
na.strand |
a string for strand information. It must be "double", "single", or "all". |
diprodb.method |
a string for mode of property determination. It can be "experimental", "calculated", or "all". |
diprodb.type |
a string for property type. It can be "physicochemical", "conformational", "letter based", or "all". |
svm.kernel |
a string for kernel function of SVM. |
svm.scale |
a logical vector indicating the variables to be scaled. |
svm.path |
a string for the path to the SVMlight binaries (required if the path is unknown to the OS). |
svm.options |
Optional parameters to SVMlight, e.g. "-t 2 -g 0.1". For further details see "How to use" on http://svmlight.joachims.org/. |
nnet.size |
number of units in the hidden layer. Can be zero if there are skip-layer units. |
nnet.rang |
Initial random weights on [-rang, rang]. Value about 0.5 unless the inputs are large, in which case it should be chosen so that rang * max(|x|) is about 1. |
nnet.decay |
parameter for weight decay. |
nnet.maxit |
maximum number of iterations. |
knn.k |
number of neighbours considered in function |
featureEvaluate can test feature coding methods for short peptides, proteins, DNA or RNA.
It returns a list ranked by classification accuracy.
Each element in the list has three components: "data", "model", and "performance".
"data" is a data.frame object that stores the feature matrix; its last column
is the class label. "model" is a vector describing the feature coding method and
classifier, which contains 6 elements: "Feature_Function", "Feature_Parameter",
"Feature_Number", "Model", "Model_Parameter", and "Cross_Validataion".
"performance" is a vector for the performance of the classification model,
which contains 10 elements: "tp", "tn", "fp", "fn", "prcc", "sn", "sp", "acc",
"mcc", "pc".
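The returned structure can be explored with ordinary list operations. The sketch below builds a small mock result by hand (all names and numbers in `mockResult` are invented and are not output from featureEvaluate) and tabulates accuracy per feature coding. It also recomputes "sn", "sp", "acc" and "mcc" from the confusion counts using the standard textbook definitions; that the package uses exactly these formulas is an assumption.

```r
## Mock of the documented return structure; every value is invented.
## make_entry is a hypothetical helper used only for this illustration.
make_entry <- function(fun, par, tp, tn, fp, fn) {
  list(
    data = data.frame(x1 = c(0.1, 0.9), class = c("+1", "-1")),  # placeholder features
    model = c(Feature_Function = fun, Feature_Parameter = par),
    performance = c(tp = tp, tn = tn, fp = fp, fn = fn,
                    acc = (tp + tn) / (tp + tn + fp + fn))
  )
}
mockResult <- list(
  make_entry("codingA", "k=3", tp = 25, tn = 27, fp = 13, fn = 15),
  make_entry("codingB", "g=7", tp = 30, tn = 28, fp = 12, fn = 10)
)

## One row per feature coding, sorted by decreasing accuracy.
accTable <- data.frame(
  Feature_Function  = sapply(mockResult, function(x) x$model[["Feature_Function"]]),
  Feature_Parameter = sapply(mockResult, function(x) x$model[["Feature_Parameter"]]),
  acc               = sapply(mockResult, function(x) x$performance[["acc"]])
)
accTable <- accTable[order(accTable$acc, decreasing = TRUE), ]

## Standard definitions (assumed, not taken from the package source).
confusionMetrics <- function(tp, tn, fp, fn) {
  c(sn  = tp / (tp + fn),                      # sensitivity
    sp  = tn / (tn + fp),                      # specificity
    acc = (tp + tn) / (tp + tn + fp + fn),     # accuracy
    mcc = (tp * tn - fp * fn) /
          sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
}
p <- mockResult[[2]]$performance
metrics <- confusionMetrics(p[["tp"]], p[["tn"]], p[["fp"]], p[["fn"]])
metrics
```

With a real result, the same `sapply` calls can be applied directly to the list returned by featureEvaluate, as the Examples section does for "acc".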
Hong Li
## read positive/negative sequence from files.
tmpfile1 = file.path(path.package("BioSeqClass"), "example", "acetylation_K.pos40.pep")
tmpfile2 = file.path(path.package("BioSeqClass"), "example", "acetylation_K.neg40.pep")
posSeq = as.matrix(read.csv(tmpfile1,header=FALSE,sep="\t",row.names=1))[,1]
negSeq = as.matrix(read.csv(tmpfile2,header=FALSE,sep="\t",row.names=1))[,1]
seq=c(posSeq,negSeq)
classLable=c(rep("+1",length(posSeq)),rep("-1",length(negSeq)) )
if(interactive()){
## test various feature coding methods.
## it may be time consuming.
fileName = tempfile()
testFeatureSet = featureEvaluate(seq, classLable, fileName, ele.type="aminoacid",
featureMethod=c("Binary", "CTD", "FragmentComposition", "GapPairComposition",
"Hydro"), cv=5, classifyMethod="libsvm",
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") )
summary = read.csv(fileName,sep="\t",header=TRUE)
fix(summary)
## Evaluate features from different feature coding functions
feature.index = 1:5
tmp <- testFeatureSet[[1]]$data
colnames(tmp) <- paste(testFeatureSet[[feature.index[1]]]$model["Feature_Function"],testFeatureSet[[feature.index[1]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
data <- tmp[,-ncol(tmp)]
for(i in 2:length(feature.index) ){
tmp <- testFeatureSet[[feature.index[i]]]$data
colnames(tmp) <- paste(testFeatureSet[[feature.index[i]]]$model["Feature_Function"],testFeatureSet[[feature.index[i]]]$model["Feature_Parameter"],colnames(tmp),sep=" ; ")
data <- data.frame(data, tmp[,-ncol(tmp)] )
}
name <- colnames(data)
data <- data.frame(data, tmp[,ncol(tmp)] )
## feature forward selection by 'cv_FFS_classify'
## it is very time consuming.
combineFeatureResult = fsFFS(data,stop.n=50,classifyMethod="knn",cv=5)
tmp = sapply(combineFeatureResult,function(x){c(length(x$features),x$performance["acc"])})
plot(tmp[1,],tmp[2,],xlab="featureNumber",ylab="Accuracy",main="result of FFS_KNN",pch=19)
lines(tmp[1,],tmp[2,])
## compare the prediction accuracy based on different feature coding methods and different classification models.
## it is very time consuming.
testResult = lapply(c("libsvm", "randomForest", "knn", "tree"),
function(x){
tmp = featureEvaluate(seq, classLable, fileName = tempfile(),
ele.type="aminoacid", featureMethod=c("Binary", "CTD", "FragmentComposition",
"GapPairComposition", "Hydro"), cv=5, classifyMethod=x,
group=c("aaH", "aaV", "aaZ", "aaP", "aaF", "aaS", "aaE"), k=3, g=7,
hydro.methods=c("kpm", "SARAH1"), hydro.indexs=c("hydroE", "hydroF", "hydroC") );
sapply(tmp,function(y){c(y$model[["Feature_Function"]], y$model[["Feature_Parameter"]], y$model[["Model"]], y$performance[["acc"]])})
})
tmpFeature = as.factor(c(sapply(testResult,function(x){apply(x[1:2,],2,function(y){paste(y,collapse="; ")})})))
tmpModel = as.factor(c(sapply(testResult,function(x){x[3,]})))
tmp1 = data.frame(as.integer(tmpFeature), as.integer(tmpModel), as.numeric(c(sapply(testResult,function(x){x[4,]}))) )
require(scatterplot3d)
s3d=scatterplot3d(tmp1,color=c("red","blue","green","yellow")[tmp1[,2]],pch=19,
xlab="Feature Coding", ylab="Classification Model",
zlab="Accuracy under 5-fold cross validation",lab=c(10,6,7),
y.ticklabs=c("",as.character(sort(unique(tmpModel))),"") )
}