Description Details Author(s) References See Also Examples
The package is intended as a convenient wrapper to multiple classification and feature selection algorithms for two-class classification problems, for example classifying high- and low-risk patients from breast cancer molecular profiling data. Classification training, performance evaluation and bootstrapped feature selection are carried out with multiple algorithms. Combining the feature lists selected during bootstrapping yields a highly robust final set of features.
Package: bootfs
Type: Package
License: GPL (>=2)
The following methods are implemented:
SVM + SCAD: Support vector machines with smoothly clipped absolute deviation (SCAD) feature selection (from package penalizedSVM). Other models implemented in the penalizedSVM package can also be used.
RF + Boruta: Random forests and Boruta feature selection.
RF: Random forest algorithm of Breiman.
PAMR: Prediction analysis for microarrays (nearest shrunken centroids).
GBM: Gradient boosting machine.
Note: the DrHSVM method is slow and sometimes throws errors, so use it with care.
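For orientation, the strings accepted by the `fs.methods` argument of `doCV` and `doBS` can be mapped to the algorithms listed above. This is an informal sketch: the identifiers are taken from the example below, while the one-line descriptions are the author's summaries, not official definitions from the respective packages.

```r
## Informal overview of the fs.methods identifiers used by doCV()/doBS().
## Identifiers come from the example below; descriptions are informal.
fs_method_info <- c(
  pamr        = "Prediction analysis for microarrays (nearest shrunken centroids)",
  gbm         = "Gradient boosting machine",
  rf_boruta   = "Random forest with Boruta feature selection",
  rf          = "Random forest (Breiman)",
  scad        = "SVM with SCAD penalty (penalizedSVM)",
  DrHSVM      = "Doubly regularized SVM (slow, use with care)",
  "1norm"     = "SVM with L1 (1-norm) penalty",
  "scad+L2"   = "SVM with combined SCAD and L2 penalty"
)

## look up a description by identifier
fs_method_info[["rf_boruta"]]
```

Any subset of these identifiers can be passed as a character vector to `fs.methods`, as done in the example below.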
Christian Bender
Maintainer: Christian Bender <christian.bender@tron-mainz.de>
Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory, 1992.
Zhang, Hao Helen and Ahn, Jeongyoun and Lin, Xiaodong and Park, Cheolwoo: Gene selection using support vector machines with non-convex penalty. Bioinformatics (2006) 22 (1): 88-95
Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32
Miron B. Kursa, Witold R. Rudnicki (2010). Feature Selection with the Boruta Package. Journal of Statistical Software, 36(11), p. 1-13. URL: http://www.jstatsoft.org/v36/i11/
Robert Tibshirani, Trevor Hastie, Balasubramanian Narasimhan, and Gilbert Chu. "Diagnosis of multiple cancer types by shrunken centroids of gene expression" PNAS 2002 99:6567-6572 (May 14)
Friedman, J. H.: Stochastic Gradient Boosting. (March 1999)
Friedman, J. H.: Greedy function approximation: A gradient boosting machine. Ann. Statist. Volume 29, Number 5 (2001), 1189-1232.
control_params, doCV, doBS, resultBS, importance_igraph
Help for penalizedSVM
Help for Boruta
Help for randomForest
Help for pamr
Help for gbm
## Not run:
# library(bootfs)
set.seed(1234)
data <- simDataSet(nsam=30, ngen=100, sigma=2, plot=TRUE)
logX <- data$logX
groupings <- data$groupings
## all methods:
#methods <- c("pamr", "gbm", "rf_boruta", "rf", "scad", "DrHSVM", "1norm", "scad+L2")
## selected methods
methods <- c("pamr", "gbm", "rf")
## create a parameter object used for the different methods
# for crossvalidation
paramsCV <- control_params(seed=123,
ncv=5, repeats=10, jitter=FALSE, ## general parameters
maxiter=100, maxevals=50, ## svm parameters
max_allowed_feat=500, n.threshold=50, ## pamr parameters
maxRuns=300, ## RF parameters
ntree = 1000, ## GBM parameters
shrinkage = 0.01, interaction.depth = 3,
bag.fraction = 0.75, train.fraction = 0.75,
n.minobsinnode = 3, n.cores = 1,
verbose = TRUE)
# for bootstrapping
paramsBS <- control_params(seed=123,
jitter=FALSE, bstr=10, ## general parameters
maxiter=100, maxevals=50, bounds=NULL, ## svm parameters
max_allowed_feat=500, n.threshold=50, ## pamr parameters
maxRuns=300, ## RF parameters
ntree = 1000, ## GBM parameters
shrinkage = 0.01, interaction.depth = 3,
bag.fraction = 0.75, train.fraction = 0.75,
n.minobsinnode = 3, n.cores = 1,
verbose = TRUE, saveres=FALSE
)
## run the cross-validation
## takes a while
retCV <- doCV(logX, groupings, fs.methods = methods, DIR = NULL, params=paramsCV)
# old interface
#retCV <- doCV(logX, groupings, fs.methods = methods,
# DIR = "cv", seed = 123, ncv = 5, repeats = 5,
# jitter=FALSE, maxiter = 100, maxevals = 50,
# max_allowed_feat = 50, n.threshold = 50, maxRuns = 30)
## run the bootstrapping
## takes a while
retBS <- doBS(logX, groupings, fs.methods=methods, DIR=NULL, params=paramsBS)
# old interface
#retBS <- doBS(logX, groupings,
# fs.methods=methods,
# DIR=NULL,
# seed=123, bstr=10, saveres=FALSE, jitter=FALSE,
# maxiter=100, maxevals=50, bounds=NULL,
# max_allowed_feat=NULL, n.threshold=50,
# maxRuns=30)
## only use one method
#retBS <- doBS(logX, groupings, fs.methods=methods[1], DIR=NULL,
# seed=123, bstr=2, saveres=FALSE, jitter=FALSE, maxiter=20,
# maxevals=20, bounds=NULL,max_allowed_feat=NULL,
# n.threshold=25,maxRuns=10)
## create the importance graph using the 3 FS-methods
## and export the adjacency matrix containing the
## numbers of occurrences of the features, as well
## as the top hits.
res <- resultBS(retBS, DIR=NULL, vlabel.cex = 3, filter = 0, saveres = FALSE)
## only use methods 1 and 2
res2 <- resultBS(retBS, DIR=NULL, vlabel.cex = 3, filter = 0, saveres = FALSE,
useresults=1:2)
## plot the importance graph
ig <- importance_igraph(res$adj, main = "my test",
highlight = NULL, layout="layout.ellipsis",
pdf=NULL, pointsize=12, tk=FALSE,
node.color="grey", node.filter=NULL,
vlabel.cex=1.2, vlabel.cex.min=0.5, vlabel.cex.max=4,
max_node_cex=8,
edge.width=1, edge.filter=1, max_edge_cex=2, ewprop=3 )
## show the data and groups
drawheat(logX, groups = groupings[[1]], log = FALSE,
mar = c(12, 10), distfun = dist.eucsq,
hclustfun = ward, cexCol = 1, cexRow = 1)
## subset for the top hits by using logX[,tophits]
adj <- res$adj
ord <- order(diag(res$adj),decreasing=TRUE)
adj <- res$adj[ord,ord]
tophits <- colnames(adj)[1:5]
drawheat(logX[,tophits], groups = groupings[[1]], log = FALSE,
mar = c(12, 10), distfun = dist.eucsq,
hclustfun = ward, cexCol = 1, cexRow = 1)
## do multiclass classification
data(iris)
groupings <- list(Species=iris$Species)
logX <- iris[,1:4]
methods <- c("gbm","rf_boruta","pamr")
## crossvalidation
retCV <- doCV(logX, groupings, fs.methods = methods, DIR = NULL, params=paramsCV)
## bootstrapped feature selection
retBS <- doBS(logX, groupings, fs.methods=methods, DIR=NULL, params=paramsBS)
## make results from all methods used
res <- resultBS(retBS, DIR="bs_multi", vlabel.cex = 3, filter = 0, saveres = FALSE)
## plot the importance graph
ig <- importance_igraph(res$adj, main = "my test",
highlight = NULL, layout="layout.ellipsis",
pdf=NULL, pointsize=12, tk=FALSE,
node.color="grey", node.filter=NULL,
vlabel.cex=1.2, vlabel.cex.min=0.5, vlabel.cex.max=4,
max_node_cex=8,
edge.width=1, edge.filter=1, max_edge_cex=2, ewprop=3 )
## remove the created directories
system("rm -rf cv")
system("rm -rf cv_multi")
system("rm -rf bs")
system("rm -rf bs_multi")
## End(Not run)