CombineSplits | R Documentation |
CombineSplits evaluates a specified performance measure across all splits created by ModelTrain and conducts statistical tests to determine the best performing descriptor set and model (D-M) combinations. Performance can evaluate many performance measures across all splits created by ModelTrain, then outputs a data frame for each D-M combination.
CombineSplits(cml.result, metric = "enhancement", m = NA, thresh = 0.5)
Performance(cml.result, metrics = "enhancement", m = NA, thresh = 0.5)
cml.result | an object of class |
metric | the model performance measure to use. This should be one of |
m | the number of tests to use for binary model performance measures (see Details). If |
thresh | if the predicted probability that a binary response is 1 is above this threshold, an observation is classified as 1. Used to compute |
metrics | a character vector containing a subset of the performance measures above. |
CombineSplits quantifies how sensitive performance measures are to fold assignments (assignments to training and test sets). Intuitively, this assesses how much a performance measure may change if a slightly different data set is used.
ModelTrain is a designed study in that 'experimental' conditions are defined according to two factors: method (D-M combination) and split (fold assignment). The factor "split" is a blocking factor, and factor "method" is of primary interest. The design of this experiment is amenable to an analysis of variance to identify significant differences between performance measures according to factors and levels. CombineSplits outputs such an analysis of variance decomposition.
The multiple comparisons similarity (MCS) plot shows the results of tests for significance in all pairwise differences of D-M mean performance measures. Because there can be many estimated mean performance measures for a dataset, care must be taken to adjust for multiple testing; this is done using the Tukey-Kramer multiple comparison procedure (see Tukey (1953) and Kramer (1956)). If you are having trouble viewing all the components of the plot, make the plotting window larger.
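The blocked design described above can be sketched in base R. This is only an illustration on simulated values, not the internal implementation of CombineSplits: the data frame `perf`, the method labels "A", "B", "C", and the effect sizes are all made up, and `TukeyHSD` from the stats package stands in for the multiple comparison step.

```r
set.seed(1)

# Simulated performance values (NOT real chemmodlab output):
# 3 hypothetical D-M combinations, each evaluated on the same 5 splits.
perf <- expand.grid(method = c("A", "B", "C"),
                    split  = paste0("s", 1:5))

# Assumed effects: a method effect of primary interest and a
# split (block) effect shared by all methods within a split.
method_effect <- c(A = 2, B = 3, C = 5)[as.character(perf$method)]
split_effect  <- rnorm(5, sd = 0.5)[as.integer(perf$split)]
perf$value    <- method_effect + split_effect + rnorm(nrow(perf), sd = 0.3)

# Two-way analysis of variance with split as the blocking factor.
fit <- aov(value ~ split + method, data = perf)
anova_table <- summary(fit)

# All pairwise differences between method means, adjusted for
# multiple testing.
tukey <- TukeyHSD(fit, which = "method")

print(anova_table)
print(tukey$method)
```

Entering `split` in the model removes split-to-split variation from the error term, so the comparison of methods is made within blocks.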
By default, CombineSplits uses initial enhancement proposed by Kearsley et al. (1996) to assess model performance. Enhancement at m tests is the hit rate at m tests (accumulated actives at m tests divided by m) divided by the proportion of actives in the entire collection. It is a relative measure of hit rate improvement offered by the new method beyond what can be expected under random selection, and values much larger than one are desired. Initial enhancement is typically taken to be enhancement at m = 300 tests.
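The definition of enhancement at m tests can be written out as a short sketch. The helper `enhancement_at_m` below is hypothetical and is not a chemmodlab function; it assumes compounds are tested in decreasing order of predicted score and breaks ties by original order, which may differ from what the package does.

```r
# Hypothetical helper (not part of chemmodlab): enhancement at m tests.
# y:     0/1 vector of observed activity (1 = active)
# score: predicted score, higher means more likely active
# m:     number of top-ranked compounds to test
enhancement_at_m <- function(y, score, m) {
  stopifnot(length(y) == length(score), m >= 1, m <= length(y))
  # Rank compounds from highest to lowest predicted score.
  ord <- order(score, decreasing = TRUE)
  # Hit rate at m tests: accumulated actives at m tests divided by m.
  hit_rate <- sum(y[ord][seq_len(m)]) / m
  # Proportion of actives in the entire collection.
  base_rate <- mean(y)
  hit_rate / base_rate
}

# Toy data: 10 compounds, 4 of them active.
y     <- c(1, 0, 1, 0, 0, 1, 0, 0, 1, 0)
score <- c(0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.05, 0.4, 0.15)

# Top 4 by score contain 3 actives: (3 / 4) / (4 / 10) = 1.875
enh <- enhancement_at_m(y, score, m = 4)
print(enh)
```

A value of 1 corresponds to the hit rate expected under random selection.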
Root mean squared error (RMSE), despite its popularity in statistics, may be inappropriate for continuous chemical assay responses because it assumes losses are equal for both under-predicting and over-predicting biological activity. A suitable alternative may be initial enhancement. Other options are the coefficient of determination (R2) and Spearman's rho.
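As a rough illustration of these continuous-response measures, the sketch below computes them in base R on made-up values. The formula used for R2 (one minus residual over total sum of squares) is an assumption; the package may define it differently, for example as a squared correlation.

```r
# Made-up observed and predicted responses.
y    <- c(1.0, 2.0, 3.0, 4.0, 5.0)
yhat <- c(1.5, 1.5, 3.5, 3.5, 5.5)

# Root mean squared error.
rmse <- sqrt(mean((y - yhat)^2))

# Coefficient of determination (assumed definition: 1 - SSE / SST).
r2 <- 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

# Spearman's rank correlation between observed and predicted values.
rho <- cor(y, yhat, method = "spearman")

# RMSE treats over- and under-prediction symmetrically:
# shifting every prediction up or down by 1 gives the same value.
rmse_over  <- sqrt(mean((y - (y + 1))^2))
rmse_under <- sqrt(mean((y - (y - 1))^2))

print(c(rmse = rmse, r2 = r2, rho = rho))
print(c(over = rmse_over, under = rmse_under))
```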
For binary chemical assay responses, alternatives to misclassification rate (error rate), which may be inappropriate because it assigns equal weights to false positives and false negatives, include sensitivity, specificity, area under the receiver operating characteristic curve (auc), positive predictive value, also known as precision (ppv), F1 measure (fmeasure), and initial enhancement.
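The threshold-based measures can be illustrated with a small base R sketch. The data and variable names are invented for illustration, and the code uses the standard textbook definitions rather than the package's internal code; `thresh` plays the role described in the arguments above.

```r
# Made-up binary responses and predicted probabilities of class 1.
y      <- c(1, 1, 1, 0, 0, 0, 0, 1)
prob   <- c(0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.8)
thresh <- 0.5

# Classify as 1 when the predicted probability is above the threshold.
pred <- as.integer(prob > thresh)

# Confusion matrix counts.
tp <- sum(pred == 1 & y == 1)
fp <- sum(pred == 1 & y == 0)
tn <- sum(pred == 0 & y == 0)
fn <- sum(pred == 0 & y == 1)

error_rate  <- (fp + fn) / length(y)   # misclassification rate
sensitivity <- tp / (tp + fn)          # true positive rate
specificity <- tn / (tn + fp)          # true negative rate
ppv         <- tp / (tp + fp)          # precision
fmeasure    <- 2 * ppv * sensitivity / (ppv + sensitivity)  # F1

print(c(error_rate = error_rate, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, fmeasure = fmeasure))
```

Raising or lowering `thresh` trades false positives against false negatives, which is why measures other than the error rate can be more informative.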
Performance: outputs a data frame with performance measures for each D-M combination.
Jacqueline Hughes-Oliver, Jeremy Ash, Atina Brooks
Kearsley, S.K., Sallamack, S., Fluder, E.M., Andose, J.D., Mosley, R.T., and Sheridan, R.P. (1996). Chemical similarity using physicochemical property descriptors. J. Chem. Inf. Comput. Sci. 36, 118-127.
Kramer, C. Y. (1956). Extension of multiple range tests to group means with unequal numbers of replications. Biometrics 12, 307-310.
Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript. In The Collected Works of John W. Tukey VIII. Multiple Comparisons: 1948-1983, Chapman and Hall, New York.
chemmodlab, ModelTrain
## Not run:
# A data set with binary response and multiple descriptor sets
data(aid364)
cml <- ModelTrain(aid364, ids = TRUE, xcol.lengths = c(24, 147),
                  des.names = c("BurdenNumbers", "Pharmacophores"))
CombineSplits(cml)
## End(Not run)
# A continuous response
cml <- ModelTrain(USArrests, nsplits = 2, nfolds = 2,
                  models = c("KNN", "Lasso", "Tree"))
CombineSplits(cml)