bn.cv — R Documentation

Cross-validation for Bayesian networks

View source: R/frontend-bootstrap.R

Description

Perform a k-fold or hold-out cross-validation for a learning algorithm or a fixed network structure.
Usage

bn.cv(data, bn, loss = NULL, ..., algorithm.args = list(),
      loss.args = list(), fit, fit.args = list(), method = "k-fold",
      cluster, debug = FALSE)

## S3 method for class 'bn.kcv'
plot(x, ..., main, xlab, ylab, connect = FALSE)

## S3 method for class 'bn.kcv.list'
plot(x, ..., main, xlab, ylab, connect = FALSE)

loss(x)
Arguments

data: a data frame containing the variables in the model.

bn: either a character string (the label of the learning algorithm to be applied to the training data in each iteration) or an object of class bn (a fixed network structure).

loss: a character string, the label of a loss function. If none is specified, the default loss function is the Classification Error for Bayesian network classifiers, and the Log-Likelihood Loss for both discrete and continuous data sets otherwise. See below for additional details.

algorithm.args: a list of extra arguments to be passed to the learning algorithm.

loss.args: a list of extra arguments to be passed to the loss function specified by loss.

fit: a character string, the label of the method used to fit the parameters of the network. See bn.fit for details.

fit.args: additional arguments for the parameter estimation procedure, see again bn.fit for details.

method: a character string, either k-fold, custom-folds or hold-out. See below for details.

cluster: an optional cluster object from package parallel, used to process the folds in parallel (a usage sketch follows the argument list).

debug: a boolean value. If TRUE a lot of debugging output is printed; otherwise the function is completely silent.

x: an object of class bn.kcv or bn.kcv.list.

...: additional objects of class bn.kcv or bn.kcv.list to plot alongside the first.

main, xlab, ylab: the title of the plot, an array of labels for the boxplots, the label of the y axis.
connect: a logical value. If TRUE, the loss estimates of corresponding folds in different objects are connected by lines in the plot.
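As referenced above, a minimal sketch of the cluster argument (the cluster size here is arbitrary):

library(bnlearn)
library(parallel)
data(learning.test)

cl = makeCluster(2)                       # two worker processes
bn.cv(learning.test, "hc", cluster = cl)  # folds are processed in parallel
stopCluster(cl)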
Value

bn.cv() returns an object of class bn.kcv.list if runs is at least 2, an object of class bn.kcv if runs is equal to 1.

loss() returns a numeric vector with a length equal to runs.
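A minimal sketch of the two return types, using the learning.test data set shipped with bnlearn:

library(bnlearn)
data(learning.test)

cv.single = bn.cv(learning.test, "hc")            # runs defaults to 1
cv.multi = bn.cv(learning.test, "hc", runs = 10)
class(cv.single)   # "bn.kcv"
class(cv.multi)    # "bn.kcv.list"
loss(cv.multi)     # numeric vector of length 10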
Details

The following cross-validation methods are implemented:

k-fold: the data are split into k subsets of equal size. For each subset in turn, bn is fitted (and possibly learned as well) on the other k - 1 subsets and the loss function is then computed using that subset. Loss estimates for each of the k subsets are then combined to give an overall loss for data.

custom-folds: the data are manually partitioned by the user into subsets, which are then used as in k-fold cross-validation. Subsets are not constrained to have the same size, and every observation must be assigned to one subset.

hold-out: k subsamples of size m are sampled independently without replacement from the data. For each subsample, bn is fitted (and possibly learned) on the remaining nrow(data) - m samples and the loss function is computed on the m observations in the subsample. The overall loss estimate is the average of the k loss estimates from the subsamples.

If cross-validation is used with multiple runs, the overall loss is the average of the loss estimates from the different runs.
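The three methods differ only in the method argument and its companion arguments; a sketch of the corresponding calls (fold counts and sizes here are arbitrary):

library(bnlearn)
data(gaussian.test)

# k-fold (the default): 10 folds unless k is specified.
bn.cv(gaussian.test, "hc")

# hold-out: 10 training/test splits with test sets of 100 observations.
bn.cv(gaussian.test, "hc", method = "hold-out", k = 10, m = 100)

# custom-folds: the user partitions the rows explicitly.
groups = rep(1:4, length.out = nrow(gaussian.test))
bn.cv(gaussian.test, "hc", method = "custom-folds",
      folds = split(seq_len(nrow(gaussian.test)), groups))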
Cross-validation methods accept the following optional arguments:

k: a positive integer number, the number of groups into which the data will be split (in k-fold cross-validation) or the number of times the data will be split in training and test samples (in hold-out cross-validation).

m: a positive integer number, the size of the test set in hold-out cross-validation.

runs: a positive integer number, the number of times k-fold or hold-out cross-validation will be run.

folds: a list in which each element corresponds to one fold and contains the indices of the observations that are included in that fold; or a list with an element for each run, in which each element is itself a list of the folds to be used for that run (both shapes are sketched after this list).
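The two accepted shapes of folds can be sketched as follows (the fold boundaries are arbitrary; learning.test has 5000 rows):

library(bnlearn)
data(learning.test)

# a single run: one list, with one element of row indices per fold.
single.run = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, "hc", method = "custom-folds", folds = single.run)

# two runs: a list with one element per run, each itself a list of folds.
two.runs = list(single.run, list(1:2500, 2501:5000))
bn.cv(learning.test, "hc", method = "custom-folds", folds = two.runs)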
The following loss functions are implemented:

Log-Likelihood Loss (logl): also known as negative entropy or negentropy, it is the negated expected log-likelihood of the test set for the Bayesian network fitted from the training set. Lower values are better.

Classification Error (pred): the prediction error for a single discrete node. Lower values are better.

Exact Classification Error (pred-exact): closed-form exact posterior predictions are available for Bayesian network classifiers. Lower values are better.

Predictive Correlation (cor): the correlation between the observed and the predicted values for a single continuous node. Higher values are better.

Mean Squared Error (mse): the mean squared error between the observed and the predicted values for a single continuous node. Lower values are better.

F1 score (f1): the F1 score between observed and predicted values for both binary and multiclass target variables. Higher values are better.

AUROC (auroc): the area under the ROC curve for both binary and multiclass target variables. The multiclass AUROC score is computed as one-vs-rest by averaging the AUROC for each level of the target variable. Higher values are better.
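The node-specific loss functions above require a target node in loss.args; a sketch using node F of the gaussian.test data set (the choice of node is arbitrary):

library(bnlearn)
data(gaussian.test)

# mean squared error for a continuous node: lower is better.
bn.cv(gaussian.test, "hc", loss = "mse", loss.args = list(target = "F"))

# predictive correlation for the same node: higher is better.
bn.cv(gaussian.test, "hc", loss = "cor", loss.args = list(target = "F"))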
Optional arguments that can be specified in loss.args are:

predict: a character string, the label of the method used to predict the observations in the test set. The default is "parents". Other possible values are the same as in predict().

predict.args: a list containing the optional arguments for the prediction method. See the documentation for predict() for more details.

target: a character string, the label of the target node for prediction in all loss functions but logl, logl-g and logl-cg.
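A sketch combining the three options (the number of samples passed through predict.args is arbitrary):

library(bnlearn)
data(learning.test)

# classification error for node F, predicted by likelihood weighting
# instead of the default "parents" method.
bn.cv(learning.test, "hc", loss = "pred",
      loss.args = list(target = "F", predict = "bayes-lw",
                       predict.args = list(n = 1000)))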
Both plot methods accept any combination of objects of class bn.kcv or bn.kcv.list (the first as the x argument, the remaining as the ... argument) and plot the respective expected loss values side by side. For a bn.kcv object, this means a single point; for a bn.kcv.list object, this means a boxplot.
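A sketch mixing the two classes in a single plot (the algorithms and labels are arbitrary):

library(bnlearn)
data(learning.test)

cv.single = bn.cv(learning.test, "hc")              # bn.kcv: one point
cv.multi = bn.cv(learning.test, "tabu", runs = 10)  # bn.kcv.list: a boxplot
plot(cv.single, cv.multi, xlab = c("HC", "TABU"))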
Author(s)

Marco Scutari
References

Koller D, Friedman N (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
See Also

bn.boot, rbn, bn.kcv-class.
bn.cv(learning.test, 'hc', loss = "pred",
loss.args = list(predict = "bayes-lw", target = "F"))
folds = list(1:2000, 2001:3000, 3001:5000)
bn.cv(learning.test, 'hc', loss = "logl", method = "custom-folds",
folds = folds)
xval = bn.cv(gaussian.test, 'mmhc', method = "hold-out",
k = 5, m = 50, runs = 2)
xval
loss(xval)
## Not run:
# comparing algorithms with multiple runs of cross-validation.
gaussian.subset = gaussian.test[1:50, ]
cv.gs = bn.cv(gaussian.subset, 'gs', runs = 10)
cv.iamb = bn.cv(gaussian.subset, 'iamb', runs = 10)
cv.inter = bn.cv(gaussian.subset, 'inter.iamb', runs = 10)
plot(cv.gs, cv.iamb, cv.inter,
xlab = c("Grow-Shrink", "IAMB", "Inter-IAMB"), connect = TRUE)
# use custom folds.
folds = split(sample(nrow(gaussian.subset)), seq(5))
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
# multiple runs, with custom folds.
folds = replicate(5, split(sample(nrow(gaussian.subset)), seq(5)),
simplify = FALSE)
bn.cv(gaussian.subset, "hc", method = "custom-folds", folds = folds)
## End(Not run)