View source: R/query-bagging.r
The 'query by bagging' approach to active learning applies bootstrap
aggregating (bagging) by randomly sampling with replacement C times from the
training data to create a committee of C classifiers. Our goal is to "query
the oracle" with the observations that have the maximum disagreement among
the C trained classifiers.
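The resampling step described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `bootstrap_committee` and the toy nearest-centroid `centroid_fit` are hypothetical names introduced here, and the labeled rows are assumed to be those where `y` is not `NA`.

```r
# Hypothetical helper (not part of the package): build a committee of C
# models, each fit to a bootstrap resample of the labeled observations.
bootstrap_committee <- function(x, y, fit_f, C = 50, ...) {
  labeled <- which(!is.na(y))
  lapply(seq_len(C), function(i) {
    # Sample labeled rows with replacement, same size as the labeled set.
    idx <- sample(labeled, size = length(labeled), replace = TRUE)
    fit_f(x[idx, , drop = FALSE], y[idx], ...)
  })
}

# Toy "classifier" for illustration: stores one centroid per observed class.
centroid_fit <- function(x, y, ...) {
  y <- droplevels(factor(y))
  lapply(split(as.data.frame(x), y), colMeans)
}

set.seed(42)
x <- iris[, -5]
y <- iris[, 5]
y[-c(1:10, 51:60, 101:110)] <- NA  # keep only 30 labels

committee <- bootstrap_committee(x, y, centroid_fit, C = 5)
length(committee)
```

Each committee member sees a slightly different resample, which is what produces disagreement on the unlabeled observations.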
query_bagging(x, y, fit_f, predict_f, disagreement = c("kullback",
  "vote_entropy", "post_entropy"), num_query = 1, C = 50, ...)
x | a matrix containing the labeled and unlabeled data
y | a vector of the labels for each observation in
disagreement | a string that contains the disagreement measure among the committee members. See above for details.
num_query | the number of observations to be queried.
C | the number of bootstrap committee members
... | additional arguments passed to the function specified in
fit | a function that has arguments
predict | a function that generates predictions for each sub-model. See
Note that this approach is similar to "Query by Committee" (QBC) in
query_committee, but each committee member uses the same classifier trained
on a resampled subset of the labeled training data.
To determine maximum disagreement among bagged committee members, we have
implemented three approaches:
kullback: query the unlabeled observation that maximizes the Kullback-Leibler divergence between the label distributions of any one committee member and the consensus
vote_entropy: query the unlabeled observation that maximizes the vote entropy among all committee members
post_entropy: query the unlabeled observation that maximizes the entropy of average posterior probabilities of all committee members
To calculate the committee disagreement, we use the formulae from Dr. Burr Settles' excellent "Active Learning Literature Survey" available at http://burrsettles.com/pub/settles.activelearning.pdf.
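As a rough illustration of the three measures, the sketch below computes them in base R from a list of per-member posterior matrices. The function names, the `posteriors` input layout, and the toy data are assumptions made for this example and are not taken from the package source. The Kullback-Leibler version averages the divergence over committee members, as in Settles' survey.

```r
# Shannon entropy of a probability vector, treating 0 * log(0) as 0.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# `posteriors`: list of C matrices (one per committee member), each with one
# row per unlabeled observation and one column per class.
consensus <- function(posteriors) Reduce(`+`, posteriors) / length(posteriors)

# Entropy of the average posterior probabilities.
post_entropy <- function(posteriors) {
  apply(consensus(posteriors), 1, entropy)
}

# Entropy of the hard votes; each member votes for its most probable class.
vote_entropy <- function(posteriors) {
  K <- ncol(posteriors[[1]])
  votes <- sapply(posteriors, function(p) max.col(p, ties.method = "first"))
  votes <- matrix(votes, ncol = length(posteriors))
  apply(votes, 1, function(v) entropy(tabulate(v, nbins = K) / length(v)))
}

# Average KL divergence from each member's posterior to the consensus.
kullback <- function(posteriors) {
  avg <- consensus(posteriors)
  kl <- lapply(posteriors, function(p) {
    rowSums(ifelse(p > 0, p * log(p / avg), 0))
  })
  Reduce(`+`, kl) / length(posteriors)
}

# Toy committee of two members, two observations, two classes.
# Members agree on observation 1 and disagree on observation 2.
p1 <- rbind(c(0.9, 0.1), c(0.9, 0.1))
p2 <- rbind(c(0.9, 0.1), c(0.1, 0.9))
post <- list(p1, p2)

ve <- vote_entropy(post)
pe <- post_entropy(post)
kl <- kullback(post)
```

In this toy case all three measures rank observation 2 above observation 1, so it would be the one queried.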
Unlabeled observations in y are assumed to have NA for a label.
It is often convenient to query unlabeled observations in batch. By default,
we query the unlabeled observation with the largest uncertainty measure
value. With the num_query argument, the user can specify the number of
observations to return in batch. If there are ties in the uncertainty
measure values, they are broken by the order in which the unlabeled
observations are given.
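The batch selection and tie-breaking rule can be illustrated with a small base R sketch. `top_queries` is a hypothetical helper written for this example, not a function from the package; it relies on `order()` leaving tied values in their original order.

```r
# Hypothetical helper: return the `num_query` unlabeled observations with the
# largest disagreement. Ties keep the order in which observations were given.
top_queries <- function(disagreement, unlabeled, num_query = 1) {
  ranked <- order(disagreement, decreasing = TRUE)
  unlabeled[head(ranked, num_query)]
}

# Observations 12, 13 and 15 are tied for the largest disagreement.
disagreement <- c(0.2, 0.7, 0.7, 0.1, 0.7)
unlabeled <- c(11, 12, 13, 14, 15)

top_queries(disagreement, unlabeled, num_query = 2)  # 12 13
```

With `num_query = 2`, the first two of the three tied observations are returned, since they appear earliest in the input.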
A parallel backend can be registered for building a QBB model using multiple
workers. For more details, see train or
http://topepo.github.io/caret/parallel.html.
a list that indicates which observations to query, along with the
disagreement values of the unlabeled observations.
x <- iris[, -5]
y <- iris[, 5]

# For demonstration, suppose that few observations are labeled in 'y'.
y <- replace(y, -c(1:10, 51:60, 101:110), NA)

fit_f <- function(x, y, ...) {
  MASS::lda(x, y, ...)
}

predict_f <- function(object, x) {
  predict(object, x)$posterior
}

query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10)
query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10,
              disagreement="vote_entropy", num_query=5)