query_bagging: Active learning with "Query by Bagging"
In ramhiser/activelearning: A Collection of Active Learning Methods in R


View source: R/query-bagging.r

The 'query by bagging' approach to active learning applies bootstrap aggregating (bagging) by randomly sampling with replacement C times from the training data to create a committee of C classifiers. Our goal is to "query the oracle" with the observations that have the maximum disagreement among the C trained classifiers.
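The committee-building step described above can be sketched in a few lines of base R. This is only an illustration, not the package's internal code: the helper `bootstrap_committee` and the placeholder `toy_fit` are hypothetical names, and the sketch assumes that each resample is drawn from the labeled rows only and has the same size as the labeled set.

```r
# Hypothetical helper (not part of the package): train C models, each on a
# bootstrap resample of the labeled observations.
bootstrap_committee <- function(x, y, fit_f, C = 50, ...) {
  labeled <- which(!is.na(y))
  lapply(seq_len(C), function(i) {
    # Draw labeled row indices with replacement.
    idx <- labeled[sample.int(length(labeled), replace = TRUE)]
    fit_f(x[idx, , drop = FALSE], y[idx], ...)
  })
}

# Usage with a placeholder "model" that only stores the column means.
set.seed(1)
x <- iris[, -5]
y <- replace(iris[, 5], -c(1:10, 51:60, 101:110), NA)
toy_fit <- function(x, y, ...) colMeans(x)
committee <- bootstrap_committee(x, y, toy_fit, C = 5)
length(committee)
```

Any function with the `fit_f` signature can be substituted for `toy_fit`; a real classifier may need extra care when a resample happens to omit a class.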

query_bagging(x, y, fit_f, predict_f, disagreement = c("kullback", "vote_entropy", "post_entropy"), num_query = 1, C = 50, ...)

`x`	a matrix containing the labeled and unlabeled data
`y`	a vector of the labels for each observation in `x`. Use `NA` for unlabeled observations.
`disagreement`	a string that names the disagreement measure to use among the committee members. See the details below.
`num_query`	the number of observations to be queried.
`C`	the number of bootstrap committee members
`...`	additional arguments passed to the function specified in `fit_f`
`fit_f`	a function that has arguments `x`, `y`, and `...` and produces a model object that can later be used for prediction. See `bagControl` for more details.
`predict_f`	a function that generates predictions for each sub-model. See `bagControl` for more details.

Note that this approach is similar to "Query by Committee" (QBC) in query_committee, but each committee member uses the same classifier trained on a resampled subset of the labeled training data.

To determine maximum disagreement among bagged committee members, we have implemented three approaches:

kullback: query the unlabeled observation that maximizes the Kullback-Leibler divergence between the label distributions of any one committee member and the consensus
vote_entropy: query the unlabeled observation that maximizes the vote entropy among all committee members
post_entropy: query the unlabeled observation that maximizes the entropy of average posterior probabilities of all committee members
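As a rough illustration of the three measures, the sketch below computes them from a committee's posterior probabilities, following the definitions in Settles' survey (the Kullback-Leibler variant here averages the divergence of each member from the consensus). The function names `entropy_rows` and `disagreement_sketch` and the input layout, a list of C matrices with one row per unlabeled observation and one column per class, are assumptions made for this example and need not match the package's internals.

```r
# Row-wise entropy, treating 0 * log(0) as 0.
entropy_rows <- function(p) {
  -rowSums(ifelse(p > 0, p * log(p), 0))
}

# Hypothetical helper: `post` is a list of C posterior matrices (n x K).
disagreement_sketch <- function(post) {
  C <- length(post)
  n <- nrow(post[[1]])
  K <- ncol(post[[1]])
  consensus <- Reduce(`+`, post) / C  # average posterior over the committee

  # Average KL divergence of each member from the consensus.
  kl <- Reduce(`+`, lapply(post, function(p) {
    rowSums(ifelse(p > 0, p * log(p / consensus), 0))
  })) / C

  # Each member votes for its most probable class (first class on ties).
  votes <- matrix(
    unlist(lapply(post, function(p) max.col(p, ties.method = "first"))),
    nrow = n
  )
  vote_frac <- matrix(0, nrow = n, ncol = K)
  for (k in seq_len(K)) vote_frac[, k] <- rowMeans(votes == k)

  list(
    kullback     = kl,
    vote_entropy = entropy_rows(vote_frac),
    post_entropy = entropy_rows(consensus)
  )
}

# Two members, two observations, two classes.
post <- list(
  matrix(c(0.9, 0.1, 0.5, 0.5), nrow = 2, byrow = TRUE),
  matrix(c(0.1, 0.9, 0.5, 0.5), nrow = 2, byrow = TRUE)
)
d <- disagreement_sketch(post)
```

In this toy input the members disagree sharply on the first observation and agree on the second, so `kullback` and `vote_entropy` are larger for the first, while `post_entropy` is identical for both because the averaged posteriors coincide.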

To calculate the committee disagreement, we use the formulae from Dr. Burr Settles' excellent "Active Learning Literature Survey" available at http://burrsettles.com/pub/settles.activelearning.pdf.

Unlabeled observations in y are assumed to have NA for a label.

It is often convenient to query unlabeled observations in batch. By default, we query the single unlabeled observation with the largest disagreement value. With `num_query`, the user can specify the number of observations to return in batch. If there are ties in the disagreement values, they are broken by the order in which the unlabeled observations are given.
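The batch selection and tie-breaking rule can be illustrated as follows. The helper `select_queries` and its arguments are hypothetical and serve only to show the ordering logic; the package may implement this step differently.

```r
# Hypothetical helper: return the indices of the `num_query` observations with
# the largest scores, breaking ties by original position.
select_queries <- function(scores, unlabeled, num_query = 1) {
  # Sort by decreasing score; the second key keeps tied scores in input order.
  ord <- order(-scores, seq_along(scores))
  unlabeled[head(ord, num_query)]
}

scores    <- c(0.2, 0.7, 0.7, 0.1)  # one score per unlabeled observation
unlabeled <- c(11, 12, 13, 14)      # row indices of the unlabeled observations
batch <- select_queries(scores, unlabeled, num_query = 2)
```

Here observations 12 and 13 share the largest score, so a batch of two returns them in the order given, and a batch of one returns 12.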

A parallel backend can be registered to build a QBB model using multiple workers. For more details, see `train` in the caret package or http://topepo.github.io/caret/parallel.html.

a list that indicates which observations to query, along with the disagreement values of the unlabeled observations.

x <- iris[, -5]
y <- iris[, 5]

# For demonstration, suppose that few observations are labeled in 'y'.
y <- replace(y, -c(1:10, 51:60, 101:110), NA)

fit_f <- function(x, y, ...) {
  MASS::lda(x, y, ...)
}
predict_f <- function(object, x) {
  predict(object, x)$posterior
}

query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10)
query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10,
              disagreement="vote_entropy", num_query=5)