query_bagging: Active learning with "Query by Bagging"


View source: R/query-bagging.r

Description

The 'query by bagging' approach to active learning applies bootstrap aggregating (bagging) by randomly sampling with replacement C times from the training data to create a committee of C classifiers. Our goal is to "query the oracle" with the observations that have the maximum disagreement among the C trained classifiers.

Usage

query_bagging(x, y, fit_f, predict_f, disagreement = c("kullback",
  "vote_entropy", "post_entropy"), num_query = 1, C = 50, ...)

Arguments

x

a matrix containing the labeled and unlabeled data

y

a vector of the labels for each observation in x. Use NA for unlabeled observations.

disagreement

a string that names the disagreement measure used among the committee members: "kullback", "vote_entropy", or "post_entropy". See Details.

num_query

the number of observations to be queried.

C

the number of bootstrap committee members

...

additional arguments passed to the function specified in fit_f

fit_f

a function that has arguments x, y, and ... and produces a model object that can later be used for prediction. See bagControl for more details.

predict_f

a function that generates predictions for each sub-model. See bagControl for more details.

Details

Note that this approach is similar to "Query by Committee" (QBC) in query_committee, but each committee member uses the same classifier trained on a resampled subset of the labeled training data.
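The committee construction described above can be sketched as follows. This is a minimal illustration, not the package's implementation: build_committee is a hypothetical helper, and it assumes that predict_f returns a matrix of posterior probabilities, as in the Examples section. MASS::lda is used only because the Examples use it.

```r
# Hypothetical helper (not part of the package): build a bagged committee and
# return each member's posterior matrix for the unlabeled observations.
build_committee <- function(x, y, fit_f, predict_f, C = 10, ...) {
  labeled <- which(!is.na(y))
  unlabeled <- which(is.na(y))
  lapply(seq_len(C), function(i) {
    # One bootstrap replicate: resample the labeled rows with replacement.
    boot <- labeled[sample.int(length(labeled), replace = TRUE)]
    model <- fit_f(x[boot, , drop = FALSE], y[boot], ...)
    predict_f(model, x[unlabeled, , drop = FALSE])
  })
}

set.seed(42)
x <- iris[, -5]
y <- replace(iris[, 5], -c(1:10, 51:60, 101:110), NA)
fit_f <- function(x, y, ...) MASS::lda(x, y, ...)
predict_f <- function(object, x) predict(object, x)$posterior

# A list of 5 matrices, one per committee member, with one row per
# unlabeled observation and one column per class.
committee <- build_committee(x, y, fit_f, predict_f, C = 5)
```

Every member is fit by the same fit_f; only the bootstrap sample differs, which is what distinguishes this scheme from a committee of different classifiers.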

To determine maximum disagreement among bagged committee members, we have implemented three approaches:

kullback

query the unlabeled observation that maximizes the Kullback-Leibler divergence between the label distributions of any one committee member and the consensus

vote_entropy

query the unlabeled observation that maximizes the vote entropy among all committee members

post_entropy

query the unlabeled observation that maximizes the entropy of average posterior probabilities of all committee members
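As a rough sketch of how the three measures can be computed, the code below follows the definitions in Settles' survey: vote entropy over hard votes, entropy of the averaged posterior, and the average Kullback-Leibler divergence of each member from the consensus. The function names row_entropy and disagreement_sketch are hypothetical, and the input is assumed to be a list of C posterior matrices (observations in rows, classes in columns); the package's own code may differ in detail.

```r
# Shannon entropy of each row of a probability matrix, treating 0 * log(0) as 0.
row_entropy <- function(p) {
  -rowSums(ifelse(p > 0, p * log(p), 0))
}

# Hypothetical helper: disagreement scores from a list of posterior matrices.
disagreement_sketch <- function(posteriors) {
  C <- length(posteriors)
  K <- ncol(posteriors[[1]])

  # Consensus distribution: the average posterior over committee members.
  consensus <- Reduce(`+`, posteriors) / C

  # Hard votes: the most probable class index for each member (n x C matrix).
  votes <- do.call(cbind, lapply(posteriors, max.col, ties.method = "first"))
  # Share of committee votes that each class receives (n x K matrix).
  vote_share <- t(apply(votes, 1, tabulate, nbins = K)) / C

  # KL divergence of each member's posterior from the consensus, per row.
  kl_each <- lapply(posteriors, function(p) {
    rowSums(ifelse(p > 0, p * log(p / consensus), 0))
  })

  list(
    vote_entropy = row_entropy(vote_share),
    post_entropy = row_entropy(consensus),
    kullback     = Reduce(`+`, kl_each) / C
  )
}

# Toy committee of two members: they agree on observation 1 and
# disagree on observation 2.
p1 <- matrix(c(0.9, 0.1,
               0.9, 0.1), nrow = 2, byrow = TRUE)
p2 <- matrix(c(0.9, 0.1,
               0.1, 0.9), nrow = 2, byrow = TRUE)
d <- disagreement_sketch(list(p1, p2))
```

In this toy case all three measures rank observation 2 above observation 1, so it is the one that would be queried.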

To calculate the committee disagreement, we use the formulae from Dr. Burr Settles' excellent "Active Learning Literature Survey" available at http://burrsettles.com/pub/settles.activelearning.pdf.

Unlabeled observations in y are assumed to have NA for a label.

It is often convenient to query unlabeled observations in batch. By default, we query the single unlabeled observation with the largest disagreement value. With the num_query argument, the user can specify the number of observations to return in a batch. Ties in the disagreement values are broken by the order in which the unlabeled observations are given.
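The batch selection and tie-breaking rule can be illustrated with a short sketch. The helper select_queries and its arguments are hypothetical and only demonstrate the behaviour described above; they are not the package's internals.

```r
# Hypothetical helper: return the indices of the num_query observations with
# the largest scores. order() is stable, so tied scores keep the order in
# which the unlabeled observations were given.
select_queries <- function(scores, unlabeled_idx, num_query = 1) {
  ranked <- order(-scores)
  unlabeled_idx[head(ranked, num_query)]
}

scores <- c(0.2, 0.7, 0.7, 0.1)     # made-up scores for four unlabeled rows
unlabeled_idx <- c(11, 12, 13, 14)  # their positions in the full data set

top1 <- select_queries(scores, unlabeled_idx)                 # 12
top3 <- select_queries(scores, unlabeled_idx, num_query = 3)  # 12 13 11
```

Rows 12 and 13 are tied, and row 12 is returned first because it appears first among the unlabeled observations.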

A parallel backend can be registered for building a QBB model using multiple workers. For more details, see train or http://topepo.github.io/caret/parallel.html.

Value

a list that indicates which observations to query, along with the disagreement values of the unlabeled observations.

Examples

x <- iris[, -5]
y <- iris[, 5]

# For demonstration, suppose that few observations are labeled in 'y'.
y <- replace(y, -c(1:10, 51:60, 101:110), NA)

fit_f <- function(x, y, ...) {
  MASS::lda(x, y, ...)
}
predict_f <- function(object, x) {
  predict(object, x)$posterior
}

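# Query one observation using the default disagreement measure.
query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10)

# Query a batch of five observations using vote entropy.
query_bagging(x=x, y=y, fit_f=fit_f, predict_f=predict_f, C=10,
              disagreement="vote_entropy", num_query=5)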

ramhiser/activelearning documentation built on May 26, 2019, 10:06 p.m.