uncertainty_sampling: Active Learning with Uncertainty Sampling

View source: R/uncertainty-sampling.r

Description

The 'uncertainty sampling' approach to active learning identifies the unlabeled observation about which the user-specified supervised classifier is "least certain." In the active learning framework, this "least certain" observation is then submitted to the oracle, which supplies its label.

Usage

uncertainty_sampling(x, y, uncertainty = "entropy", classifier,
  num_query = 1, ...)

Arguments

x

a matrix containing the labeled and unlabeled data

y

a vector of the labels for each observation in x. Use NA for unlabeled.

uncertainty

a string that names the uncertainty measure: "entropy", "least_confidence", or "margin". See Details.

classifier

a string that names the supervised classifier, as given in the caret package.

num_query

the number of observations to be queried.

...

additional arguments that are sent to the caret classifier.

Details

The term "least certain" is quite general, but we have implemented three of the most widely used measures:

entropy

query the unlabeled observation that maximizes the entropy of the posterior class probabilities under the trained classifier

least_confidence

query the unlabeled observation whose largest posterior probability (that of its most likely class) is smallest under the trained classifier

margin

query the unlabeled observation that minimizes the difference between the two largest posterior probabilities under the trained classifier
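The three measures above can be sketched in base R as follows. This is only an illustration, not the package's internal code: the helper name uncertainty_scores and the posterior matrix are hypothetical, and each score is oriented so that a larger value means "less certain."

```r
# Hypothetical helper (not part of the package): 'posterior' is a matrix with
# one row per unlabeled observation and one column per class; rows sum to 1.
uncertainty_scores <- function(posterior,
                               uncertainty = c("entropy", "least_confidence", "margin")) {
  uncertainty <- match.arg(uncertainty)
  switch(uncertainty,
    # Shannon entropy of the posterior; zero probabilities contribute nothing.
    entropy = apply(posterior, 1, function(p) {
      p <- p[p > 0]
      -sum(p * log(p))
    }),
    # One minus the probability of the most likely class.
    least_confidence = 1 - apply(posterior, 1, max),
    # Negated gap between the two largest probabilities, so larger = less certain.
    margin = apply(posterior, 1, function(p) {
      top_two <- sort(p, decreasing = TRUE)[1:2]
      -(top_two[1] - top_two[2])
    })
  )
}

# Made-up posterior probabilities for three unlabeled observations.
posterior <- rbind(
  c(0.90, 0.05, 0.05),
  c(0.40, 0.35, 0.25),
  c(0.50, 0.49, 0.01)
)

which.max(uncertainty_scores(posterior, "entropy"))           # 2
which.max(uncertainty_scores(posterior, "least_confidence"))  # 2
which.max(uncertainty_scores(posterior, "margin"))            # 3
```

With three classes the measures can disagree: the second row has the flattest posterior, while the third row has the smallest gap between its top two classes.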

The uncertainty argument must be one of these three; entropy is the default. Note that with binary classification the three methods are equivalent: they yield the same observation to be queried.
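The binary case can be checked numerically. In the sketch below, the probabilities are made up for illustration; with two classes, every measure depends only on how close the larger posterior probability is to 0.5, so all three select the same observation.

```r
# Made-up posterior probabilities of the first class for five observations.
p1 <- c(0.95, 0.62, 0.48, 0.10, 0.70)
posterior <- cbind(p1, 1 - p1)

# Each score is oriented so that a larger value means "less certain".
entropy          <- -rowSums(posterior * log(posterior))
least_confidence <- 1 - pmax(posterior[, 1], posterior[, 2])
margin           <- -abs(posterior[, 1] - posterior[, 2])

# All three measures pick the observation closest to a 50/50 split.
c(which.max(entropy), which.max(least_confidence), which.max(margin))  # 3 3 3
```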

We require a user-specified supervised classifier from the caret R package. Furthermore, we assume that the classifier returns posterior probabilities of class membership; otherwise, an error is thrown. To obtain a list of valid classifiers, see the caret vignettes, which are available on CRAN, or the modelLookup function in caret.
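One way to check in advance whether a caret method can return class probabilities is to inspect the output of caret's modelLookup. The helper below is a hypothetical convenience wrapper, not part of this package, and it assumes caret is installed (it returns NA otherwise).

```r
# Hypothetical helper: TRUE if caret reports that 'method' is a classification
# model that can produce class probabilities; NA if caret is not installed.
supports_class_probs <- function(method) {
  if (!requireNamespace("caret", quietly = TRUE)) {
    return(NA)
  }
  # modelLookup() returns one row per tuning parameter, with logical
  # columns 'forClass' and 'probModel'.
  info <- caret::modelLookup(method)
  all(info$forClass) && all(info$probModel)
}

supports_class_probs("lda")
supports_class_probs("qda")
```

Calling caret::modelLookup() with no argument returns the same table for every available method, which can be filtered on the probModel column.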

Additional arguments to the specified caret classifier can be passed via ....

Unlabeled observations in y are assumed to have NA for a label.

It is often convenient to query unlabeled observations in batches. By default, we query the single unlabeled observation with the largest uncertainty measure value. With the num_query argument, the user can specify the number of observations to return in a batch. Ties in the uncertainty measure values are broken by the order in which the unlabeled observations are given.
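The batch selection and tie-breaking rule described above can be sketched as follows. The function select_queries and the example scores are hypothetical; in particular, returning positions within the full data is an assumption made for the illustration, not a statement about what the package returns.

```r
# Hypothetical helper: pick the 'num_query' most uncertain observations.
# 'scores' holds one uncertainty value per unlabeled observation (larger =
# less certain); 'unlabeled_index' gives their positions in the full data.
select_queries <- function(scores, unlabeled_index, num_query = 1) {
  # Sort by decreasing score; break ties by original position.
  ranking <- order(-scores, seq_along(scores))
  unlabeled_index[head(ranking, num_query)]
}

y <- c("a", NA, "b", NA, NA, "a", NA)
unlabeled_index <- which(is.na(y))     # positions 2, 4, 5, 7
scores <- c(0.30, 0.90, 0.90, 0.10)    # made-up scores; two are tied

select_queries(scores, unlabeled_index, num_query = 1)  # 4
select_queries(scores, unlabeled_index, num_query = 3)  # 4 5 2
```

Passing seq_along(scores) as a second sort key makes the tie-breaking explicit instead of relying on the stability of the sorting method.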

Value

a list that contains the least_certain observation and miscellaneous results. See above for details.

Examples

# Load the package that provides uncertainty_sampling()
library(activelearning)

x <- iris[, -5]
y <- iris[, 5]

# For demonstration, suppose that few observations are labeled in 'y'.
y <- replace(y, -c(1:10, 51:60, 101:110), NA)

uncertainty_sampling(x=x, y=y, classifier="lda")
uncertainty_sampling(x=x, y=y, uncertainty="entropy",
                    classifier="qda", num_query=5)

ramhiser/activelearning documentation built on May 26, 2019, 10:06 p.m.