Outlier Removal Boosting Filter

Description

Ensemble-based filter for removing label noise from a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.

Usage

## S3 method for class 'formula'
ORBoostFilter(formula, data, ...)

## Default S3 method:
ORBoostFilter(x, N = 20, d = 11, Naux = max(20, N),
  useDecisionStump = FALSE, classColumn = ncol(x), ...)

Arguments

formula

A formula describing the classification variable and the attributes to be used.

data, x

Data frame containing the training dataset to be filtered.

...

Optional parameters to be passed to other methods.

N

Number of boosting iterations.

d

Threshold for removing noisy instances. The authors recommend a value between 3 and 20. If set to NULL, the optimal threshold is chosen by the procedure described in Karmaker & Kwek; this can be very time-consuming, however, and in most cases has little effect on the final result.

Naux

Number of boosting iterations for AdaBoost when computing the optimal threshold 'd'.

useDecisionStump

If TRUE, a decision stump is used as the weak classifier. Otherwise (default), naive Bayes is applied. Note that decision stumps are not appropriate for multi-class problems.

classColumn

Positive integer indicating the column that contains the class labels (a factor). By default, the last column is used.

Details

A full description of the ORBoostFilter method can be found in Karmaker & Kwek. In general terms, a weak classifier is built in each iteration, and misclassified instances have their weight increased for the next round. An instance is removed once its weight exceeds the threshold d, i.e. after it has been misclassified in consecutive rounds.
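
The interplay between weight growth and the threshold d can be illustrated with a toy sketch. This is not the code used by ORBoostFilter: the function name toyWeightFilter, the doubling/halving update rule and the precomputed misclassification matrix are all assumptions made for illustration, whereas the real method derives its weight updates from boosting as described in Karmaker & Kwek.

```r
# Toy illustration only; this is NOT the implementation behind ORBoostFilter.
# Hypothetical rule: every weight starts at 1, is doubled after a
# misclassification and halved after a correct classification; an instance
# is dropped as soon as its weight exceeds the threshold d.
toyWeightFilter <- function(miss, d = 11) {
  # miss: logical matrix, one row per instance and one column per round;
  #       miss[i, t] is TRUE if instance i is misclassified in round t.
  w <- rep(1, nrow(miss))
  removed <- rep(FALSE, nrow(miss))
  for (t in seq_len(ncol(miss))) {
    active <- !removed
    # Update only the instances that are still in the training set.
    w[active] <- ifelse(miss[active, t], w[active] * 2, w[active] / 2)
    removed <- removed | (w > d)
  }
  which(removed)
}

# Instance 1 is misclassified in four consecutive rounds, so its weight
# grows 1 -> 2 -> 4 -> 8 -> 16 and exceeds d = 11.
# Instance 2 alternates between misses and hits, so its weight stays small.
miss <- rbind(c(TRUE, TRUE,  TRUE, TRUE),
              c(TRUE, FALSE, TRUE, FALSE))
removedIdx <- toyWeightFilter(miss, d = 11)
removedIdx
```

Under this simplified rule only the persistently misclassified instance is flagged, which mirrors the intuition that a larger d makes the filter more conservative.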

Value

An object of class filter, which is a list with seven components:

  • cleanData is a data frame containing the filtered dataset.

  • remIdx is a vector of integers indicating the indexes for removed instances (i.e. their row number with respect to the original data frame).

  • repIdx is a vector of integers indicating the indexes for repaired/relabelled instances (i.e. their row number with respect to the original data frame).

  • repLab is a factor containing the new labels for repaired instances.

  • parameters is a list containing the argument values.

  • call contains the original call to the filter.

  • extraInf is a character string with additional information not covered by the previous items.

Note

The number of noisy instances removed in each iteration is reported on the console by means of a message.

References

Karmaker A., Kwek S. (2005, November): A boosting approach to remove class label noise. In Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 6 pp. IEEE.

Freund Y., Schapire R. E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.

Examples

# Next example is not run in order to save time
## Not run: 
data(iris)
out <- ORBoostFilter(Species~., data = iris, N = 10)
summary(out)
identical(out$cleanData, iris[setdiff(1:nrow(iris),out$remIdx),])

## End(Not run)
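
The default S3 method can also be called directly on a data frame. The following is a minimal sketch, assuming the function is provided by the NoiseFiltersR package (the package name is not stated on this page); the object names iris2 and out2, the two-class subset and the value d = 5 are illustrative choices only. The call is guarded so that the snippet does nothing if the package is not installed.

```r
# Two-class subset of iris; droplevels() removes the unused 'setosa' level.
data(iris)
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# Assumption: ORBoostFilter lives in the NoiseFiltersR package.
if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  # Default method: pass the data frame as 'x' and point 'classColumn'
  # at the class variable (column 5 here, which is also the last column).
  out2 <- NoiseFiltersR::ORBoostFilter(iris2, N = 10, d = 5, classColumn = 5)
  # Number of instances flagged as noisy and removed.
  print(length(out2$remIdx))
}
```

Because the subset has only two classes, it would also be a candidate for useDecisionStump = TRUE.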