ORBoostFilter: Outlier Removal Boosting Filter

Description

Ensemble-based filter for removing label noise from a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.

Usage

## S3 method for class 'formula'
ORBoostFilter(formula, data, ...)

## Default S3 method:
ORBoostFilter(x, N = 20, d = 11, Naux = max(20, N),
  useDecisionStump = FALSE, classColumn = ncol(x), ...)

Arguments

formula

A formula describing the classification variable and the attributes to be used.

data, x

Data frame containing the training dataset to be filtered.

...

Optional parameters to be passed to other methods.

N

Number of boosting iterations.

d

Threshold for removing noisy instances. The authors recommend a value between 3 and 20. If set to NULL, the optimal threshold is chosen according to the procedure described in Karmaker & Kwek. However, this can be very time-consuming, and in most cases it has little effect on the final result.

Naux

Number of boosting iterations for AdaBoost when computing the optimal threshold 'd'.

useDecisionStump

If TRUE, a decision stump is used as the weak classifier. Otherwise (the default), naive Bayes is applied. Note that decision stumps are not appropriate for multi-class problems.

classColumn

Positive integer indicating the column which contains the (factor of) classes. By default, the last column is considered.
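
As a minimal sketch of how the default method and classColumn fit together, the snippet below moves the class to the first column of iris and passes its position explicitly. The names iris_first and class_col are illustrative and not part of the package; the call itself is guarded so that it only runs when NoiseFiltersR is installed.

```r
# Illustrative only: 'iris_first' and 'class_col' are hypothetical names.
data(iris)

# Put the class in the first column, so the default classColumn = ncol(x)
# would point at the wrong column.
iris_first <- iris[, c(5, 1:4)]
class_col <- which(names(iris_first) == "Species")

# Call the default (data frame) method only if the package is available.
if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  out <- NoiseFiltersR::ORBoostFilter(iris_first, N = 10, d = 11,
                                      classColumn = class_col)
  print(out$remIdx)  # indices of the instances flagged as noisy
}
```

With the formula interface the class is identified by the formula itself, so classColumn is only a concern when calling the default method directly.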

Details

The full description of the ORBoostFilter method can be found in Karmaker & Kwek. In general terms, a weak classifier is built in each iteration, and misclassified instances have their weight increased for the next round. Instances are removed when their weight exceeds the threshold d, i.e. when they have been misclassified in consecutive rounds.
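
The general idea can be illustrated with a simplified base R sketch. It is not the package's implementation: it assumes a binary problem with labels -1/+1, a one-feature decision stump as weak classifier, the standard AdaBoost weight update, and a hypothetical weight_cap that stands in for the threshold d (the package's scaling of d may differ). The function names fit_stump and sketch_orboost are likewise made up for this illustration.

```r
# Weighted decision stump on one numeric feature; labels are -1 / +1.
# Returns the threshold and polarity with the smallest weighted error.
fit_stump <- function(x, y, w) {
  best <- list(thr = NA, pol = NA, err = Inf)
  for (thr in sort(unique(x))) {
    for (pol in c(1, -1)) {
      pred <- ifelse(x > thr, pol, -pol)
      err <- sum(w[pred != y])
      if (err < best$err) best <- list(thr = thr, pol = pol, err = err)
    }
  }
  best
}

# Simplified boosting loop: instances whose normalised weight exceeds
# 'weight_cap' (a hypothetical stand-in for d) are treated as noisy.
sketch_orboost <- function(x, y, n_rounds = 10, weight_cap = 0.4) {
  n <- length(y)
  w <- rep(1 / n, n)
  removed <- rep(FALSE, n)
  for (round in seq_len(n_rounds)) {
    keep <- which(!removed)
    stump <- fit_stump(x[keep], y[keep], w[keep])
    if (stump$err <= 0 || stump$err >= 0.5) break  # nothing useful to reweight
    alpha <- 0.5 * log((1 - stump$err) / stump$err)
    pred <- ifelse(x[keep] > stump$thr, stump$pol, -stump$pol)
    # Misclassified instances (y * pred == -1) have their weight increased.
    w[keep] <- w[keep] * exp(-alpha * y[keep] * pred)
    w[keep] <- w[keep] / sum(w[keep])
    noisy <- keep[w[keep] > weight_cap]
    if (length(noisy) > 0) {
      removed[noisy] <- TRUE
      w[noisy] <- 0
      w <- w / sum(w)
    }
  }
  which(removed)
}

# Toy data: the label of the third instance has been flipped.
x <- 1:10
y <- c(-1, -1, 1, -1, -1, 1, 1, 1, 1, 1)
removed_idx <- sketch_orboost(x, y)
print(removed_idx)
```

On this toy data the sketch flags the third instance, whose label disagrees with its neighbours, and stops once the remaining instances are perfectly separable.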

Value

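An object of class filter, which is a list with seven components. These include cleanData (the filtered data frame) and remIdx (the indices of the removed instances), both of which are used in 'Examples'.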

Note

The number of noisy instances removed in each iteration is reported in the console by means of a message.

References

Karmaker A., Kwek S. (2005, November): A boosting approach to remove class label noise. In Fifth International Conference on Hybrid Intelligent Systems (HIS'05), 6 pp. IEEE.

Freund Y., Schapire R. E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119-139.

Examples

# Next example is not run in order to save time
## Not run: 
data(iris)
out <- ORBoostFilter(Species~., data = iris, N = 10)
summary(out)
identical(out$cleanData, iris[setdiff(1:nrow(iris), out$remIdx), ])

## End(Not run)

Example output

Iteration 1: 0 noisy instances removed.
Iteration 2: 6 noisy instances removed.
Iteration 3: 0 noisy instances removed.
Iteration 4: 0 noisy instances removed.
Iteration 5: 0 noisy instances removed.
Iteration 6: 0 noisy instances removed.
Iteration 7: 0 noisy instances removed.
Iteration 8: 0 noisy instances removed.
Iteration 9: 0 noisy instances removed.
Iteration 10: 0 noisy instances removed.
Filter ORBoostFilter applied to dataset iris 

Call:
ORBoostFilter(formula = Species ~ ., data = iris, N = 10)

Parameters:
N: 10
d: 11
Naux: 20
useDecisionStump: FALSE

Results:
Number of removed instances: 6 (4 %)
Number of repaired instances: 0 (0 %)
[1] TRUE

NoiseFiltersR documentation built on May 2, 2019, 2:03 a.m.