ORBoostFilter: Outlier Removal Boosting Filter
In NoiseFiltersR: Label Noise Filters for Data Preprocessing in Classification


Ensemble-based filter for removing label noise from a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.

## S3 method for class 'formula'
ORBoostFilter(formula, data, ...)

## Default S3 method:
ORBoostFilter(x, N = 20, d = 11, Naux = max(20, N),
  useDecisionStump = FALSE, classColumn = ncol(x), ...)

`formula`	A formula describing the classification variable and the attributes to be used.
`data, x`	Data frame containing the training dataset to be filtered.
`...`	Optional parameters to be passed to other methods.
`N`	Number of boosting iterations.
`d`	Threshold for removing noisy instances. The authors recommend setting it between 3 and 20. If it is set to `NULL`, the optimal threshold is chosen according to the procedure described in Karmaker & Kwek. However, this can be very time-consuming, and in most cases it has little effect on the final result.
`Naux`	Number of boosting iterations for AdaBoost when computing the optimal threshold 'd'.
`useDecisionStump`	If `TRUE`, a decision stump is used as the weak classifier. Otherwise (default), naive Bayes is applied. Note that decision stumps are not appropriate for multi-class problems.
`classColumn`	Positive integer indicating the column which contains the (factor of) classes. By default, the last column is considered.
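As an illustration of the default method and the `classColumn` argument, the following sketch moves the class to the first column of `iris` and tells the filter where to find it. The data frame `x` is introduced here only for the example, and the call is guarded so that it runs only when NoiseFiltersR is installed.

```r
data(iris)

# Put the class (Species) in the first column instead of the last one
x <- iris[, c(5, 1:4)]

if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  # Default S3 method: pass the data frame and the position of the class column
  out <- NoiseFiltersR::ORBoostFilter(x, N = 10, classColumn = 1)
  # Row numbers (with respect to x) of the instances flagged as noisy
  print(out$remIdx)
}
```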

The full description of the ORBoostFilter method is given in Karmaker & Kwek. In general terms, a weak classifier is built in each iteration, and misclassified instances have their weight increased for the next round. An instance is removed when its weight exceeds the threshold `d`, which happens when it has been misclassified in several consecutive rounds.
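The interplay between reweighting and the threshold can be sketched with a toy simulation in base R. This is not the package's implementation: the function `simulate_removal`, the initial weights of 1, the fixed multiplicative factor `boost`, and the misclassification matrix `miss` are all assumptions made for illustration, and a real boosting round would also fit a weak classifier and derive the weight update from its error.

```r
# Toy simulation of threshold-based removal under boosting-style reweighting.
# Assumptions (not taken from the package source): weights start at 1, a
# misclassified instance has its weight multiplied by `boost`, and an
# instance is dropped as soon as its weight exceeds `d`.
simulate_removal <- function(miss, d = 11, boost = 2) {
  # miss: logical matrix, rows = instances, columns = boosting rounds;
  # TRUE means the instance was misclassified in that round.
  n <- nrow(miss)
  w <- rep(1, n)            # current instance weights
  active <- rep(TRUE, n)    # instances still in the training set
  removed_per_round <- integer(ncol(miss))
  for (t in seq_len(ncol(miss))) {
    hit <- active & miss[, t]
    w[hit] <- w[hit] * boost          # raise weight of misclassified instances
    drop <- active & w > d            # weight above threshold: treat as noise
    active[drop] <- FALSE
    removed_per_round[t] <- sum(drop)
  }
  list(removed = which(!active), removed_per_round = removed_per_round)
}

# Hypothetical pattern: instance 1 is misclassified once, instance 2 never,
# instance 3 in every round.
miss <- rbind(c(TRUE, FALSE, FALSE, FALSE, FALSE),
              rep(FALSE, 5),
              rep(TRUE, 5))
res <- simulate_removal(miss, d = 11, boost = 2)
res$removed            # instance 3
res$removed_per_round  # removal happens in round 4 (weight 16 > 11)
```

Under these assumptions, an instance misclassified only occasionally never reaches the threshold, while one misclassified repeatedly is removed after a few rounds.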

An object of class filter, which is a list with seven components:

cleanData is a data frame containing the filtered dataset.
remIdx is a vector of integers indicating the indexes for removed instances (i.e. their row number with respect to the original data frame).
repIdx is a vector of integers indicating the indexes for repaired/relabelled instances (i.e. their row number with respect to the original data frame).
repLab is a factor containing the new labels for repaired instances.
parameters is a list containing the argument values.
call contains the original call to the filter.
extraInf is a character vector with additional information not covered by the previous items.
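The relationship between `remIdx` and `cleanData` can be checked in base R without running the filter. In the sketch below the values in `remIdx` are hypothetical and were not produced by ORBoostFilter; only the indexing logic is the point.

```r
data(iris)

# Hypothetical row numbers of removed instances (illustrative only)
remIdx <- c(71L, 84L, 107L)

# cleanData keeps every row whose number is not in remIdx
cleanData <- iris[setdiff(seq_len(nrow(iris)), remIdx), ]
nrow(cleanData)

# The removed instances themselves can be inspected the same way
removed <- iris[remIdx, ]

# Negative indexing gives the same result only when remIdx is non-empty:
# iris[-integer(0), ] selects zero rows, so setdiff() is the safer form.
identical(cleanData, iris[-remIdx, ])
nrow(iris[-integer(0), ])
```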

The number of noisy instances removed in each iteration is reported in the console by means of a message.

Karmaker A., Kwek S. (2005, November): A boosting approach to remove class label noise. In Hybrid Intelligent Systems, 2005. HIS'05. Fifth International Conference on (pp. 6-pp). IEEE.

Freund Y., Schapire R. E. (1997): A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1), 119-139.

# Next example is not run in order to save time
## Not run: 
data(iris)
out <- ORBoostFilter(Species~., data = iris, N = 10)
summary(out)
identical(out$cleanData, iris[setdiff(1:nrow(iris),out$remIdx),])

## End(Not run)

Iteration 1: 0 noisy instances removed.
Iteration 2: 6 noisy instances removed.
Iteration 3: 0 noisy instances removed.
Iteration 4: 0 noisy instances removed.
Iteration 5: 0 noisy instances removed.
Iteration 6: 0 noisy instances removed.
Iteration 7: 0 noisy instances removed.
Iteration 8: 0 noisy instances removed.
Iteration 9: 0 noisy instances removed.
Iteration 10: 0 noisy instances removed.
Filter ORBoostFilter applied to dataset iris 

Call:
ORBoostFilter(formula = Species ~ ., data = iris, N = 10)

Parameters:
N: 10
d: 11
Naux: 20
useDecisionStump: FALSE

Results:
Number of removed instances: 6 (4 %)
Number of repaired instances: 0 (0 %)
[1] TRUE