Ensemble-based filter for removing label noise from a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.
formula
A formula describing the classification variable and the attributes to be used.
data, x
Data frame containing the training dataset to be filtered.
...
Optional parameters to be passed to other methods.
nfolds
Number of partitions in each iteration.
consensus
Logical. If FALSE, the majority voting scheme is used. If TRUE, the consensus voting scheme is applied.
p
Real number between 0 and 1. It sets the minimum proportion of original instances which must be tagged as noisy in order to go for another iteration.
s
Positive integer setting the stop criterion together with p.
y
Real number between 0 and 1. It sets the proportion of good instances which must be stored in each iteration.
classColumn
Positive integer indicating the column which contains the (factor of) classes. By default, the last column is considered.
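To make the roles of consensus and p more concrete, the following base R sketch may help. The vote matrix `wrong` and the helper `flag_noisy` are hypothetical names used only for illustration; they are not part of the package, and the exact comparison the package applies against the threshold is an assumption.

```r
# Hypothetical vote matrix: one row per instance, one column per base
# classifier (nfolds = 5). TRUE means that classifier misclassified the instance.
wrong <- rbind(
  c(TRUE,  TRUE,  TRUE,  TRUE,  TRUE),   # misclassified by all 5
  c(TRUE,  TRUE,  TRUE,  FALSE, FALSE),  # misclassified by 3 of 5
  c(FALSE, FALSE, TRUE,  FALSE, FALSE),  # misclassified by 1 of 5
  c(FALSE, FALSE, FALSE, FALSE, FALSE)   # correctly classified by all 5
)

# Illustrative helper (not part of the package API).
flag_noisy <- function(wrong, consensus = FALSE) {
  votes <- rowSums(wrong)
  if (consensus) {
    votes == ncol(wrong)      # every classifier must disagree with the label
  } else {
    votes > ncol(wrong) / 2   # more than half of the classifiers disagree
  }
}

majority_flags  <- flag_noisy(wrong, consensus = FALSE)  # flags rows 1 and 2
consensus_flags <- flag_noisy(wrong, consensus = TRUE)   # flags row 1 only

# Number of instances implied by p for a dataset with 150 original rows.
n_original <- 150
p <- 0.01
threshold <- p * n_original
```

Consensus voting is the more conservative choice: it flags fewer instances, so it is less likely to discard clean data but more likely to leave some noise behind.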
The full description of the method can be looked up in the provided references.
A base classifier is built in each of the nfolds partitions of data. These classifiers are then tested on the whole dataset, and the removal of noisy instances is decided via consensus or majority voting schemes. Finally, a proportion of good instances (i.e. those whose label agrees with all the base classifiers) is stored and removed for the next iteration. The process stops after s iterations with not enough (according to the proportion p) noisy instances removed. In this implementation, the base classifier used is C4.5.
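The iterative scheme described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: a nearest-centroid classifier stands in for C4.5 so the code needs no extra packages, the function names (`ipf_sketch`, `fit_centroids`, `predict_centroids`) and the `labels` argument are hypothetical, the stop counter is assumed to count consecutive unproductive iterations, and the stored good instances are assumed to be picked at random.

```r
# Stand-in base classifier (nearest class centroid); the package uses C4.5.
fit_centroids <- function(x, labels) {
  # One row of feature means per class present in the partition
  do.call(rbind, lapply(split(x, labels, drop = TRUE), colMeans))
}

predict_centroids <- function(centroids, x) {
  xm <- as.matrix(x)
  # Column k holds the squared distance of every instance to centroid k
  d <- matrix(
    unlist(lapply(seq_len(nrow(centroids)), function(k) {
      rowSums(sweep(xm, 2, centroids[k, ])^2)
    })),
    nrow = nrow(xm)
  )
  rownames(centroids)[max.col(-d, ties.method = "first")]
}

ipf_sketch <- function(x, labels, nfolds = 5, consensus = FALSE,
                       p = 0.01, s = 3, y = 0.5) {
  n_original <- nrow(x)
  keep <- seq_len(n_original)  # original row numbers still being inspected
  removed <- integer(0)
  stalled <- 0                 # assumed: consecutive unproductive iterations
  while (stalled < s && length(keep) > nfolds) {
    folds <- sample(rep_len(seq_len(nfolds), length(keep)))
    # wrong[i, f]: classifier built on partition f misclassifies instance i
    wrong <- sapply(seq_len(nfolds), function(f) {
      rows <- keep[folds == f]
      model <- fit_centroids(x[rows, , drop = FALSE], labels[rows])
      predict_centroids(model, x[keep, , drop = FALSE]) !=
        as.character(labels[keep])
    })
    votes <- rowSums(wrong)
    noisy <- if (consensus) votes == nfolds else votes > nfolds / 2
    removed <- c(removed, keep[noisy])
    stalled <- if (sum(noisy) <= p * n_original) stalled + 1 else 0
    # Set aside a proportion y of the good instances (agreed on by all classifiers)
    good_idx <- keep[votes == 0]
    good_idx <- good_idx[order(runif(length(good_idx)))]
    stored <- head(good_idx, floor(y * length(good_idx)))
    keep <- setdiff(keep[!noisy], stored)
  }
  list(remIdx = sort(removed),
       cleanIdx = setdiff(seq_len(n_original), removed))
}

set.seed(1)
res <- ipf_sketch(iris[, 1:4], iris$Species, s = 2)
```

Because the base classifier differs from C4.5, the instances flagged by this sketch will generally not match those reported by the package.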
An object of class filter, which is a list with seven components:
cleanData is a data frame containing the filtered dataset.
remIdx is a vector of integers indicating the indexes for removed instances (i.e. their row number with respect to the original data frame).
repIdx is a vector of integers indicating the indexes for repaired/relabelled instances (i.e. their row number with respect to the original data frame).
repLab is a factor containing the new labels for repaired instances.
parameters is a list containing the argument values.
call contains the original call to the filter.
extraInf is a character that includes additional interesting information not covered by previous items.
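As a rough illustration of how cleanData and remIdx relate, the snippet below builds a plain list by hand that mimics two of the documented components. It is a mock, not an object returned by the filter; the name `mock` is hypothetical, and the indexes are the ones printed by the example run on this page.

```r
# Hand-built stand-in for two components of the returned list (not a real
# object of class filter).
rem_idx <- c(71L, 120L, 134L, 135L)
mock <- list(
  cleanData = iris[-rem_idx, ],  # original data minus the removed rows
  remIdx    = rem_idx
)

# The removed rows can be recovered from the original data frame
removed_rows <- iris[mock$remIdx, ]

# Row names of cleanData keep the original row numbers
kept_rows <- as.integer(rownames(mock$cleanData))
```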
By means of a message, the number of noisy instances removed in each iteration is displayed in the console.
Khoshgoftaar T. M., Rebours P. (2007): Improving software quality prediction by noise filtering techniques. Journal of Computer Science and Technology, 22(3), 387-396.
Zhu X., Wu X., Chen Q. (2003, August): Eliminating class noise in large datasets. International Conference on Machine Learning (Vol. 3, pp. 920-927).
# Next example is not run in order to save time
## Not run:
data(iris)
# We fix a seed since there exists a random folds partition for the ensemble
set.seed(1)
out <- IPF(Species~., data = iris, s = 2)
summary(out, explicit = TRUE)
identical(out$cleanData, iris[setdiff(seq_len(nrow(iris)), out$remIdx), ])
## End(Not run)
Iteration 1: 3 noisy instances removed
Iteration 2: 0 noisy instances removed
Iteration 3: 1 noisy instances removed
Filter IPF applied to dataset iris
Call:
IPF(formula = Species ~ ., data = iris, s = 2)
Parameters:
nfolds: 5
consensus: FALSE
p: 0.01
s: 2
y: 0.5
Results:
Number of removed instances: 4 (2.666667 %)
Number of repaired instances: 0 (0 %)
Explicit indexes for removed instances:
71 120 134 135
[1] TRUE
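The printed output can be checked with a little arithmetic. The sketch below uses only the numbers shown in the console messages and the summary; the variable names are illustrative, and the non-strict comparison against the threshold is an assumption (a strict one gives the same result for these counts).

```r
n <- 150                 # rows in iris
p <- 0.01                # from the printed parameters
s <- 2                   # from the printed parameters
removed <- c(3, 0, 1)    # per-iteration counts from the console messages

# Number of instances that p corresponds to in this dataset
threshold <- p * n

# Iterations 2 and 3 fall below the threshold, which reaches s = 2
insufficient <- removed <= threshold
n_insufficient <- sum(insufficient)

# Share of removed instances, as reported in the summary
pct_removed <- 100 * sum(removed) / n
```

This is consistent with the filter stopping after the third iteration and with the reported 4 removed instances (2.666667 %).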