Similarity-based filter for removing or repairing label noise from a dataset as a preprocessing step of classification. For more information, see 'Details' and 'References' sections.
formula
A formula describing the classification variable and the attributes to be used.
data, x
Data frame containing the training dataset to be filtered.
...
Optional parameters to be passed to other methods.
threshold
Real number between 0 and 1. It sets the limit between good and suspicious instances. Its default value is 0.25.
noiseAction
Character string, either 'remove' or 'hybrid'. It determines what to do with noisy instances. By default, it is set to 'remove'.
classColumn
Positive integer indicating the column which contains the (factor of) classes. By default, the last column is considered.
EWF builds up a Relative Neighborhood Graph (RNG) from the dataset. Then, it identifies
as 'suspicious' those instances with a significant value of their local cut edge weight statistic, which
intuitively means that they are surrounded by examples from a different class.
Namely, the aforementioned statistic is the sum of the weights of the edges joining
the instance (in the RNG) with instances from a different class.
Under the null hypothesis of the class label being independent of
the event 'being neighbors in the RNG', the distribution of this statistic can be approximated by a
Gaussian one. Then, the p-value for the observed value is computed and compared with the
provided threshold.
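The statistic described above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper names rng_adjacency and cut_edge_pvalues are hypothetical, every edge is assumed to have weight 1, the null distribution is assumed to treat each neighbor as an independent draw from the global class proportions, and the p-value is assumed to be left-tailed (so small values mean 'fewer cut edges than chance' and large values mark suspects).

```r
# Hypothetical helper: adjacency matrix of the Relative Neighborhood Graph.
# Points i and j are linked unless some third point k is strictly closer
# to both of them than they are to each other.
rng_adjacency <- function(X) {
  D <- as.matrix(dist(X))
  n <- nrow(D)
  A <- matrix(FALSE, n, n)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      others <- setdiff(seq_len(n), c(i, j))
      blocked <- any(pmax(D[i, others], D[j, others]) < D[i, j])
      A[i, j] <- A[j, i] <- !blocked
    }
  }
  A
}

# Hypothetical helper: local cut edge weight statistic and its
# normal-approximation p-value (ASSUMPTION: left tail, unit weights by default).
cut_edge_pvalues <- function(A, y, W = A * 1) {
  y <- as.factor(y)
  prior <- table(y) / length(y)                  # global class proportions
  pi_own <- as.numeric(prior[as.character(y)])   # proportion of each instance's own class
  differs <- outer(as.character(y), as.character(y), "!=")
  J <- rowSums(W * differs)                      # sum of weights of cut edges
  mu <- (1 - pi_own) * rowSums(W)                # mean of J under the null
  sigma <- sqrt(pi_own * (1 - pi_own) * rowSums(W^2))
  pnorm(J, mean = mu, sd = sigma)
}

# Toy data: one attribute, third instance carries a label unlike its neighbors.
x <- c(1, 2, 3, 4, 10, 11, 12, 13)
y <- factor(c("a", "a", "b", "a", "b", "b", "b", "b"))

A <- rng_adjacency(x)
p <- cut_edge_pvalues(A, y)
suspicious <- p >= 0.25   # ASSUMPTION about how the threshold is applied
print(round(p, 3))
```

On a dataset this small the Gaussian approximation is crude, so the sketch is only meant to show how the pieces fit together; the direction of the test and the weighting scheme used by EWF may differ.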
To handle 'suspicious' instances there are two approaches ('remove' or 'hybrid'), and the argument 'noiseAction' determines which one to use. With 'remove', every suspect is removed from the dataset. With the 'hybrid' approach, an instance is removed if it does not have good (i.e. non-suspicious) RNG-neighbors. Otherwise, it is relabelled with the majority class among its good RNG-neighbors.
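The two approaches can be sketched as follows. This is an illustration under stated assumptions rather than the package's code: handle_suspects is a hypothetical helper, the RNG adjacency matrix and the suspicious flags are supplied by hand, and ties in the majority vote are broken in favor of the first factor level.

```r
# Hypothetical helper mirroring the 'remove' and 'hybrid' rules described above.
handle_suspects <- function(A, y, suspicious, noiseAction = c("remove", "hybrid")) {
  noiseAction <- match.arg(noiseAction)
  y <- as.factor(y)
  remIdx <- integer(0)
  repIdx <- integer(0)
  repLab <- character(0)
  for (i in which(suspicious)) {
    good <- which(A[i, ] & !suspicious)          # non-suspicious RNG-neighbors
    if (noiseAction == "remove" || length(good) == 0) {
      remIdx <- c(remIdx, i)                     # no good neighbors: drop the instance
    } else {
      votes <- table(y[good])
      repIdx <- c(repIdx, i)
      # ASSUMPTION: which.max picks the first level when votes are tied
      repLab <- c(repLab, names(votes)[which.max(votes)])
    }
  }
  list(remIdx = remIdx, repIdx = repIdx,
       repLab = factor(repLab, levels = levels(y)))
}

# Hand-made toy graph: a path 1-2-3-4-5-6.
A <- matrix(FALSE, 6, 6)
A[cbind(1:5, 2:6)] <- TRUE
A <- A | t(A)
y <- factor(c("a", "a", "b", "a", "b", "b"))
suspicious <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

res_remove <- handle_suspects(A, y, suspicious, "remove")
res_hybrid <- handle_suspects(A, y, suspicious, "hybrid")
str(res_hybrid)
```

In this toy case instance 6 has only a suspicious neighbor, so the hybrid rule removes it, while instances 3 and 5 each have at least one good neighbor and are relabelled instead.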
An object of class filter, which is a list with seven components:
cleanData
is a data frame containing the filtered dataset.
remIdx
is a vector of integers indicating the indexes for
removed instances (i.e. their row number with respect to the original data frame).
repIdx
is a vector of integers indicating the indexes for
repaired/relabelled instances (i.e. their row number with respect to the original data frame).
repLab
is a factor containing the new labels for repaired instances.
parameters
is a list containing the argument values.
call
contains the original call to the filter.
extraInf
is a character that includes additional interesting
information not covered by previous items.
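To make the indexing convention concrete, the sketch below rebuilds a filtered data frame from remIdx, repIdx and repLab. The list out is a hand-made stand-in, not real EWF output, and the sketch assumes that cleanData is the original data with the repaired labels applied and the removed rows dropped.

```r
# Toy training data; 'class' plays the role of the class column.
train <- data.frame(
  x = c(1, 2, 3, 4, 10, 11),
  class = factor(c("a", "a", "b", "a", "b", "b"))
)

# Hand-made stand-in for the relevant components of a 'filter' object.
out <- list(
  remIdx = 6L,
  repIdx = 3L,
  repLab = factor("a", levels = levels(train$class))
)

# Indexes refer to row numbers of the original data frame,
# so relabel first and drop rows afterwards.
clean <- train
clean$class[out$repIdx] <- out$repLab
clean <- clean[setdiff(seq_len(nrow(clean)), out$remIdx), ]
print(clean)
```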
Muhlenbach F., Lallich S., Zighed D. A. (2004): Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1), 89-109.