Edge Weight Filter
Description
Similarity-based filter for removing or repairing label noise in a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.
Usage
Arguments
formula 
A formula describing the classification variable and the attributes to be used. 
data, x 
Data frame containing the training dataset to be filtered. 
... 
Optional parameters to be passed to other methods. 
threshold 
Real number between 0 and 1. It sets the limit between good and suspicious instances. Its default value is 0.25. 
noiseAction 
Character being either 'remove' or 'hybrid'. It determines what to do with noisy instances. By default, it is set to 'remove'. 
classColumn 
Positive integer indicating the column which contains the (factor of) classes. By default, the last column is considered. 
Details
EWF builds a Relative Neighborhood Graph (RNG) from the dataset. It then flags as 'suspicious' those instances with a significant value of their local cut edge weight statistic, which intuitively means that they are surrounded by examples from a different class. Specifically, the statistic is the sum of the weights of the RNG edges that join the instance to instances of a different class. Under the null hypothesis that the class label is independent of the event 'being neighbors in the RNG', the distribution of this statistic can be approximated by a Gaussian. The p-value of the observed value is then computed and compared with the provided threshold.
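The statistic and its Gaussian approximation can be illustrated with a minimal base R sketch. It is not the package's implementation: the toy data, the unit edge weights, the use of global class proportions under the null hypothesis, and the upper-tail p-value compared as `p_value < threshold` are all assumptions made for illustration.

```r
# Toy data: two well-separated clusters; row 4 lies in the first cluster
# but carries the label of the second one.
X <- rbind(c(0, 0), c(0, 1), c(1, 0), c(1, 1),
           c(5, 5), c(5, 6), c(6, 5), c(6, 6))
y <- factor(c("a", "a", "a", "b", "b", "b", "b", "b"))
threshold <- 0.25

# Relative Neighborhood Graph: i and j are joined unless some third point k
# is strictly closer to both of them than they are to each other.
D <- as.matrix(dist(X))
n <- nrow(X)
A <- matrix(FALSE, n, n)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    others <- setdiff(seq_len(n), c(i, j))
    blocked <- any(pmax(D[i, others], D[j, others]) < D[i, j])
    A[i, j] <- A[j, i] <- !blocked
  }
}

# Assumption: every RNG edge has weight 1.
W <- A * 1

# Local cut edge weight statistic: total weight of edges to other classes.
lab <- as.character(y)
diff_class <- outer(lab, lab, "!=")
J <- rowSums(W * diff_class)

# Null hypothesis: each neighbor has a different class with probability
# 1 - (global proportion of the instance's own class), independently.
p_diff <- 1 - as.numeric(table(y)[lab]) / n
mu <- p_diff * rowSums(W)
sigma <- sqrt(p_diff * (1 - p_diff) * rowSums(W^2))

# Assumption: upper-tail p-value; small values mean more cut-edge weight
# than chance would explain.
p_value <- pnorm(J, mean = mu, sd = sigma, lower.tail = FALSE)
suspicious <- p_value < threshold
which(suspicious)
```

With other weighting schemes only `W` changes; the mean and variance formulas keep the same form.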
There are two ways to handle 'suspicious' instances ('remove' or 'hybrid'), and the argument 'noiseAction' determines which one is used. With 'remove', every suspect is removed from the dataset. With 'hybrid', a suspect is removed if it has no good (i.e. non-suspicious) RNG neighbors; otherwise, it is relabelled with the majority class among its good RNG neighbors.
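The two strategies can be sketched as follows. `handle_suspects` is a hypothetical helper written for this illustration, not a function of the package; the adjacency matrix, labels and suspicion flags are made-up inputs, and breaking majority ties in favor of the first factor level is an assumption.

```r
# Hypothetical helper: decide what happens to each suspicious instance.
handle_suspects <- function(y, A, suspicious,
                            noiseAction = c("remove", "hybrid")) {
  noiseAction <- match.arg(noiseAction)
  suspects <- which(suspicious)
  if (noiseAction == "remove") {
    # Every suspect is dropped; nothing is relabelled.
    return(list(remIdx = suspects, repIdx = integer(0),
                repLab = factor(character(0), levels = levels(y))))
  }
  remIdx <- integer(0)
  repIdx <- integer(0)
  newLab <- character(0)
  for (i in suspects) {
    good <- which(A[i, ] & !suspicious)   # non-suspicious RNG neighbors
    if (length(good) == 0) {
      remIdx <- c(remIdx, i)              # no good neighbor: remove
    } else {
      votes <- table(y[good])             # majority class among good neighbors
      repIdx <- c(repIdx, i)
      newLab <- c(newLab, names(votes)[which.max(votes)])
    }
  }
  list(remIdx = remIdx, repIdx = repIdx,
       repLab = factor(newLab, levels = levels(y)))
}

# Made-up example: a chain graph 1-2-3-4-5-6 with three suspects.
A <- matrix(FALSE, 6, 6)
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5), c(5, 6))
A[edges] <- TRUE
A[edges[, 2:1]] <- TRUE
y <- factor(c("a", "a", "b", "a", "b", "b"))
suspicious <- c(FALSE, FALSE, TRUE, FALSE, TRUE, TRUE)

res_remove <- handle_suspects(y, A, suspicious, "remove")
res_hybrid <- handle_suspects(y, A, suspicious, "hybrid")
```

In this toy case instances 3 and 5 each have at least one good neighbor and are relabelled, while instance 6 is only connected to another suspect and is removed under 'hybrid'.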
Value
An object of class filter, which is a list with seven components:

cleanData
is a data frame containing the filtered dataset. 
remIdx
is a vector of integers indicating the indexes for removed instances (i.e. their row number with respect to the original data frame). 
repIdx
is a vector of integers indicating the indexes for repaired/relabelled instances (i.e. their row number with respect to the original data frame). 
repLab
is a factor containing the new labels for repaired instances. 
parameters
is a list containing the argument values. 
call
contains the original call to the filter. 
extraInf
is a character string containing additional information not covered by the previous items.
References
Muhlenbach F., Lallich S., Zighed D. A. (2004): Identifying and handling mislabelled instances. Journal of Intelligent Information Systems, 22(1), 89-109.
Examples
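A minimal usage sketch, assuming the filter is exported as `EWF` by the NoiseFiltersR package and that the package is installed. The iris subsample and the choice of attributes are illustrative assumptions, and the calls are guarded so the snippet does nothing if the package is missing.

```r
# Illustrative training set: 20 rows per species from iris.
data(iris)
trainData <- iris[c(1:20, 51:70, 101:120), ]

if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  # Formula interface: class on the left, attributes on the right.
  out <- NoiseFiltersR::EWF(Species ~ Petal.Length + Sepal.Length,
                            data = trainData, noiseAction = "hybrid")
  print(out)

  # Components described in the 'Value' section.
  print(out$remIdx)        # rows removed from trainData
  print(out$repIdx)        # rows whose label was repaired
  print(out$repLab)        # new labels for the repaired rows
  str(out$cleanData)       # the filtered data frame

  # Default interface: data frame plus the position of the class column.
  out2 <- NoiseFiltersR::EWF(trainData, threshold = 0.25,
                             noiseAction = "remove", classColumn = 5)
}
```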