Mode Filter

Description

Similarity-based filter for removing or repairing label noise in a dataset as a preprocessing step for classification. For more information, see the 'Details' and 'References' sections.

Usage

## S3 method for class 'formula'
ModeFilter(formula, data, ...)

## Default S3 method:
ModeFilter(x, type = "classical", noiseAction = "repair",
  epsilon = 0.05, maxIter = 100, alpha = 1, beta = 1,
  classColumn = ncol(x), ...)

Arguments

formula

A formula describing the classification variable and the attributes to be used.

data, x

Data frame containing the training dataset to be filtered.

...

Optional parameters to be passed to other methods.

type

Character indicating the scheme to be used. It can be 'classical', 'iterative' or 'weighted'.

noiseAction

Character indicating what to do with noisy instances. It can be either 'remove' or 'repair'.

epsilon

If the 'iterative' type is used, the loop stops when the proportion of modified instances is less than or equal to this threshold.

maxIter

Maximum number of iterations for the 'iterative' type.

alpha

Parameter used in the computation of the similarity between two instances.

beta

It regulates the influence of the similarity metric in the estimation of a new label for an instance.

classColumn

Positive integer indicating the column that contains the class labels (a factor). By default, the last column is used.

Details

ModeFilter estimates the most appropriate class for each instance based on the similarity metric and the provided label. This can be addressed in three different ways (argument 'type'):

In the classical approach, every label is tried for every instance, and the one that maximizes a similarity-based metric is chosen. In the iterative approach, the same scheme is repeated until the proportion of modified instances is less than or equal to epsilon or the maximum number of iterations maxIter is reached. The weighted approach extends the classical one by assigning each instance a weight that quantifies the reliability of its label. This weight is used in the computation of the metric to be maximized.
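The classical and iterative schemes can be illustrated with a minimal base R sketch. It is not the code used by ModeFilter: the helper names mode_vote and iterate_vote are hypothetical, and both the Gaussian similarity exp(-alpha * d^2) and the scoring rule (beta times the neighbours' votes plus one vote for the provided label) are assumptions made for illustration, since the exact metric is not specified on this page.

```r
# Hypothetical helper (not part of the package): one pass of a
# similarity-weighted label vote, in the spirit of the classical scheme.
mode_vote <- function(X, y, alpha = 1, beta = 1) {
  X <- as.matrix(X)
  y <- as.factor(y)                       # as.factor keeps existing levels
  # ASSUMPTION: Gaussian similarity on squared Euclidean distance.
  S <- exp(-alpha * as.matrix(dist(X))^2)
  diag(S) <- 0                            # an instance does not vote for itself
  # Indicator matrix: row i has a 1 in the column of its current label.
  Y <- diag(nlevels(y))[as.integer(y), , drop = FALSE]
  # ASSUMPTION: score = beta * neighbours' votes + 1 for the provided label.
  score <- beta * (S %*% Y) + Y
  # Try all labels for all instances and keep the best-scoring one.
  best <- max.col(score, ties.method = "first")
  factor(levels(y)[best], levels = levels(y))
}

# Hypothetical helper: repeat the vote until few labels change
# (proportion <= epsilon) or maxIter passes have been made.
iterate_vote <- function(X, y, epsilon = 0.05, maxIter = 100, ...) {
  y <- as.factor(y)
  for (iter in seq_len(maxIter)) {
    new_y <- mode_vote(X, y, ...)
    changed <- mean(new_y != y)           # proportion of modified instances
    y <- new_y
    if (changed <= epsilon) break
  }
  y
}

# Toy data: two tight clusters; the third label is deliberately wrong.
X <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1),
           c(5, 5), c(5.1, 5), c(5, 5.1))
y <- factor(c("a", "a", "b", "b", "b", "b"))

new_y <- mode_vote(X, y)
which(new_y != y)                         # rows whose label would be repaired
final_y <- iterate_vote(X, y)
```

In this toy case only the mislabelled third instance changes, because its two close neighbours outvote its provided label. Setting beta closer to 0 in the sketch would make the provided label dominate and leave it untouched.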

Value

An object of class filter, which is a list with seven components:

  • cleanData is a data frame containing the filtered dataset.

  • remIdx is a vector of integers indicating the indexes for removed instances (i.e. their row number with respect to the original data frame).

  • repIdx is a vector of integers indicating the indexes for repaired/relabelled instances (i.e. their row number with respect to the original data frame).

  • repLab is a factor containing the new labels for repaired instances.

  • parameters is a list containing the argument values.

  • call contains the original call to the filter.

  • extraInf is a character vector with additional information not covered by the previous items.
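As an illustration of how these components might be inspected after a repair run, the following sketch uses only the arguments and components documented above. It assumes that ModeFilter is provided by the NoiseFiltersR package and that the package is installed; the guard skips the example otherwise. No particular output is implied.

```r
# ASSUMPTION: ModeFilter comes from the NoiseFiltersR package.
if (requireNamespace("NoiseFiltersR", quietly = TRUE)) {
  # Default method: the class labels of iris are in column 5 (Species).
  out <- NoiseFiltersR::ModeFilter(iris, type = "classical",
                                   noiseAction = "repair", classColumn = 5)

  # Row numbers (with respect to iris) of the relabelled instances.
  print(out$repIdx)

  # Original labels versus the labels assigned by the filter.
  print(table(original = iris$Species[out$repIdx], repaired = out$repLab))

  # Size of the filtered data frame and the rows reported as removed.
  print(dim(out$cleanData))
  print(out$remIdx)
}
```

Cross-tabulating the original and repaired labels is one quick way to see which classes the filter tends to confuse.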

References

Du W., Urahama K. (2010, November): Error-correcting semi-supervised pattern recognition with mode filter on graphs. In Aware Computing (ISAC), 2010 2nd International Symposium on (pp. 6-11). IEEE.

Examples

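# This example is not run because it can be rather slow in some cases
## Not run: 
   data(iris)
   # Remove the instances flagged as noisy by the classical scheme
   out <- ModeFilter(Species ~ ., data = iris, type = "classical", noiseAction = "remove")
   print(out)
   # Check that cleanData is the original data without the removed rows
   identical(out$cleanData, iris[setdiff(seq_len(nrow(iris)), out$remIdx), ])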

## End(Not run)