knnImputation: Fill in NA values with the values of the nearest neighbours


View source: R/utils.R

Description

Fills in all NA values using the k nearest neighbours of each case that has NA values. By default it fills in each unknown with an average of the neighbours' values, weighted by their distance to the case. If meth='median' it uses the median (or, for factors, the most frequent value) instead.

Usage

knnImputation(data, k = 10, scale = T, meth = "weighAvg",
              distData = NULL)

Arguments

data

A data frame with the data set

k

The number of nearest neighbours to use (defaults to 10)

scale

Boolean indicating whether the data should be scaled before finding the nearest neighbours (defaults to T)

meth

String indicating the method used to calculate the value to fill in each NA. Available values are 'median' or 'weighAvg' (the default).

distData

Optionally, a data frame containing the data set that should be used to find the neighbours. This is useful when filling in NA values on a test set, where you should use only information from the training set. Defaults to NULL, which means that the neighbours are searched for in data.
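A typical use of distData is to impute a held-out set using the training data only. The sketch below assumes the DMwR package (and its algae data set) is installed. The 150/50 row split and the variable names train, test and testFilled are arbitrary choices made for illustration.

```r
# Runs only if DMwR is available; the split below is an arbitrary example.
has_dmwr <- requireNamespace("DMwR", quietly = TRUE)

if (has_dmwr) {
  data(algae, package = "DMwR")
  train <- algae[1:150, ]
  test  <- algae[151:200, ]

  # Neighbours for the NAs in `test` are searched among the rows of `train` only.
  testFilled <- DMwR::knnImputation(test, k = 10, distData = train)
}
```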

Details

This function uses the k-nearest neighbours to fill in the unknown (NA) values in a data set. For each case with any NA value it will search for its k most similar cases and use the values of these cases to fill in the unknowns.
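The neighbour search described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the toy data frame toy, the helper name nearest_complete, and the choice to measure distance only over the columns observed in the target case are assumptions made for the example. It handles numeric columns only and does no scaling.

```r
# Hypothetical toy data: row 3 has an NA in column y.
toy <- data.frame(
  x = c(1.0, 1.1, 1.2, 5.0, 5.1),
  y = c(10,  11,  NA,  50,  51),
  z = c(0.5, 0.6, 0.55, 3.0, 3.1)
)

# Hypothetical helper: row indices of the k complete cases closest to row i,
# using only the columns that are observed in row i.
nearest_complete <- function(data, i, k) {
  m <- as.matrix(data)
  known <- which(!is.na(m[i, ]))       # columns observed in the target case
  cand <- which(complete.cases(m))     # only complete cases can be neighbours
  diffs <- sweep(m[cand, known, drop = FALSE], 2, m[i, known])
  d <- sqrt(rowSums(diffs^2))          # Euclidean distance to each candidate
  ord <- order(d)[seq_len(min(k, length(cand)))]
  list(index = cand[ord], dist = d[ord])
}

nn <- nearest_complete(toy, i = 3, k = 2)
nn$index   # rows chosen as neighbours of row 3
```

The values of column y in the rows returned by nn$index are what a fill-in rule would then combine.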

If meth='median' the function will use either the median (for numeric variables) or the most frequent value (for factors) of the neighbours to fill in the NAs. If meth='weighAvg' the function will use a weighted average of the values of the neighbours. The weights are given by exp(-dist(k,x)), where dist(k,x) is the Euclidean distance between the case with NAs (x) and the neighbour k.
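To make the two fill-in rules concrete, the following base-R sketch computes both for one numeric variable and shows the most-frequent-value rule for a factor. The neighbour values and distances are made-up numbers and the helper name most_frequent is hypothetical. The weights follow the exp(-dist) formula given above.

```r
# Made-up values of one numeric variable for k = 3 neighbours,
# and their Euclidean distances to the case being filled in.
vals  <- c(10, 12, 20)
dists <- c(0.1, 0.5, 2.0)

# meth = 'weighAvg': closer neighbours get larger weights exp(-dist).
w <- exp(-dists)
weighted_fill <- sum(w * vals) / sum(w)

# meth = 'median' for a numeric variable.
median_fill <- median(vals)

# meth = 'median' for a factor: the most frequent level among the neighbours.
most_frequent <- function(f) names(which.max(table(f)))
factor_fill <- most_frequent(factor(c("low", "high", "low")))
```

Because the weights decay exponentially with distance, the distant neighbour with value 20 contributes little, so the weighted fill stays close to the two nearest values.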

Value

A data frame without NA values

Author(s)

Luis Torgo ltorgo@dcc.fc.up.pt

References

Torgo, L. (2010) Data Mining using R: learning with case studies, CRC Press (ISBN: 9781439810187).

http://www.dcc.fc.up.pt/~ltorgo/DataMiningWithR

See Also

centralImputation, centralValue, complete.cases, na.omit

Examples

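library(DMwR)                        # provides knnImputation and the algae data set
data(algae)
cleanAlgae <- knnImputation(algae)   # defaults: k = 10, meth = "weighAvg"
summary(cleanAlgae)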

Example output

Loading required package: lattice
Loading required package: grid
    season       size       speed         mxPH            mnO2       
 autumn:40   large :45   high  :84   Min.   :5.600   Min.   : 1.500  
 spring:53   medium:84   low   :33   1st Qu.:7.700   1st Qu.: 7.775  
 summer:45   small :71   medium:83   Median :8.055   Median : 9.800  
 winter:62                           Mean   :8.011   Mean   : 9.129  
                                     3rd Qu.:8.400   3rd Qu.:10.800  
                                     Max.   :9.700   Max.   :13.400  
       Cl               NO3              NH4                oPO4       
 Min.   :  0.222   Min.   : 0.050   Min.   :    5.00   Min.   :  1.00  
 1st Qu.: 10.542   1st Qu.: 1.312   1st Qu.:   38.78   1st Qu.: 15.37  
 Median : 32.178   Median : 2.675   Median :  103.17   Median : 40.15  
 Mean   : 42.661   Mean   : 3.277   Mean   :  498.62   Mean   : 73.60  
 3rd Qu.: 57.775   3rd Qu.: 4.421   3rd Qu.:  227.89   3rd Qu.:100.50  
 Max.   :391.500   Max.   :45.650   Max.   :24064.00   Max.   :564.60  
      PO4             Chla             a1              a2        
 Min.   :  1.0   Min.   :  0.2   Min.   : 0.00   Min.   : 0.000  
 1st Qu.: 40.5   1st Qu.:  2.0   1st Qu.: 1.50   1st Qu.: 0.000  
 Median :103.3   Median :  5.2   Median : 6.95   Median : 3.000  
 Mean   :137.7   Mean   : 13.4   Mean   :16.92   Mean   : 7.458  
 3rd Qu.:214.0   3rd Qu.: 17.2   3rd Qu.:24.80   3rd Qu.:11.375  
 Max.   :771.6   Max.   :110.5   Max.   :89.80   Max.   :72.600  
       a3               a4               a5               a6        
 Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
 Median : 1.550   Median : 0.000   Median : 1.900   Median : 0.000  
 Mean   : 4.309   Mean   : 1.992   Mean   : 5.064   Mean   : 5.964  
 3rd Qu.: 4.925   3rd Qu.: 2.400   3rd Qu.: 7.500   3rd Qu.: 6.925  
 Max.   :42.800   Max.   :44.600   Max.   :44.400   Max.   :77.600  
       a7        
 Min.   : 0.000  
 1st Qu.: 0.000  
 Median : 1.000  
 Mean   : 2.495  
 3rd Qu.: 2.400  
 Max.   :31.600  

DMwR documentation built on May 1, 2019, 9:17 p.m.