dataFiller: Missing Observations Filling Function



Description

Fills in the missing observations in a dataset by exploiting similarities between cases.

Usage

dataFiller(data, NAstring = NA)

Arguments

data

a dataset that contains missing observations in some cases

NAstring

a character or string that denotes missing values in the input dataset
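
The function body listed in the Examples section does not appear to reference NAstring, so it may be safest to recode any custom missing-value marker to NA before the call. A minimal sketch, assuming a hypothetical marker "?" and made-up column values (neither comes from the package):

```r
# Hypothetical raw input in which "?" marks a missing observation (assumption).
raw <- data.frame(v1 = c("5.1", "?", "4.7"),
                  v2 = c("1.4", "1.3", "?"),
                  stringsAsFactors = FALSE)

# Recode the marker to NA, then convert every column back to numeric.
raw[raw == "?"] <- NA
clean <- data.frame(lapply(raw, as.numeric))
```

The resulting data frame uses ordinary NA values, so it can be passed with the default NAstring = NA.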

Details

Each case with missing observations is filled with the median of the 10 cases most similar to it. Missing values in the same column among those 10 cases are removed before the median is computed. Similarity is measured by the Euclidean distance between standardized cases.
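
The procedure can be illustrated with a simplified base-R sketch. It is not the package's implementation: the helper name fill_with_neighbours and the argument k are hypothetical, it handles numeric columns only, and it standardizes with scale() instead of calling cluster::daisy, so distances can differ from those used by dataFiller.

```r
# Hypothetical helper, not part of knnGarden: fill each missing value with the
# median of the k nearest cases (numeric columns only).
fill_with_neighbours <- function(data, k = 10) {
  z <- scale(data)                      # standardize each column (NAs are ignored)
  filled <- data
  for (r in which(!complete.cases(data))) {
    # Euclidean distance from case r to every case, using only the columns
    # that are observed in both cases.
    d <- apply(z, 1, function(other) {
      ok <- !is.na(z[r, ]) & !is.na(other)
      sqrt(sum((z[r, ok] - other[ok])^2))
    })
    d[r] <- Inf                         # a case is not its own neighbour
    nearest <- order(d)[seq_len(min(k, nrow(data) - 1))]
    for (col in which(is.na(data[r, ]))) {
      # Missing values among the neighbours are dropped before taking the median.
      filled[r, col] <- median(data[nearest, col], na.rm = TRUE)
    }
  }
  filled
}

toy <- data.frame(a = c(1, 2, NA, 4, 5), b = c(2, 4, 6, 8, 10))
result <- fill_with_neighbours(toy, k = 2)
# The missing a in row 3 becomes 3, the median of its two nearest cases (a = 2 and a = 4).
```

With k = 2 the two nearest cases to row 3 are rows 2 and 4, because only column b is available for the distance; dataFiller itself always uses 10 neighbours.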

Value

A complete data set with missing observations filled will be returned.

Note

The cases with missing values in the input dataset are printed to the screen rather than returned. Only the complete data set, with missing observations filled, is returned.

Author(s)

Boxian Wei (the ideas are inspired by Luis Torgo, with thanks)

References

Luis Torgo (2003). Data Mining with R: Learning by Case Studies. LIACC-FEP, University of Porto.

See Also

knnMCN, knnVCN

Examples


## Define Data 
library(knnGarden)
data(iris)
v1=c(iris[1:4,3],NA,iris[6:10,3])
v2=iris[101:110,4]
v3=iris[101:110,1]
v4=c(iris[11:18,3],NA,iris[20,3])
data1=data.frame(v1,v2,v3,v4)

## Call Function
data2=dataFiller(data1)

## The function is currently defined as
function (data, NAstring = NA) 
{
    central.value <- function(x) {
        if (is.numeric(x)) 
            median(x, na.rm = T)
        else if (is.factor(x)) 
            levels(x)[which.max(table(x))]
        else {
            f <- as.factor(x)
            levels(f)[which.max(table(f))]
        }
    }
    dist.mtx <- as.matrix(daisy(data, stand = T))
    ShowMissing = NULL
    ShowMissing = data[which(!complete.cases(data)), ]
    for (r in which(!complete.cases(data))) data[r, which(is.na(data[r, 
        ]))] <- apply(data.frame(data[c(as.integer(names(sort(dist.mtx[r, 
        ])[2:11]))), which(is.na(data[r, ]))]), 2, central.value)
    cat("the missing case(s) in the orignal dataset ", "\n\n")
    print(ShowMissing)
    cat("\n\n")
    return(data)
  }

knnGarden documentation built on May 2, 2019, 11:02 a.m.