Source: https://github.com/neerajdhanraj/imputeTestbench/blob/master/Manuscript%20imputeTestbench.pdf

The default dataset if dataIn = NULL is nottem, a time series object of average air temperatures recorded at Nottingham Castle from 1920 to 1930. This dataset is included with the base datasets package.

library(imputeTestbench)
set.seed(123)

a <- impute_errors()
a

The errprof object is a list with seven elements. The first element, Parameter, is a character string of the error metric used for comparing imputation methods. The second element, MissingPercent, is a numeric vector of the missing percentages that were evaluated in the input dataset. The remaining five elements show the average error for each imputation method at each interval of missing data in MissingPercent. The averages at each interval are based on the repetitions specified in the initial call to impute_errors() where the default is Repetition = 1 . Although the print method for the errprof object returns a list, the object stores the unique error estimates for every imputation method, repetition, and missing data interval. These values are used to estimate the averages in the printed output and to plot the distribution of errors with plot_errors() shown below.

plot_errors(a)
plot_errors(a, plotType = 'line')
plot_impute(showmiss = T)

Demonstration of imputeTestbench with examples

The austres dataset is a ts object of Australian population in thousands, measured quarterly from 1971 to 1994 (Brockwell and Davis, 1996). The plot_errors() function shows that all imputation methods had larger error values with additional missing observations, as expected, and that the na.mean imputation method had the largest error values. Differences between the error values can be understood by viewing a sample of the imputed data with plot_impute(). The example belows shows an example of imputed values using the na.approx, na.locf, and na.mean functions at 10% and 90% missing observations using MCAR sampling.

aus <- datasets::austres
aus
str(aus)
summary(aus)
ex <- impute_errors((dataIn = aus))
ex
plot_errors(ex, plotType = 'line')
plot_impute(aus)
plot_impute(aus, methods = c("na.mean", "na.locf", "na.approx"), missPercent = 90)

Sample

Source: http://www.neerajbokde.com/cran/imputetestbench

datax <- c(1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5,1:5)
length(datax)
datax
library(imputeTestbench)

q <- impute_errors(datax)
q
plot_impute(datax, methods = c("na.mean", "na.locf", "na.approx"))
plot_errors(q)
plot_errors(dataIn = q, plotType = "line")
attrib <- attr(q, 'errall')
attrib

Applying on activities dataset

library(RepDataPeerAssessment1)

data("activity")
activity
ok <- complete.cases(activity$steps)
st <- activity$steps[ok]
length(st)
library(imputeTestbench)

itb <- impute_errors(st)
itb
plot_errors(itb)
plot_errors(itb, plotType = 'line')
plot_impute(st, methods = c("na.mean", "na.locf", "na.interp"), missPercent = 10)
err_mape <- impute_errors(st, errorParameter = "mae")
err_mape
plot_errors(dataIn = err_mape, plotType = "line")

attr will give us the values of the five methods (approx, interp, interpolation, locf, mean) for each of the nine NA percentage errors.

attr(itb, 'errall')

Adding an imputation function

As described above, imputation methods supplied by the user can be added to impute_errors(). The example below demonstrates the addition of a random number imputation method to the error profile. An R script file must be created for adding and saving the function. Additional functions can be added to the script as needed. User-supplied functions for imputation should use time series data with missing values as input and return the time series data with the imputed values as shown below.

The path where the R script is saved is used as an input string to the methodPath argument. The name of the new function is added to the methods argument, including any of the default methods used by impute_errors. Results are shown below.

ex <- impute_errors(dataIn = aus, methodPath = '../../R/imputation.R',
methods = c('na.mean', 'na.locf', 'na.approx', 'sss'))

ex


AlfonsoRReyes/RepDataPeerAssessment1 documentation built on May 5, 2019, 4:53 a.m.