Data quality simulation can be used to check the robustness of data analysis findings and to learn how data quality contaminations affect classification. This package adds contaminations (noise, missing values, outliers, low variance, irrelevant features, class swap (inconsistency), class imbalance and decrease in data volume) to data and then evaluates the simulated data sets for classification accuracy. As a lightweight solution, simulation runs can be set up with little or no up-front effort.

Quick start

Example 1: Creating contaminations

The package can be used to create contaminated data sets. preprosimrun() is the main execution function, and its default settings create 6561 contaminated data sets. In the example below, argument 'fitmodels' is set to FALSE (so that classification accuracies are not computed) and the default setup is used (argument 'param' is not given).

res <- preprosimrun(iris, fitmodels=FALSE)

All contaminated data sets can be acquired as a list from the data slot:

datasets <- res@data

The data set corresponding to a specific combination of contaminations can be acquired as a data frame with the getpreprosimdf() function.

df <- getpreprosimdf(res, c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))

The second argument above has the contamination parameters in the following order:

str(res@grid, give.attr=FALSE)
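The count of 6561 data sets is consistent with a full factorial grid of eight contaminations at three levels each (3^8 = 6561). The sketch below rebuilds such a grid in base R. The level values (0, 0.1, 0.2) and the column order are assumptions made for illustration; the actual order is the one reported by str(res@grid).

```r
# Hypothetical level values; the package defaults may differ.
lvls <- c(0, 0.1, 0.2)

# Contamination names as used in this document; the column order is an assumption.
contaminations <- c("noise", "lowvar", "misval", "irfeature",
                    "classswap", "classimbalance", "volumedecrease", "outlier")

# One column per contamination, one row per combination of levels.
grid <- expand.grid(setNames(rep(list(lvls), length(contaminations)),
                             contaminations))

nrow(grid)  # 3^8 combinations

# Locate the row in which every contamination parameter equals 0.1.
target <- rep(0.1, length(contaminations))
match_row <- which(apply(grid, 1, function(r) isTRUE(all.equal(unname(r), target))))
```

Each row of such a grid corresponds to one contaminated data set, which is why a vector of eight parameter values is enough to identify a single data set.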

Example 2: Classification accuracy of contaminated data sets

The preprosimrun() function with the default value fitmodels=TRUE fits models and computes classification accuracy for each contaminated data set. Note that the selected model must be able to deal with missing values AND have built-in variable importance scoring. Only the 'rpart' and 'gbm' models have been tested.

The parameter object controls which contaminations are applied. In the example below, the impact of missing values (primary, 10 contamination levels) and noise (secondary, 3 contamination levels) on classification accuracy is studied. Classifier 'rpart' is used as the model instead of the default 'gbm', and two repeated holdout rounds are used. Argument 'cores' is not given, so 1 core is used by default.

res <- preprosimrun(iris, param=newparam(iris, "custom", x="misval", z="noise"), caretmodel="rpart", holdoutrounds = 2, verbose=FALSE)

Specific dependencies between contaminations can be plotted by giving the 'xz' argument to the preprosimplot() function.

preprosimplot(res, "xz", x="misval", z="noise")

The corresponding result data can be acquired with the getpreprosimdata() function. In the example below, 'x' and 'y' in the str() function output correspond to the arguments given to preprosimplot(). No other contaminations are applied, as in a design of experiments: all other parameter values are set to zero.

data <- getpreprosimdata(res, "xz", x="misval", z="noise")
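The design described above (10 levels for the primary contamination, 3 for the secondary, everything else held at zero) can be pictured as a small grid. This is a base R sketch only; the level values are hypothetical and the package may choose different ones.

```r
# Hypothetical levels: 10 for the primary (x) and 3 for the secondary (z) contamination.
misval_levels <- seq(0, 0.9, by = 0.1)
noise_levels  <- c(0, 0.1, 0.2)

# All remaining contaminations are held at 0, so only misval and noise vary.
design <- expand.grid(misval = misval_levels, noise = noise_levels,
                      lowvar = 0, irfeature = 0, classswap = 0,
                      classimbalance = 0, volumedecrease = 0, outlier = 0)

nrow(design)  # 10 * 3 combinations
```

Holding the other parameters at zero means any change in accuracy across these rows can be attributed to missing values, noise, or their interaction.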

Variable importance (i.e. robustness of variables in classification task) in the contaminated data sets can be plotted:

preprosimplot(res, "varimportance")


The package includes eight built-in contaminations with parameters as contamination intensities. Contamination names, contents, parameter ranges and core definitions are presented below. For full definitions, please see the source code.

  1. noise
     Contents: normal random number having the original value in the data as mean and the parameter as standard deviation
     Core definition:
       rnorm(length(x), x, param@noiseparam)

  2. lowvar (low variance)
     Contents: parameter by which the original value is moved towards the mean of the variable
     Parameter range: 0 = none, 1 = all values are mean
     Core definition:
       multiplierdifftomean <- lowvarianceparameter * scale(x, scale=FALSE)
       newvalue <- x - multiplierdifftomean

  3. misval (missing values)
     Contents: parameter for the share of missing values
     Parameter range: 0 = none, 1 = all
     Core definition:
       positionstomissingvalue <- sample(1:length(x), numberofmissingvalue)
       x[positionstomissingvalue] <- NA

  4. irfeature (irrelevant features)
     Contents: parameter for the share of irrelevant features generated
     Parameter range: 0 = none, 1 = as many as there are variables in the original data
     Core definition:
       numberofirrelevantfeatures <- as.integer(param@irfeatureparam * ncol(data@x))
       basedata <- data.frame(basedata, newvar=runif(nrow(data@x), -1, 1))

  5. classswap (inconsistency)
     Contents: share of class labels that are swapped
     Parameter range: 0 = none, 1 = all

  6. classimbalance
     Contents: share of observations to be removed from the most frequent class
     Parameter range: 0 = none, 1 = all

  7. volumedecrease
     Contents: share of observations removed from the data
     Parameter range: 0 = none, 1 = all removed
     Core definition:
       caret::createDataPartition(data@y, times = 1, p = param@volumedecreaseparam, list=FALSE)

  8. outlier
     Contents: number of observations replaced with +IQR1.5 to +IQR2.0 outlier
     Parameter range: 0 = none, 1 = all
     Core definition:
       outliers <- runif(d, smallestoutlier, largestoutlier)
       tobereplaced <- sample(1:length(x), d)
       x[tobereplaced] <- outliers
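The core definitions of the column-wise contaminations can be tried out on a single numeric vector. The functions below are a base R sketch written for this purpose, not the package's source code. The function names, the rounding of counts, treating the outlier parameter as a share, and the outlier bounds (third quartile plus 1.5 to 2.0 times the IQR) are all assumptions.

```r
set.seed(1)  # make the random draws reproducible

# noise: each value becomes a normal draw centred on the original value
add_noise <- function(x, sd) rnorm(length(x), mean = x, sd = sd)

# lowvar: move each value towards the mean (0 = unchanged, 1 = all values equal the mean)
add_lowvar <- function(x, p) {
  diff_to_mean <- p * as.numeric(scale(x, scale = FALSE))
  x - diff_to_mean
}

# misval: replace a share p of the positions with NA (rounding is an assumption)
add_misval <- function(x, p) {
  n_missing <- round(p * length(x))
  x[sample(seq_along(x), n_missing)] <- NA
  x
}

# outlier: replace a share p of the positions with uniform draws between
# Q3 + 1.5 * IQR and Q3 + 2.0 * IQR (both bounds are assumptions)
add_outlier <- function(x, p) {
  d <- round(p * length(x))
  q3 <- unname(quantile(x, 0.75))
  iqr <- IQR(x)
  x[sample(seq_along(x), d)] <- runif(d, q3 + 1.5 * iqr, q3 + 2.0 * iqr)
  x
}

x <- iris$Sepal.Length
x_noise   <- add_noise(x, 0.1)
x_lowvar  <- add_lowvar(x, 1)
x_misval  <- add_misval(x, 0.1)
x_outlier <- add_outlier(x, 0.1)
```

With a parameter of 1, add_lowvar() collapses every value to the mean, and with 0 it returns the input unchanged, which matches the parameter range given for lowvar above.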

The package author is happy to include suggested new contaminations. Please contact the package author.

Parameter structure

Each contamination has three sub-parameters:

  1. cols: the columns the contamination is applied to
  2. param: the parameter of the contamination itself (i.e. intensity of contamination)
  3. order: the order in which the contamination is applied to the data.
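How the three sub-parameters could work together is sketched below with a plain list standing in for a parameter object. The list layout, the helper functions and the values are hypothetical and do not reproduce the package's own class.

```r
# Hypothetical plain-list stand-in for a parameter object (not the package's class).
params <- list(
  misval = list(cols = 1:2, param = 0.2, order = 2),
  noise  = list(cols = 1:4, param = 0.1, order = 1)
)

# Toy contamination functions for a single numeric column.
contaminate <- list(
  noise  = function(x, p) rnorm(length(x), mean = x, sd = p),
  misval = function(x, p) {
    x[sample(seq_along(x), round(p * length(x)))] <- NA
    x
  }
)

set.seed(1)
dat <- iris[, 1:4]

# Apply the contaminations in ascending 'order', each only to its own 'cols'.
ord <- order(vapply(params, function(p) p$order, numeric(1)))
for (name in names(params)[ord]) {
  p <- params[[name]]
  dat[p$cols] <- lapply(dat[p$cols], contaminate[[name]], p$param)
}

na_counts <- colSums(is.na(dat))
```

Here noise is applied first to all four columns, and missing values are then introduced only in the first two, so na_counts is non-zero for those two columns alone.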

Parameter construction

Parameter objects can be initialized with the newparam() constructor. The constructor reads the data frame and sets the parameters. In the example below, the parameters are first set in a default manner, then as empty, and lastly for a specific purpose.

pa <- newparam(iris)
pa1 <- newparam(iris, "empty")
pa2 <- newparam(iris, "custom", "misval", "noise")

Parameter change

Parameters of an existing parameter object can be changed with the changeparam() function.

pa <- changeparam(pa, "noise", "cols", value=1)
pa <- changeparam(pa, "noise", "param", value=c(0,0.1))
pa <- changeparam(pa, "noise", "order", value=1)

Supporting packages

The data quality of a contaminated data set can be visualized with package preproviz. In the example below the data frame 'df' acquired above is visualized for data quality issue interdependencies.

viz <- preproviz(df)

In a similar manner, the same data frame 'df' acquired above could be preprocessed for optimal classification accuracy with package preprocomb.

grid <- setgrid(preprodefault, df)
result <- preprocomb(grid)

For further information, please see package preproviz and preprocomb vignettes.

preprosim documentation built on May 1, 2019, 6:27 p.m.