Data quality simulation can be used to check the robustness of data analysis findings and to learn about the impact of data quality contaminations on classification. This package adds contaminations (noise, missing values, outliers, low variance, irrelevant features, class swap (inconsistency), class imbalance and decrease in data volume) to data and then evaluates the simulated data sets for classification accuracy. As a lightweight solution, simulation runs can be set up with little or no up-front effort.
The package can be used to create contaminated data sets. preprosimrun() is the main execution function, and its default settings create 6561 contaminated data sets. In the example below, the argument 'fitmodels' is set to FALSE (so classification accuracies are not computed) and the default setup is used (the argument 'param' is not given).
library(preprosim)
res <- preprosimrun(iris, fitmodels=FALSE)
All contaminated data sets can be acquired as a list from the data slot:
datasets <- res@data
length(datasets)
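The figure of 6561 equals 3^8, which is what a full grid over the eight contaminations would give if each had three intensity levels. The base R sketch below only illustrates that count; the three levels (0, 0.1, 0.2) and the placeholder column names are assumptions, not the package's documented defaults.

```r
# Placeholder names for the eight contaminations (hypothetical, not the package's names)
contaminations <- paste0("contamination", 1:8)

# Assumed intensity levels per contamination
levels_assumed <- c(0, 0.1, 0.2)

# Full factorial grid: one row per combination of intensity levels
grid <- expand.grid(
  setNames(rep(list(levels_assumed), length(contaminations)), contaminations)
)

nrow(grid)  # 3^8 = 6561 combinations
```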
The data set corresponding to a specific combination of contaminations can be acquired as a data frame with the getpreprosimdf() function.
df <- getpreprosimdf(res, c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
head(df)
The second argument above has the contamination parameters in the following order:
The preprosimrun() function with the default value fitmodels=TRUE fits models and computes classification accuracy for each contaminated data set. Note that the selected model must be able to deal with missing values AND have built-in variable importance scoring. Only the 'rpart' and 'gbm' models have been tested.
The parameter object controls which contaminations are applied. In the example below, the impact of missing values (primary, 10 contamination levels) and noise (secondary, 3 contamination levels) on classification accuracy is studied. The classifier 'rpart' is used as the model instead of the default 'gbm', and two repeated holdout rounds are used. The argument 'cores' is not given, so 1 core is used by default.
res <- preprosimrun(iris, param=newparam(iris, "custom", x="misval", z="noise"), caretmodel="rpart", holdoutrounds = 2, verbose=FALSE)
preprosimplot(res)
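To make the idea of a contamination study concrete, the sketch below repeats a small version of it in base R plus 'rpart' (a recommended package shipped with standard R installations). It is not the package's implementation: the helpers add_missing() and holdout_accuracy(), the 70/30 split and the chosen fractions are all assumptions made for illustration.

```r
library(rpart)

set.seed(1)

# Hypothetical helper: set a fraction of the values in each numeric column to NA
add_missing <- function(df, fraction) {
  num_cols <- which(sapply(df, is.numeric))
  for (j in num_cols) {
    idx <- sample(nrow(df), size = round(fraction * nrow(df)))
    df[idx, j] <- NA
  }
  df
}

# Hypothetical helper: one holdout round, returning accuracy on the held-out rows
holdout_accuracy <- function(df, train_share = 0.7) {
  train_idx <- sample(nrow(df), size = round(train_share * nrow(df)))
  fit <- rpart(Species ~ ., data = df[train_idx, ])
  pred <- predict(fit, newdata = df[-train_idx, ], type = "class")
  mean(pred == df$Species[-train_idx])
}

# Mean accuracy over two holdout rounds for each assumed missing-value fraction
fractions <- c(0, 0.1, 0.2, 0.3)
accuracy <- sapply(fractions, function(f) {
  mean(replicate(2, holdout_accuracy(add_missing(iris, f))))
})

data.frame(misval = fractions, accuracy = accuracy)
```

'rpart' is used here because it tolerates missing predictor values, which matches the requirement stated above for models passed to preprosimrun().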
Specific dependencies between contaminations can be plotted by giving the 'xz' argument to the preprosimplot() function.
preprosimplot(res, "xz", x="misval", z="noise")
The corresponding result data can be acquired with the getpreprosimdata() function. In the example below, 'x' and 'y' in the str() output correspond to the arguments given to preprosimplot(), and no other contaminations are applied, similar to a design of experiments (all other parameter values are set to zero).
data <- getpreprosimdata(res, "xz", x="misval", z="noise")
str(data)
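The "all other parameter values set to zero" selection can be pictured as a filter over a grid of parameter combinations. The sketch below shows that filter in base R on a small made-up grid; the column names and levels are assumptions, and it does not reproduce how getpreprosimdata() works internally.

```r
# Hypothetical grid of contamination intensities (names and levels are assumed)
grid <- expand.grid(
  misval  = c(0, 0.1, 0.2, 0.3),
  noise   = c(0, 0.1, 0.2),
  outlier = c(0, 0.1)
)

# Contaminations of interest; every other column must be zero
keep   <- c("misval", "noise")
others <- setdiff(names(grid), keep)

# Keep only rows where all remaining contaminations are switched off
xz <- grid[rowSums(grid[, others, drop = FALSE] != 0) == 0, ]

nrow(xz)  # 4 misval levels x 3 noise levels = 12 rows
```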
Variable importance (i.e. robustness of variables in classification task) in the contaminated data sets can be plotted:
The package includes eight built-in contaminations with parameters as contamination intensities. Contamination names, contents, parameter ranges and core definitions are presented below. For full definitions, please see the source code.
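As a rough illustration of a parameter acting as a contamination intensity, the sketch below adds Gaussian noise scaled by an intensity value. The helper add_noise() and its scaling rule are assumptions made for this example and should not be read as the package's definition of its noise contamination.

```r
set.seed(42)

# Hypothetical helper: add Gaussian noise to every numeric column, with a
# standard deviation of `intensity` times that column's own standard deviation
add_noise <- function(df, intensity) {
  num_cols <- sapply(df, is.numeric)
  df[num_cols] <- lapply(df[num_cols], function(x) {
    x + rnorm(length(x), mean = 0, sd = intensity * sd(x))
  })
  df
}

noisy <- add_noise(iris, 0.1)          # mild contamination
clean_unchanged <- add_noise(iris, 0)  # intensity 0 leaves the data as it was

head(noisy)
```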
The package author is happy to include suggested new contaminations. Please contact the package author.
Each contamination has three sub parameters:
Parameter objects can be initialized with the newparam() constructor. The constructor reads the data frame and sets the parameters. In the example below, the parameters are first set in a default manner, then as empty, and lastly for a specific purpose.
pa <- newparam(iris)
pa1 <- newparam(iris, "empty")
pa2 <- newparam(iris, "custom", "misval", "noise")
Parameters of an existing parameter object can be changed with the changeparam() function.
pa <- changeparam(pa, "noise", "cols", value=1)
pa <- changeparam(pa, "noise", "param", value=c(0,0.1))
pa <- changeparam(pa, "noise", "order", value=1)
The data quality of a contaminated data set can be visualized with the package preproviz. In the example below, the data frame 'df' acquired above is visualized for data quality issue interdependencies.
library(preproviz)
viz <- preproviz(df)
plotVARCLUST(viz)
In a similar manner, the same data frame 'df' acquired above could be preprocessed for optimal classification accuracy with the package preprocomb.
library(preprocomb)
grid <- setgrid(preprodefault, df)
result <- preprocomb(grid)
result@bestclassification
For further information, please see the vignettes of the packages preproviz and preprocomb.