Home

/

CRAN

/

preprosim

/

Preprosim"
In preprosim: Lightweight Data Quality Simulation for Classification

Data quality simulation can be used to check the robustness of data analysis findings and learn about the impact of data quality contaminations on classification. This package helps to add contaminations (noise, missing values, outliers, low variance, irrelevant features, class swap (inconsistency), class imbalance and decrease in data volume) to data and then evaluate the simulated data sets for classification accuracy. As a lightweight solution simulation runs can be set up with no or minimal up-front effort.

Quick start

Example 1: Creating contaminations

The package can be used to create contaminated data sets. Preprosimrun() is the main execution function and its default settings create 6561 contaminated data sets. In the example below argument 'fitmodels' is set to FALSE (not to compute classification accuracies) and default setup is used (argument 'param' is not given).

library(preprosim)
res <- preprosimrun(iris, fitmodels=FALSE)

All contaminated data sets can be acquired as a list from the data slot:

datasets <- res@data
length(datasets)

The data set corresponding to a specific combination of contaminations can be acquired as a dataframe with getpreprosimdf() function.

df <- getpreprosimdf(res, c(0.1,0.1,0.1,0.1,0.1,0.1,0.1,0.1))
head(df)

The second argument above has the contamination parameters in the following order:

str(res@grid, give.attr=FALSE)

Example 2: Classification accuracy of contaminated data sets

Preprosimrun() function with default value fitmodels=TRUE can be used to fit models and compute classification accuracy for each contaminated data set. Note that the selected model must be able to deal with missing values AND have an in-build variable importance scoring. Only 'rpart' and 'gbm' models are tested.

Parameter object is controlling, which contaminations are applied. In the example below the impact of missing values (primary, 10 contamination levels) and noise (secondary, 3 contamination levels ) on classification accuracy is studied. Classifier 'rpart' is used as a model instead of default 'gbm' and two times repreated holdout rounds are used. Argument 'cores' is not given, using 1 core by default.

res <- preprosimrun(iris, param=newparam(iris, "custom", x="misval", z="noise"), caretmodel="rpart", holdoutrounds = 2, verbose=FALSE)
preprosimplot(res)

Specific dependencies between contaminations can be plotted by giving 'xz' argument to preprosimplot() function.

preprosimplot(res, "xz", x="misval", z="noise")

The corresponding result data can be acquired with getpreprosimdata() function. In the exampe below 'x' and 'y' in str() function output correspond to arguments given in preprosimplot() and no other contaminations are applied similar to design of experiment (all other parameter values set to 0 zero).

data <- getpreprosimdata(res, "xz", x="misval", z="noise")
str(data)

Variable importance (i.e. robustness of variables in classification task) in the contaminated data sets can be plotted:

preprosimplot(res, "varimportance")

Customization

The package includes eight build-in contaminations with parameters as contamination intensities. Contamination names, contents, parameter ranges and core definitions are presented below. For full definitions, please see the source code.

noise
normal random number having original value in data as mean and parameter as standard deviation
rnorm(length(x), x, param@noiseparam)
lowvar (low variance)
parameter by which the original value is moved towards the mean of the variable
0 = none, 1=all values are mean
multiplierdifftomean <- lowvarianceparameter * scale(x, scale=FALSE)
newvalue <- x - multiplierdifftomean
misval (missing values)
parameter for the share of missing values
0=none, 1 = all
positionstomissingvalue <- sample(1:length(x), numberofmissingvalue)
x[positionstomissingvalue] <- NA
irfeature (irrelevant features)
parameter for the share of irrelevant features generated
0 = none, 1 = as many as there are variables in the original data
numberofirrelevantfeatures <- as.integer(param@irfeatureparam * ncol(data@x))
basedata <- data.frame(basedata, newvar=runif(nrow(data@x), -1, 1))
classswap (inconsistency)
share of class labels that are swapped
0=none, 1=all
classimbalance
share of observations to be removed from the most frequent class
0=none, 1=all
volumedecrease
share of observations removed from the data
0=none, 1=all removed
caret::createDataPartition(data@y, times = 1, p = param@volumedecreaseparam, list=FALSE)
outlier
number of observations replaced with +IQR1.5 to +IQR2.0 outlier
0=none, 1=all
outliers <- runif(d, smallestoutlier, largestoutlier)
tobereplaced <- sample(1:length(x),d)
x[tobereplaced] <- outliers

The package author is happy to include suggested new contaminations. Please contact the package author.

Parameter structure

Each contamination has three sub parameters:

cols as columns the contamination is applied to
param as the parameter of the contamination itself (i.e. intensity of contamination)
order as order in which the parameter is applied to the data.

Parameter construction

Parameter objects can be initialized with newparam constructor(). The constructor reads the data frame and sets the parameters. In the example below, first the parameters as set in a default manner, then as empty and lastly for a specific purpose.

pa <- newparam(iris)
pa1 <- newparam(iris, "empty")
pa2 <- newparam(iris, "custom", "misval", "noise")

Parameter change

Parameters of an existing parameter object can be changed with changeparam() function.

pa <- changeparam(pa, "noise", "cols", value=1)
pa <- changeparam(pa, "noise", "param", value=c(0,0.1))
pa <- changeparam(pa, "noise", "order", value=1)

Supporting packages

The data quality of a contaminated data set can be visualized with package preproviz. In the example below the data frame 'df' acquired above is visualized for data quality issue interdependencies.

library(preproviz)
viz <- preproviz(df)
plotVARCLUST(viz)

In a similar manner the same data frame 'df' acquired above could be preprocesssed for optimal classification accuracy with package preprocomb.

library(preprocomb)
grid <- setgrid(preprodefault, df)
result <- preprocomb(grid)
result@bestclassification

For further information, please see package preproviz and preprocomb vignettes.

Any scripts or data that you put into this service are public.

preprosim documentation built on May 1, 2019, 6:27 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

preprosim
Lightweight Data Quality Simulation for Classification

Preprosim"
In preprosim: Lightweight Data Quality Simulation for Classification

Quick start

Example 1: Creating contaminations

Example 2: Classification accuracy of contaminated data sets

Customization

Parameter structure

Parameter construction

Parameter change

Supporting packages

Try the preprosim package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

preprosim Lightweight Data Quality Simulation for Classification

Preprosim" In preprosim: Lightweight Data Quality Simulation for Classification

Quick start

Example 1: Creating contaminations

Example 2: Classification accuracy of contaminated data sets

Customization

Parameter structure

Parameter construction

Parameter change

Supporting packages

Try the preprosim package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

preprosim
Lightweight Data Quality Simulation for Classification

Preprosim"
In preprosim: Lightweight Data Quality Simulation for Classification