This strategy performs both over-sampling and under-sampling. Under-sampling is performed randomly on the examples whose relevance is below the threshold defined by the user. Over-sampling generates new synthetic examples by adding a small perturbation, in the form of Gaussian noise, to existing examples. A new example from a rare "class" is obtained by perturbing all the features and the target variable by a percentage of the respective standard deviation (evaluated on the rare examples). The value of each nominal feature of the new example is randomly selected according to the frequency of the values found in the rare cases of the bump under consideration.
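The perturbation step described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `perturb_example` is a hypothetical helper, the noise is assumed to be drawn with `rnorm` using a standard deviation of `pert` times the standard deviation of the rare cases, and at least two rare cases are assumed so that `sd` is defined.

```r
# Hypothetical helper (not part of the package): build one synthetic example
# from a base example, using the rare cases of the same bump as reference.
perturb_example <- function(base, rare, pert = 0.1) {
  new <- base
  for (col in names(rare)) {
    if (is.numeric(rare[[col]])) {
      # Assumption: Gaussian noise with sd = pert * sd of the rare cases
      noise <- rnorm(1, mean = 0, sd = pert * sd(rare[[col]]))
      new[[col]] <- base[[col]] + noise
    } else {
      # Nominal feature: draw a value according to its frequency in the rare cases
      freq <- table(rare[[col]])
      drawn <- sample(names(freq), 1, prob = as.numeric(freq))
      new[[col]] <- if (is.factor(rare[[col]])) {
        factor(drawn, levels = levels(rare[[col]]))
      } else {
        drawn
      }
    }
  }
  new
}

set.seed(1)
# Toy rare cases: one numeric feature, one nominal feature, numeric target y
rare <- data.frame(x = c(1.0, 1.5, 2.0, 2.5),
                   g = factor(c("a", "a", "b", "a")),
                   y = c(10, 12, 11, 13))
synthetic <- perturb_example(rare[1, ], rare, pert = 0.1)
```

With `pert = 0.1` the synthetic example stays close to its base example, while larger values spread the new examples further from the original rare cases.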
GaussNoiseRegress(form, dat, rel = "auto", thr.rel = 0.5, C.perc = "balance",
                  pert = 0.1, repl = FALSE)
form |
A formula describing the prediction problem |
dat |
A data frame containing the original (unbalanced) data set |
rel |
The relevance function, which can be determined automatically ("auto", the default) or provided by the user as a matrix of interpolating points. |
thr.rel |
A number indicating the relevance threshold above which a case is considered as belonging to the rare "class". |
C.perc |
A list containing the percentage(s) of under- and/or over-sampling to apply to each "class" (bump) obtained with the threshold. The |
pert |
A number indicating the level of perturbation to introduce when generating synthetic examples. With the base example as the center, this parameter defines the radius (based on the standard deviation) within which the new example is generated. |
repl |
A boolean value controlling whether examples may be repeated when under-sampling by selecting among the "normal" examples. |
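How `thr.rel`, `C.perc` and `repl` interact can be illustrated with a small base-R sketch. Everything here is an assumption made for illustration: the `relevance` scores are invented, "above the threshold" is read as strictly greater, a percentage is read as the ratio between the final and the original number of cases in a bump, and only one "normal" and one rare bump are considered. The package's own bookkeeping may differ.

```r
# Invented relevance scores in [0, 1]; in practice they come from `rel`
relevance <- c(0.1, 0.2, 0.9, 0.95, 0.3, 0.05, 0.1, 0.8, 0.2, 0.15)
thr.rel <- 0.5

# Assumption: cases strictly above the threshold form the rare "class"
rare_idx   <- which(relevance > thr.rel)
normal_idx <- which(relevance <= thr.rel)

# Percentages as in C.perc = list(0.5, 3): under-sample normal, over-sample rare
perc_normal <- 0.5
perc_rare   <- 3

set.seed(42)
# Random under-sampling of the "normal" cases; replace = FALSE mirrors repl = FALSE
n_keep      <- round(perc_normal * length(normal_idx))
kept_normal <- sample(normal_idx, n_keep, replace = FALSE)

# Number of synthetic rare cases needed to reach perc_rare times the original count
n_synthetic <- round((perc_rare - 1) * length(rare_idx))
```

Setting `replace = TRUE` in the `sample` call would allow the same "normal" case to be selected more than once, which is the behaviour the `repl` argument describes.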
The function returns a data frame with the new data set resulting from the application of random under-sampling and over-sampling through the generation of synthetic examples using Gaussian noise.
Paula Branco paobranco@gmail.com, Rita Ribeiro rpribeiro@dcc.fc.up.pt and Luis Torgo ltorgo@dcc.fc.up.pt
Sauchi Stephen Lee. (1999) Regularization in skewed binary classification. Computational Statistics Vol.14, Issue 2, 277-292.
Sauchi Stephen Lee. (2000) Noisy replication in skewed binary classification. Computational Statistics and Data Analysis Vol.34, Issue 2, 165-191.
if (requireNamespace("DMwR2", quietly = TRUE)) {
  data(algae, package = "DMwR2")
  # keep only the complete cases
  clean.algae <- data.frame(algae[complete.cases(algae), ])

  # user-supplied under-/over-sampling percentages
  C.perc <- list(0.5, 3)
  mygn.alg <- GaussNoiseRegress(a7~., clean.algae, C.perc = C.perc)

  # built-in "balance" and "extreme" options
  gnB.alg <- GaussNoiseRegress(a7~., clean.algae, C.perc = "balance",
                               pert = 0.1)
  gnE.alg <- GaussNoiseRegress(a7~., clean.algae, C.perc = "extreme")

  # compare the target variable distribution before and after resampling
  plot(density(clean.algae$a7))
  lines(density(gnE.alg$a7), col = 2)
  lines(density(gnB.alg$a7), col = 3)
  lines(density(mygn.alg$a7), col = 4)
} else {
  ir <- iris[-c(95:130), ]
  mygn1.iris <- GaussNoiseRegress(Sepal.Width~., ir, C.perc = list(0.5, 2.5))
  # same problem with a different relevance threshold
  mygn2.iris <- GaussNoiseRegress(Sepal.Width~., ir, C.perc = list(0.2, 4),
                                  thr.rel = 0.8)
  gnB.iris <- GaussNoiseRegress(Sepal.Width~., ir, C.perc = "balance")
  gnE.iris <- GaussNoiseRegress(Sepal.Width~., ir, C.perc = "extreme")

  # defining a relevance function
  rel <- matrix(0, ncol = 3, nrow = 0)
  rel <- rbind(rel, c(2, 1, 0))
  rel <- rbind(rel, c(3, 0, 0))
  rel <- rbind(rel, c(4, 1, 0))

  gn.rel <- GaussNoiseRegress(Sepal.Width~., ir, rel = rel,
                              C.perc = list(5, 0.2, 5))

  plot(density(ir$Sepal.Width), ylim = c(0, 1))
  lines(density(gnB.iris$Sepal.Width), col = 3)
  lines(density(gnE.iris$Sepal.Width, bw = 0.3), col = 4)
  # result obtained with the user-defined relevance function
  lines(density(gn.rel$Sepal.Width), col = 2)
}