View source: R/machineLearning.R
From an xts object, produce more or fewer observations: jittered or exactly duplicated copies of nearby observations are added, and/or observations are removed. The work is done by the CRAN package UBL (Utility Based Learning) and its *Regress functions; this function is a smart(er) wrapper around them.
GaussNoiseRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
C.perc = "balance", pert = 0.1, repl = FALSE)
# default
# from current data, makes "exact replicated" copies
ImpSampRegress : function (form, dat, rel = "auto", thr.rel = NA,
C.perc = "balance", O = 0.5, U = 0.5)
RandOverRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
C.perc = "balance", repl = TRUE)
# from current data, makes "jittered" copies
RandUnderRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
C.perc = "balance", repl = FALSE)
SmoteRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
C.perc = "balance", k = 5, repl = FALSE,
dist = "Euclidean", p = 2)
UtilOptimRegress : function (form, train, test, type = "util",
strat = "interpol",
strat.parms = list(method = "bilinear"),
control.parms, m.pts, minds, maxds, eps = 0.1)
# Help with UtilOptimRegress(just above) parameter control.parms
phi.control : function (y, method = "extremes", extr.type = "both",
coef = 1.5, control.pts = NULL)
x | xts object of training data. No default. Required.
x2 | xts object of testing data. Default is NULL. Required by, and only used with, UtilOptimRegress. Supplying it with any other function is an error.
Fmla | Default is NULL. Required. Formula that is sent to the UBL function.
TrainDates | Default is NULL. Not required. Absolute training start and end dates (or times), given as a vector holding one start/end pair. Alternatively, this can be a list of such vectors.
TestDates | Default is NULL. Not required. This parameter can only be used with UtilOptimRegress. Absolute testing start and end dates (or times), given as a vector holding one start/end pair. Alternatively, this can be a list of such vectors.
UBLFunction | Default is NULL, in which case the ImpSampRegress function is used. Not required. An R package UBL *Regress function. Enter the function name either enclosed in a "string" or as a bare function name.
... | Dots passed to the UBL function. The defaults follow. thr.rel = 0.5. C.perc = list(1, 2): keep the less important data at its original size and grow the important data to double its size. Relevance function (rel): xts coredata values greater than zero are important; xts coredata values less than zero are not important.
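The effect of C.perc on group sizes can be sketched with plain arithmetic. The helper below is hypothetical (it is not part of UBL or of this package) and assumes that the first C.perc entry applies to the less relevant group and the second to the relevant group, as the counts in the Examples section suggest.

```r
# Hypothetical helper, for illustration only: expected group sizes after
# applying the per-group multipliers in C.perc.
# Assumption: C.perc[[1]] scales the less relevant group and
# C.perc[[2]] scales the relevant group.
expected_sizes <- function(n_not_relevant, n_relevant, C.perc = list(1, 2)) {
  c(not_relevant = round(n_not_relevant * C.perc[[1]]),
    relevant     = round(n_relevant * C.perc[[2]]))
}

# Group counts taken from the Examples section (518 below zero, 482 above zero).
doubled <- expected_sizes(518, 482)                          # 518 and 964
halved  <- expected_sizes(518, 482, C.perc = list(0.5, 1))   # 259 and 482
print(doubled)
print(halved)
```

These numbers agree with the tables printed in the Examples section for C.perc = list(1, 2) and C.perc = list(0.5, 1).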
Modified xts object that has had data removed and/or has duplicated (multiplicated) index items at the same time points, carrying the "jittered" coredata values or "exact replicated" coredata values.
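Because the result can hold several rows at the same time point, it may help to inspect the duplicates before further processing. The following is a minimal base R sketch; it uses a toy data.frame with a Date column as a stand-in (an assumption made for illustration, since the function itself returns an xts object), and the values are made up.

```r
# Toy stand-in for a rebalanced series (made-up values, not real output).
rebalanced <- data.frame(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-02", "2020-01-03")),
  y    = c(-0.4, 0.7, 0.7, 1.2)
)

# Flag every row whose time point occurs more than once.
is_dup <- duplicated(rebalanced$date) |
  duplicated(rebalanced$date, fromLast = TRUE)
print(rebalanced[is_dup, ])

# Number of rows per time point.
rows_per_date <- table(rebalanced$date)
print(rows_per_date)
```

The same idea should carry over to an xts object by applying duplicated() to its index.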
SmoteRegress challenges #2 https://github.com/paobranco/UBL/issues/2
question about new/replicated UBL data and range of creation area #3 https://github.com/paobranco/UBL/issues/3
P. Branco, L. Torgo and R.P. Ribeiro, Pre-processing approaches for imbalanced distributions in regression, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.11.100 https://web.cs.dal.ca/~branco/PDFfiles/j14.pdf
Proceedings of Machine Learning Research, Volume 74, 11 October 2017 https://github.com/mlresearch/v74
(BROKEN LINK) Luis Torgo: Learning with Imbalanced Domains, a tutorial, 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications Co-located with ECML/PKDD 2018 http://lidta.dcc.fc.up.pt/Slides/TutorialLIDTA.pdf
Paula Branco, Rita P. Ribeiro, Luis Torgo: UBL: an R package for Utility-based Learning, (Submitted on 27 Apr 2016 (v1), last revised 12 Jul 2016 (this version, v2)) https://arxiv.org/abs/1604.0807
Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto (2011), Chapter 3 Utility-based Regression https://www.dcc.fc.up.pt/~rpribeiro/publ/rpribeiroPhD11.pdf
Paula Branco, Luis Torgo and Rita Ribeiro: A Survey of Predictive Modeling on Imbalanced Domains, ACM Comput. Surv., 2016, volume 49, number 2, article 31 https://web.cs.dal.ca/~ltorgo/publication/2016_btr16/2016_BTR16.pdf
## Not run:
set.seed(1L)
DataValues <- data.frame(x = as.numeric(seq_len(1000)), y = rnorm(1000, 0, 1))
row.names(DataValues) <- seq_len(1000)
table(DataValues$y > 0.00)
# FALSE TRUE
#   518  482
# Relevance function
Rlvce <- matrix(c(-0.01, 0, 0, 0.00, 0.5, 0.5, 0.01, 1, 0), ncol = 3, byrow = TRUE,
dimnames = list(
yvalues = character(),
col = c("yvalues", "relevance", "slope_of_y_values")
)
)
# Relevant observations: important to me.
# I want MORE of these "relevant" observations
# (compared to "not very relevant" observations.)
#
# yvalues: negative(-) values are not VERY relevant
# yvalues: positive(+) values are VERY relevant
# relevance column: 0 - not very relevant, 1 - very relevant
#
# The relevance function defines a smooth curve, not a straight line.
# It uses only: yvalues and slope_of_y_values
# see the references.
# This relevance function is a curve shaped like half of a hill.
#
Rlvce
# +/-
#         col
# yvalues yvalues relevance slope_of_y_values
#    [1,]   -0.01       0.0               0.0 # yvalues less than thr.rel (bottom of hill)
#    [2,]    0.00       0.5               0.5 # relevance col: thr.rel = 0.5
#    [3,]    0.01       1.0               0.0 # yvalues greater than thr.rel (top of hill)
# default "threshold of relevance" (thr.rel) between "not very relevant" and "relevant"
# ranges
# "thr.rel = 0.5" # [1,]->[2,] [2,]->[3,]
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(0.5, 2.5))
# no change
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(1, 1))
identical(sort(DataValues[,"x"]), sort(Results[,"x"]))
# [1] TRUE
identical(sort(DataValues[,"y"]), sort(Results[,"y"]))
# [1] TRUE
# new jitters of the current data
#
# double the number of (important) relevant observations
# default "thr.rel = 0.5"
# C.perc = list(1, 2): 100% of the not very relevant data, 200% of the relevant data
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(1, 2))
table(Results$y > 0.00)
# FALSE TRUE
#   518  964
# new replicas of the current data
#
# ImpSampRegress defaults to "thr.rel = NA"; use thr.rel = 0.5 to create/destroy obs like smote
Results <- UBL::ImpSampRegress(y ~ ., DataValues, rel = Rlvce, thr.rel = 0.5, C.perc = list(1, 2))
table(Results$y > 0.00)
# FALSE TRUE
#   518  964
# see the replicated data points
tail(Results[order(Results$x),],30)
# Results[order(as.integer(row.names(Results))),]
# halve the number of (un-important) not very relevant observations
#
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(0.5, 1))
table(Results$y > 0.00)
# FALSE TRUE
#   259  482
Results <- UBL::ImpSampRegress(y ~ ., DataValues, rel = Rlvce, thr.rel = 0.5, C.perc = list(0.5, 1))
table(Results$y > 0.00)
# FALSE TRUE
#   259  482
# xts object
DataIndex <- zoo::as.Date(0L:999L)
DataXts <- xts::as.xts(DataValues, DataIndex, dateFormat= "Date")
table(DataXts[,"y"] > 0.00)
# double the "important" data (jitters)
ResultsXts <- rebalanceData(y ~ ., DataXts, UBLFunction = "UBL::SmoteRegress")
table(ResultsXts[,"y"] > 0.00)
# double the "important" data (exact replicas)
ResultsXts <- rebalanceData(y ~ ., DataXts)
table(ResultsXts[,"y"] > 0.00)
# halve the "not important" data
ResultsXts <- rebalanceData(y ~ ., DataXts, C.perc = list(0.5, 1))
table(ResultsXts[,"y"] > 0.00)
## End(Not run)
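The three-column relevance matrix used above (y value, relevance, slope) can be visualised as a curve. The sketch below is only an approximation under a stated assumption: it swaps in the generic cubic Hermite spline from stats::splinefunH for whatever interpolation UBL performs internally, so the curve between control points may differ from the one UBL actually uses. The names y_values, relevance, slopes and rel_fun are illustrative.

```r
# Control points copied from the Rlvce matrix in the examples.
y_values  <- c(-0.01, 0.00, 0.01)   # column 1: y values
relevance <- c(0.0, 0.5, 1.0)       # column 2: relevance at those y values
slopes    <- c(0.0, 0.5, 0.0)       # column 3: slope of the curve there

# Assumption: a cubic Hermite spline stands in for UBL's own interpolation.
rel_fun <- stats::splinefunH(y_values, relevance, m = slopes)

# The curve passes through the control points by construction.
at_points <- rel_fun(y_values)
print(at_points)

# Optional: draw the "half of a hill" shape between the control points.
# curve(rel_fun(x), from = -0.01, to = 0.01, xlab = "y", ylab = "relevance")
```

With thr.rel = 0.5, the middle control point (y = 0, relevance = 0.5) marks the boundary between the "not very relevant" and "relevant" ranges described in the comments above.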