rebalanceData: Create/Remove More or Less Observations


View source: R/machineLearning.R

Description

From an xts object, create additional observations (jittered copies or exact duplicates of nearby observations) or remove observations. The workhorse is the CRAN package UBL (Utility-Based Learning) and its *Regress functions; this function is a smart(er) wrapper around them.

GaussNoiseRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
                              C.perc = "balance", pert = 0.1, repl = FALSE)

# default
# from current data, makes "exact replicated" copies
ImpSampRegress : function (form, dat, rel = "auto", thr.rel = NA,
                           C.perc = "balance", O = 0.5, U = 0.5)

# from current data, makes "exact replicated" copies
RandOverRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
                            C.perc = "balance", repl = TRUE)

# from current data, removes observations
RandUnderRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
                             C.perc = "balance", repl = FALSE)

# from current data, makes "jittered" copies
SmoteRegress : function (form, dat, rel = "auto", thr.rel = 0.5,
                         C.perc = "balance", k = 5, repl = FALSE,
                         dist = "Euclidean", p = 2)

UtilOptimRegress : function (form, train, test, type = "util",
                             strat = "interpol",
                             strat.parms = list(method = "bilinear"),
                             control.parms, m.pts, minds, maxds, eps = 0.1)

# Helper for the control.parms parameter of UtilOptimRegress (just above)

    phi.control : function (y, method = "extremes", extr.type = "both",
                            coef = 1.5, control.pts = NULL)

Usage

rebalanceData(
  x,
  x2 = NULL,
  Fmla = NULL,
  TrainDates = NULL,
  TestDates = NULL,
  UBLFunction = NULL,
  ...
)

Arguments

x

xts object of training data. No default. Required.

x2

xts object of testing data. Default is NULL. Required by, and only used with, UtilOptimRegress; supplying it with any other UBL function is an error.

Fmla

Default is NULL. Required. Formula that is sent to the UBL function.

TrainDates

Default is NULL. Not required. Absolute training start and end dates (or times), given as a vector holding one start/end pair. Alternatively, a list of such pair vectors.

TestDates

Default is NULL. Not required. Can only be used with UtilOptimRegress. Absolute testing start and end dates (or times), given as a vector holding one start/end pair. Alternatively, a list of such pair vectors.

UBLFunction

Default is NULL, in which case the ImpSampRegress function is used. Not required. A *Regress function from the R package UBL. Supply the function name either as a quoted "string" or as a bare function name.

...

Dots passed to the UBL function. The defaults follow. thr.rel = 0.5. C.perc = list(1, 2): keep the unimportant data at its current size (1) and double the size of the important data (2). Relevance function (rel): xts coredata values greater than zero are important; conversely, xts coredata values less than zero are not important.
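The default relevance function described for the dots can be pictured as a smooth step through three control points. The following is a minimal sketch in base R, assuming a cubic Hermite interpolation (stats::splinefunH) through the control points used in the Examples section; UBL's own interpolation may differ in detail, and `relevance` is a hypothetical helper name, not part of UBL or of this package.

```r
# Control points: y value, relevance at that value, slope of relevance there.
ctrl <- matrix(c(-0.01, 0.0, 0.0,
                  0.00, 0.5, 0.5,
                  0.01, 1.0, 0.0),
               ncol = 3, byrow = TRUE)

# Cubic Hermite spline through the control points (illustrative assumption).
hermite <- stats::splinefunH(x = ctrl[, 1], y = ctrl[, 2], m = ctrl[, 3])

# Hypothetical helper: clamp y to the control range, so values outside it take
# the relevance of the nearest end point, then interpolate.
relevance <- function(y) {
  hermite(pmin(pmax(y, min(ctrl[, 1])), max(ctrl[, 1])))
}

y   <- c(-2, -0.5, 0.5, 3)
rel <- relevance(y)

# With thr.rel = 0.5, negative values fall below the threshold and positive
# values above it.
data.frame(y = y, relevance = rel, important = rel > 0.5)
```

Because the control points sit very close to zero, almost every negative value gets relevance 0 and almost every positive value gets relevance 1, which is what makes the sign of the coredata the effective split.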

Value

Modified xts object from which data has been removed and/or which has duplicate (or multiple) index items at the same points in time, carrying either "jittered" coredata values or "exact replicated" coredata values.
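To make the shape of the returned object concrete, here is a minimal base-R sketch of "exact replicated" over-sampling with C.perc = list(1, 2). It is a simplified illustration under stated assumptions: a plain data frame with a hypothetical `date` column stands in for the xts index, `y > 0` stands in for "relevance above thr.rel", and every relevant row is copied exactly once, whereas the UBL functions sample rather than copy deterministically.

```r
set.seed(1L)
dat <- data.frame(date = as.Date(0:999, origin = "1970-01-01"),
                  y    = rnorm(1000))

relevant <- dat$y > 0              # stand-in for relevance above thr.rel
n_rel    <- sum(relevant)

# Over-sample the relevant rows by a factor of 2: keep every row once and
# append one extra copy of each relevant row.
out <- rbind(dat, dat[relevant, ])
out <- out[order(out$date), ]      # duplicates sit next to their originals

table(out$y > 0)                   # the TRUE count is twice that of dat
anyDuplicated(out$date) > 0        # some dates now occur more than once
```

The duplicated dates are the "duplicate index items" mentioned above: the extra rows do not get new time points, they share the time point of the observation they were copied from.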

References

SmoteRegress challenges #2 https://github.com/paobranco/UBL/issues/2

question about new/replicated UBL data and range of creation area #3 https://github.com/paobranco/UBL/issues/3

P. Branco, L. Torgo and R.P. Ribeiro, Pre-processing approaches for imbalanced distributions in regression, Neurocomputing, https://doi.org/10.1016/j.neucom.2018.11.100 https://web.cs.dal.ca/~branco/PDFfiles/j14.pdf

Proceedings of Machine Learning Research, Volume 74, 11 October 2017 https://github.com/mlresearch/v74

(BROKEN LINK) Luis Torgo: Learning with Imbalanced Domains, a tutorial, 2nd International Workshop on Learning with Imbalanced Domains: Theory and Applications Co-located with ECML/PKDD 2018 http://lidta.dcc.fc.up.pt/Slides/TutorialLIDTA.pdf

Paula Branco, Rita P. Ribeiro, Luis Torgo: UBL: an R package for Utility-based Learning, (Submitted on 27 Apr 2016 (v1), last revised 12 Jul 2016 (this version, v2)) https://arxiv.org/abs/1604.0807

Ribeiro, R.P.: Utility-based Regression. PhD thesis, Dep. Computer Science, Faculty of Sciences - University of Porto (2011), Chapter 3 Utility-based Regression https://www.dcc.fc.up.pt/~rpribeiro/publ/rpribeiroPhD11.pdf

Paula Branco, Luis Torgo and Rita Ribeiro: A Survey of Predictive Modeling on Imbalanced Domains, ACM Comput. Surv., 2016, volume 49, number 2, article 31 https://web.cs.dal.ca/~ltorgo/publication/2016_btr16/2016_BTR16.pdf

Examples

## Not run: 
set.seed(1L)
DataValues <- data.frame(x = as.numeric(seq_len(1000)), y = rnorm(1000, 0, 1))
row.names(DataValues) <- seq_len(1000)

table(DataValues$y > 0.00)
# FALSE  TRUE
#   518   482

# Relevance function
Rlvce <- matrix(c(-0.01, 0, 0, 0.00, 0.5, 0.5, 0.01, 1, 0), ncol = 3, byrow = TRUE,
                dimnames = list(
                  yvalues = character(),
                  col = c("yvalues", "relevance", "slope_of_y_values")
                )
)

# Relevant observations: important to me.
# I want MORE of these "relevant" observations
# (compared to "not very relevant" observations).
#
# yvalues: negative(-) values are not VERY relevant
# yvalues: positive(+) values are VERY relevant
# relevance column:  0 - not very relevant, 1 - very relevant
#
# The relevance function defines a smooth, non-straight line.
# It uses only the columns yvalues, relevance and slope_of_y_values;
# see the references.
# This relevance function is a curved line shaped like half of a hill.
#
#
Rlvce
# +/-
#        col
# yvalues yvalues relevance slope_of_y_values
#    [1,]   -0.01       0.0               0.0 # yvalues less than thr.rel (bottom of hill)
#    [2,]    0.00       0.5               0.5 # relevance col: thr.rel = 0.5
#    [3,]    0.01       1.0               0.0 # yvalues greater than thr.rel (top of hill)

# default "threshold of relevance" (thr.rel) between the "not very relevant" and "relevant"
# ranges
# "thr.rel = 0.5"                                                # [1,]->[2,] [2,]->[3,]
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(0.5, 2.5))

# no change
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(1, 1))
identical(sort(DataValues[,"x"]), sort(Results[,"x"]))
# [1] TRUE
identical(sort(DataValues[,"y"]), sort(Results[,"y"]))
# [1] TRUE

# new jitters of the current data
#
# double the number of (important) relevant observations
# default "thr.rel = 0.5"
#                                           # 100 percent, # 200 percent
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(1, 2))

table(Results$y > 0.00)
# FALSE  TRUE
#   518   964

# new replicas of the current data
#
# ImpSampRegress defaults to "thr.rel = NA"; set thr.rel = 0.5
# to create/remove observations the way SmoteRegress does
Results <- UBL::ImpSampRegress(y ~ ., DataValues, rel = Rlvce, thr.rel = 0.5, C.perc = list(1, 2))
table(Results$y > 0.00)
# FALSE  TRUE
#   518   964

# see the replicated data points
tail(Results[order(Results$x),],30)
# Results[order(as.integer(row.names(Results))),]

# halve the number of (unimportant) not very relevant observations
#
Results <- UBL::SmoteRegress(y ~ ., DataValues, rel = Rlvce, C.perc = list(0.5, 1))
table(Results$y > 0.00)
# FALSE  TRUE
#   259   482

Results <- UBL::ImpSampRegress(y ~ ., DataValues, rel = Rlvce, thr.rel = 0.5, C.perc = list(0.5, 1))
table(Results$y > 0.00)
# FALSE  TRUE
#   259   482

# xts object

DataIndex <- zoo::as.Date(0L:999L)
DataXts <- xts::as.xts(DataValues, DataIndex, dateFormat = "Date")
table(DataXts[,"y"] > 0.00)

# double the "important" data (jitters)
ResultsXts <- rebalanceData(DataXts, Fmla = y ~ ., UBLFunction = "UBL::SmoteRegress")
table(ResultsXts[,"y"] > 0.00)

# double the "important" data (exact data)
ResultsXts <- rebalanceData(DataXts, Fmla = y ~ .)
table(ResultsXts[,"y"] > 0.00)

# halve the "not important" data
ResultsXts <- rebalanceData(DataXts, Fmla = y ~ ., C.perc = list(0.5, 1))
table(ResultsXts[,"y"] > 0.00)

## End(Not run)
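The examples above do not exercise TrainDates. As a rough sketch of the idea in base R, assume each pair is an inclusive start/end window and that a list of pairs selects the union of the windows. Both points are assumptions about the semantics and are not confirmed by this page; `in_windows` is a hypothetical helper, not a function of this package.

```r
idx <- as.Date(0:999, origin = "1970-01-01")

# Hypothetical helper: TRUE for index values that fall inside any start/end pair.
in_windows <- function(index, windows) {
  if (!is.list(windows)) windows <- list(windows)   # accept a single pair
  keep <- rep(FALSE, length(index))
  for (w in windows) {
    w <- as.Date(w)
    keep <- keep | (index >= w[1] & index <= w[2])
  }
  keep
}

# A single pair, then a list of pairs.
one  <- in_windows(idx, c("1970-01-01", "1970-12-31"))
many <- in_windows(idx, list(c("1970-01-01", "1970-12-31"),
                             c("1972-01-01", "1972-06-30")))
sum(one)   # 365: every day of 1970
sum(many)  # 547: 1970 plus the first half of 1972
```

Under these assumptions, rows outside the selected windows would simply be left out before the UBL function sees the data.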

AndreMikulec/econModel documentation built on June 30, 2021, 9:48 a.m.