randOverRegress_ST: Biased over-sampling for imbalanced regression...

View source: R/sampling_methods.R

Description

Based on RandOverRegress (R package UBL). This function performs random over-sampling for imbalanced regression problems, with a sampling bias based on spatio-temporal contextual information. A percentage of the cases in the "class(es)" selected by the user (bumps above a defined relevance threshold) is randomly over-sampled, with a sampling bias given by a spatio-temporal weight. Alternatively, the function can either balance all the existing "classes" (the default) or "smoothly invert" the frequency of the examples in each class.
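The idea can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the relevance values `phi`, the weights `st_wt`, and the helper `biased_oversample` are hypothetical names made up for this example, and all cases above the threshold are treated as a single "class".

```r
# Minimal sketch of biased random over-sampling (illustrative only).
# Assumptions: `phi` is a relevance score in [0, 1] per case and `st_wt` is a
# non-negative sampling weight per case; both are made up for this example.
biased_oversample <- function(dat, phi, st_wt, thr.rel = 0.5, perc = 1,
                              repl = TRUE) {
  rare <- which(phi > thr.rel)             # cases above the relevance threshold
  n_new <- round(perc * length(rare))      # perc = 1 doubles the rare cases
  if (n_new == 0) return(dat)
  # Draw replicas of rare cases, favouring those with larger weights
  picked <- rare[sample.int(length(rare), n_new, replace = repl,
                            prob = st_wt[rare])]
  rbind(dat, dat[picked, , drop = FALSE])
}

set.seed(1)
toy <- data.frame(x = 1:10, y = c(1, 2, 2, 3, 3, 3, 4, 4, 20, 25))
phi <- c(rep(0.1, 8), 0.9, 1.0)            # only the two extreme y values are relevant
st_wt <- c(rep(1, 8), 0.2, 0.8)            # hypothetical spatio-temporal weights
res <- biased_oversample(toy, phi, st_wt, thr.rel = 0.5, perc = 1)
nrow(res)                                  # 12: two replicas of rare cases were added
```

With `repl = FALSE` this sketch can add at most as many replicas as there are rare cases, so `perc` would have to stay at or below 1.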

Usage

randOverRegress_ST(form, dat, alpha = 0.5, beta = 0.9, rel = "auto",
  thr.rel = 0.5, epsilon = 1e-04, C.perc = "balance", repl = TRUE,
  type = "add", site_id = "site_id", time = "time",
  sites_sf = NULL, lon = NULL, lat = NULL, crs = NULL)

Arguments

form

a model formula

dat

the original training set (with the unbalanced distribution)

alpha

weighting parameter for temporal and spatial re-sampling probabilities. Default 0.5

beta

weighting parameter for spatiotemporal weight and phi for re-sampling probabilities. Default 0.9

rel

relevance determined automatically (default) with uba package or provided by the user

thr.rel

relevance threshold above which a case is considered as belonging to the rare "class"

epsilon

minimum weight to be added to all observations. Default 1E-4

C.perc

A vector containing the over-sampling percentage(s) to apply to all/each "class" (bump) obtained with the relevance threshold. Replicas of the examples are randomly added in each "class". If only one percentage is provided, this value is reused in all the "classes" that have values above the relevance threshold. A different percentage can be provided for each "class"; in this case, the percentages should be given in ascending order of target variable value. The over-sampling percentage(s) should be numbers above 0, meaning that the important cases (cases above the threshold) are over-sampled by the corresponding percentage. For example, if the number 1 is provided, the number of extreme examples is doubled. Alternatively, C.perc may be set to "balance" or "extreme", in which case the over-sampling percentages are estimated automatically to either balance or invert the frequencies of the examples in the "classes" (bumps).

repl

logical indicating whether sampling with replacement is allowed. Default TRUE

type

character string indicating the type of bias used. Default is "add". More types to be added in future work

site_id

the name of the column containing location IDs

time

the column name of the time-stamp

sites_sf

An sf object containing the station IDs and the geometry points of the locations. As an alternative, provide lon, lat, and crs

lon

the name of the column containing the location's longitude

lat

the name of the column containing the location's latitude

crs

the code for the Coordinate Reference System
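How alpha, beta, and epsilon interact is not spelled out in the argument descriptions. One plausible reading, offered here purely as an assumption, is that alpha and beta act as convex-combination weights: alpha mixes a temporal and a spatial weight, beta mixes the relevance (phi) with the resulting spatio-temporal weight when type = "add", and epsilon keeps every weight strictly positive. Which term each parameter multiplies, and the helper name `combine_wts`, are guesses made for illustration; see sample_wts for the actual computation.

```r
# Hypothetical sketch: NOT the package's sample_wts implementation.
# Assumes time_wt, space_wt and phi are all scaled to [0, 1].
combine_wts <- function(time_wt, space_wt, phi,
                        alpha = 0.5, beta = 0.9, epsilon = 1e-4) {
  st_wt <- alpha * time_wt + (1 - alpha) * space_wt   # assumed temporal/spatial mix
  wt <- beta * phi + (1 - beta) * st_wt               # assumed additive ("add") bias
  wt + epsilon                                        # no case is left with zero weight
}

w <- combine_wts(time_wt = c(0, 1, 0.5),
                 space_wt = c(0, 0, 1),
                 phi = c(0, 1, 0.5))
prob <- w / sum(w)   # normalised re-sampling probabilities
```

Under these assumptions the first case, which has zero relevance and zero spatio-temporal weight, still receives the small weight epsilon and so remains eligible for sampling.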

Value

The function returns a data frame with the new data set resulting from the application of the spatio-temporally biased over-sampling strategy.
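A call might look like the sketch below, which passes the coordinates through lon, lat, and crs instead of sites_sf. The toy data, its column names, and the assumption that the package installs under the name "STResampling" are all made up for illustration. Whether such a small data set is enough for the automatic relevance estimation has not been checked, so the call is guarded and only runs if the package is available.

```r
# Toy spatio-temporal data set: 3 sites observed on 4 dates (made-up values)
toy <- expand.grid(site_id = c("s1", "s2", "s3"),
                   time = as.Date("2020-01-01") + 0:3,
                   stringsAsFactors = FALSE)
coords <- data.frame(site_id = c("s1", "s2", "s3"),
                     lon = c(-8.6, -8.4, -9.1),
                     lat = c(41.1, 41.5, 38.7))
toy <- merge(toy, coords, by = "site_id")
toy$x <- seq_len(nrow(toy))
toy$y <- c(rep(1, 10), 30, 40)     # two extreme target values

# Assumption: the package is installed under the name "STResampling".
if (requireNamespace("STResampling", quietly = TRUE)) {
  new_dat <- STResampling::randOverRegress_ST(
    y ~ x, toy, alpha = 0.5, beta = 0.9, C.perc = "balance",
    site_id = "site_id", time = "time",
    lon = "lon", lat = "lat", crs = 4326)   # 4326 = EPSG code for WGS 84
  str(new_dat)                              # inspect the over-sampled data frame
}
```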

References

Paula Branco, Rita P. Ribeiro, Luis Torgo (2016). UBL: an R Package for Utility-Based Learning. CoRR abs/1604.08079 [cs.MS]. URL: http://arxiv.org/abs/1604.08079

See Also

RandOverRegress, sample_wts


mrfoliveira/STResampling-DSAA2019 documentation built on April 9, 2021, 5:39 a.m.