knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
#library(QRMon)
devtools::load_all()

Version 0.4

Introduction

In this document we describe the design and implementation of a (software programming) monad for the specification and execution of Quantile Regression workflows. The implementation is done in R with the package QRMon, [AAp1].

Remarks:

What is Quantile Regression? Assume we have a set of two-dimensional points, each point being a pair of an independent variable value and a dependent variable value. We want to find a curve that is a function of the independent variable and that splits the points in such a way that, say, 30% of the points are above that curve. This is done with Quantile Regression, see [Wk2, RK2, CN1, AA2, AA3]. Quantile Regression is a method to estimate the variable relations for all parts of the distribution (not just, say, the mean of the relationships found with Least Squares Regression).

The goal of the monad design is to make the specification of Quantile Regression workflows (relatively) easy and straightforward by following a certain main scenario and specifying variations over that scenario. Since Quantile Regression is often compared with Least Squares Regression and some type of filtering (like Moving Average), those functionalities should be included in the monad design scenarios. (Currently Least Squares Regression and moving averages are not implemented in the R package; they are implemented in the Mathematica one, [AAp2].)

The monad is named QRMon; it is based on the package magrittr and the Quantile Regression package quantreg, [RKp1, RK1].

The data for this document is provided by the package .

The monadic programming design is used as a Software Design Pattern. The QRMon monad can also be seen as a Domain Specific Language (DSL) for the specification and programming of Quantile Regression workflows.
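To make the description of Quantile Regression in the remarks above concrete, here is a minimal sketch that uses quantreg directly (not QRMon). The synthetic data, the B-spline basis with `df = 6`, and the variable names are assumptions chosen for illustration. A curve with 30% of the points above it corresponds to the probability 0.7.

```r
# Sketch: a regression quantile that leaves about 30% of the points above it.
# Assumes the packages quantreg and splines are installed; the data is synthetic.
library(quantreg)
library(splines)

set.seed(1)
n <- 500
x <- sort(runif(n, min = 0, max = 10))
# The noise spread grows with x, so the data is heteroscedastic.
y <- sin(x) + rnorm(n, sd = 0.2 + 0.1 * x)

# "30% of the points above the curve" corresponds to the 0.7 quantile.
fit <- rq(y ~ bs(x, df = 6), tau = 0.7)

# Points above the fitted curve have positive residuals.
fracAbove <- mean(residuals(fit) > 0)
fracAbove
```

The fraction is close to 0.3 but usually not exactly 0.3, because a few points lie on the fitted curve itself.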

Here is an example of using the QRMon monad over heteroscedastic data:

qrmon <-
  QRMonUnit( setNames( dfDistributionData, c("Regressor", "Value") ) ) %>%
  QRMonEchoDataSummary() %>%
  QRMonQuantileRegression( df = 6 ) %>%
  QRMonPlot( dataPointsColor = "gray70", datePlotQ = TRUE, dateOrigin = "1900-01-01" )

As mentioned above, the monad QRMon can be seen as a DSL. Because of this, the monad pipelines made with QRMon are sometimes called "specifications".

Remark: With "regression quantile" we mean "a curve or function that is computed with Quantile Regression".

Design considerations

The steps of the main regression workflow addressed in this document follow.

  1. Retrieve data from a data repository.

  2. Optionally, transform the data.

    1. Delete rows with missing fields.

    2. Rescale data along one or both of the axes.

    3. Apply moving average (or median, or map).

  3. Verify assumptions of the data.

  4. Run a regression algorithm with a certain basis of functions using:

    1. Quantile Regression, or

    2. Least Squares Regression.

  5. Visualize the data and regression functions.

  6. If the fit of the regression functions is not satisfactory, go to step 4.

  7. Utilize the found regression functions to compute:

    1. outliers,

    2. local extrema,

    3. approximation or fitting errors,

    4. conditional density distributions,

    5. time series simulations.

The following flow-chart corresponds to the list of steps above.

[Figure: Quantile regression workflow flow chart]
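Some of the workflow steps listed above can be sketched with quantreg directly. This is only an illustration under assumptions: the data is synthetic, the B-spline basis size is arbitrary, and the sketch covers steps 1, 2.1, 4.1, and 7.3 without any QRMon functions.

```r
# Sketch of workflow steps 1, 2.1, 4.1, and 7.3 with quantreg (not QRMon).
library(quantreg)
library(splines)

# Step 1: obtain data (synthetic here, with a few missing values).
set.seed(42)
dfData <- data.frame(Regressor = seq(0, 10, length.out = 400))
dfData$Value <- cos(dfData$Regressor) + rnorm(400, sd = 0.3)
dfData$Value[c(5, 50, 200)] <- NA

# Step 2.1: delete rows with missing fields.
dfData <- dfData[complete.cases(dfData), ]

# Step 4.1: Quantile Regression over a B-spline basis for several probabilities.
probs <- c(0.25, 0.5, 0.75)
frm <- Value ~ bs(Regressor, df = 8)
fit <- rq(frm, tau = probs, data = dfData)

# Evaluate the regression quantiles at the data points:
# design matrix times coefficient matrix, one column per probability.
X <- model.matrix(frm, data = dfData)
predMat <- X %*% coef(fit)

# Step 7.3: approximation errors with respect to the median curve.
medianErrors <- dfData$Value - predMat[, 2]
summary(medianErrors)
```

If the fit looks unsatisfactory (step 6), the natural knobs in this sketch are the basis size `df` and the set of probabilities.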

In order to address:

it is beneficial to have a DSL for regression workflows. We choose to make such a DSL through a functional programming monad, [Wk1, AA1].

Here is a quote from [Wk1] that describes fairly well why we choose to make a regression workflow monad and hints at the desired properties of such a monad.

[...] The monad represents computations with a sequential structure: a monad defines what it means to chain operations together. This enables the programmer to build pipelines that process data in a series of steps (i.e. a series of actions applied to the data), in which each action is decorated with the additional processing rules provided by the monad. [...] Monads allow a programming style where programs are written by putting together highly composable parts, combining in flexible ways the possible actions that can work on a particular type of data. [...]

Remark: Note that the quote from [Wk1] refers to chained monadic operations as "pipelines". We use the terms "monad pipeline" and "pipeline" below.
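The idea of chaining operations over a wrapped object can be illustrated with a toy example. The functions `ToyUnit`, `ToyDropMissing`, `ToyComputeMean`, and `ToyTakeValue` below are hypothetical names made up for this sketch; they are not part of QRMon and do not reflect how QRMon is implemented.

```r
# Toy sketch of a monad-like pipeline (hypothetical names, not QRMon code).
library(magrittr)

# The "unit" function wraps the data into the object carried by the pipeline:
# a list with a pipeline value (Value) and a context (Data).
ToyUnit <- function(data) {
  list(Value = NULL, Data = data)
}

# Each step takes the object and returns the object, so steps can be chained.
ToyDropMissing <- function(obj) {
  obj$Data <- obj$Data[complete.cases(obj$Data), , drop = FALSE]
  obj
}

ToyComputeMean <- function(obj, column) {
  obj$Value <- mean(obj$Data[[column]])
  obj
}

# A "take" function leaves the pipeline and returns the plain value.
ToyTakeValue <- function(obj) obj$Value

res <-
  ToyUnit(data.frame(Regressor = 1:5, Value = c(2, 4, NA, 8, 10))) %>%
  ToyDropMissing() %>%
  ToyComputeMean("Value") %>%
  ToyTakeValue()
res
```

Because every step has the same input and output type, steps can be added, removed, or reordered without changing the rest of the pipeline.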

Detect outliers

qrmon <-
  QRMonUnit( dfTemperatureData ) %>%
  QRMonQuantileRegression( df = 16, degree = 3, probabilities = c(0.01,0.99) ) %>%
  QRMonOutliers() %>%
  QRMonOutliersPlot( datePlotQ = TRUE )
res <- qrmon %>% QRMonTakeOutliers()
names(res)
res[["topOutliers"]] %>% dplyr::mutate( Regressor = as.POSIXct(Regressor, origin="1900-01-01"))
res[["bottomOutliers"]] %>% dplyr::mutate( Regressor = as.POSIXct(Regressor, origin="1900-01-01"))
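The pipeline above fits the 0.01 and 0.99 regression quantiles before asking for outliers. A natural reading, stated here as an assumption and not as a description of QRMon internals, is that points above the 0.99 curve are top outliers and points below the 0.01 curve are bottom outliers. The sketch below shows that idea with quantreg on synthetic data.

```r
# Sketch: outliers as points outside the 0.01 and 0.99 regression quantiles.
# Uses quantreg directly on synthetic data; this is not QRMon code.
library(quantreg)
library(splines)

set.seed(7)
n <- 600
dfData <- data.frame(Regressor = seq(0, 20, length.out = n))
dfData$Value <- 10 + 5 * sin(dfData$Regressor / 3) + rnorm(n)

frm <- Value ~ bs(Regressor, df = 16, degree = 3)
fit <- rq(frm, tau = c(0.01, 0.99), data = dfData)

# Column 1 is the 0.01 quantile curve, column 2 is the 0.99 quantile curve.
curves <- model.matrix(frm, data = dfData) %*% coef(fit)

bottomOutliers <- dfData[dfData$Value < curves[, 1], ]
topOutliers    <- dfData[dfData$Value > curves[, 2], ]
c(bottom = nrow(bottomOutliers), top = nrow(topOutliers))
```

With these probabilities only a small share of the points can end up in either set, which is what makes the two extreme regression quantiles usable as outlier boundaries.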

Dependent variable simulation (simulate weather data)

Consider the problem of making a time series that simulates a process for which we have a known time series.

More formally,

The formulation of the problem hints at an (almost) straightforward implementation using Quantile Regression.

qrmon <-
  QRMonUnit( dfTemperatureData ) %>% 
  QRMonSetRegressionObjects(NULL) %>% 
  QRMonQuantileRegression( df = 12, degree = 3, probabilities = c(0.01, 1:9/10, 0.99) ) %>% 
  QRMonPlot(dataPointsColor = "gray70", datePlotQ = TRUE, dateOrigin = "1900-01-01")

Plot original and simulated data:

library(ggplot2)  # provides ggplot(), geom_line(), facet_wrap() used below
set.seed(2223)
qDF <- rbind( cbind( Type = "Original", qrmon %>% QRMonTakeData() ),
              cbind( Type = "Simulated.1", as.data.frame( qrmon %>% QRMonSimulate(1000) %>% QRMonTakeValue() )),
              cbind( Type = "Simulated.2", as.data.frame( qrmon %>% QRMonSimulate(1000) %>% QRMonTakeValue() )),
              cbind( Type = "Simulated.3", as.data.frame( qrmon %>% QRMonSimulate(1000) %>% QRMonTakeValue() ))
              )
ggplot( qDF ) +
  geom_line( aes( x = as.POSIXct(Regressor, origin = "1900-01-01"), y = Value ), color = "lightblue" ) +
  xlab("Time") + ylab("Temperature") +
  facet_wrap( ~Type, ncol=1)
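One plausible way to simulate from fitted regression quantiles is inverse transform sampling: at each regressor value, draw a uniform probability and interpolate between the quantile curves. The sketch below does this with quantreg on synthetic data. It is an assumption about how such a simulation can be done, not a description of what `QRMonSimulate` does internally.

```r
# Sketch: simulate a series by sampling between fitted regression quantiles.
# Uses quantreg directly on synthetic data; this is not QRMon code.
library(quantreg)
library(splines)

set.seed(11)
n <- 800
dfData <- data.frame(Regressor = seq(0, 30, length.out = n))
dfData$Value <- 15 + 8 * sin(dfData$Regressor / 4) + rnorm(n, sd = 2)

probs <- c(0.01, 1:9 / 10, 0.99)
frm <- Value ~ bs(Regressor, df = 12, degree = 3)
fit <- rq(frm, tau = probs, data = dfData)

# Quantile curves at the original regressor values, one column per probability.
curves <- model.matrix(frm, data = dfData) %*% coef(fit)

# For each point draw a probability and interpolate linearly between the curves.
# Sorting each row guards against occasional crossing of the fitted quantiles.
u <- runif(n, min = min(probs), max = max(probs))
simulated <- vapply(seq_len(n), function(i) {
  approx(x = probs, y = sort(curves[i, ]), xout = u[i])$y
}, numeric(1))

dfSim <- data.frame(Regressor = dfData$Regressor, Value = simulated)
head(dfSim)
```

The simulated values stay between the lowest and highest fitted quantile curves, so the tails beyond the 0.01 and 0.99 probabilities are not reproduced by this sketch.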

Conditional CDF

With the fitted regression quantiles we can compute the conditional CDF for a given regressor value.

Here we select a random point and get the corresponding conditional CDF function:

set.seed(7493)
resCDF <-
  qrmon %>% 
  QRMonConditionalCDF( sample(dfTemperatureData$Time,1) ) %>% 
  QRMonTakeValue

Here we plot the CDF:

temps <- seq(0,40,0.5)
qDF <- data.frame( Temperature = temps, Probability = purrr::map_dbl( temps, resCDF), stringsAsFactors = FALSE)
qDF <- qDF[complete.cases(qDF), ]
ggplot(qDF) +
  geom_line( aes( x = Temperature, y = Probability ) ) +
  labs( title = paste( "CDF for time point:", as.POSIXct(as.numeric(names(resCDF)), origin = "1900-01-01") ) )
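The relation between regression quantiles and the conditional CDF can be sketched as follows: at a fixed regressor value, each fitted curve gives the value at which the CDF reaches that curve's probability, so interpolating over those pairs approximates the CDF. The code uses quantreg on synthetic data; the regressor value `x0` and all names are assumptions for illustration, and this is not the QRMon implementation.

```r
# Sketch: conditional CDF at a regressor value from fitted regression quantiles.
# Uses quantreg directly on synthetic data; this is not QRMon code.
library(quantreg)
library(splines)

set.seed(3)
n <- 800
dfData <- data.frame(Regressor = seq(0, 30, length.out = n))
dfData$Value <- 15 + 8 * sin(dfData$Regressor / 4) + rnorm(n, sd = 2)

probs <- c(0.01, 1:9 / 10, 0.99)
basis <- bs(dfData$Regressor, df = 12, degree = 3)
fit <- rq(dfData$Value ~ basis, tau = probs)

# Evaluate the quantile curves at the chosen regressor value x0.
x0 <- 10
b0 <- predict(basis, newx = x0)  # row of basis function values at x0
qvals <- as.numeric(c(1, as.numeric(b0)) %*% coef(fit))

# The CDF maps the quantile values to their probabilities. It returns NA
# outside the range spanned by the fitted quantiles.
cdf <- approxfun(x = sort(qvals), y = probs, ties = "ordered")
cdf(c(16, 20, 24))
```

The resolution of the approximated CDF is limited by the number of probabilities used in the fit; more probabilities give a finer CDF at the cost of more fitted curves.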

References

Packages

[AAp1] Anton Antonov, "Quantile Regression workflows monad in R", (2018), QRMon-R at GitHub.

[AAp2] Anton Antonov, "Monadic Quantile Regression Mathematica package", (2018), MathematicaForPrediction at GitHub.

[RKp1] Roger Koenker et al., "quantreg: Quantile Regression", (2018).

MathematicaForPrediction articles

[AA1] Anton Antonov, "Monad code generation and extension", (2017), MathematicaForPrediction at GitHub.

[AA2] Anton Antonov, "Quantile regression through linear programming", (2013), MathematicaForPrediction at WordPress.

[AA3] Anton Antonov, "Quantile regression with B-splines", (2014), MathematicaForPrediction at WordPress.

[AA4] Anton Antonov, "Estimation of conditional density distributions", (2014), MathematicaForPrediction at WordPress.

[AA5] Anton Antonov, "Finding local extrema in noisy data using Quantile Regression", (2015), MathematicaForPrediction at WordPress.

[AA6] Anton Antonov, "A monad for Quantile Regression workflows", (2018), MathematicaForPrediction at WordPress.

Other

[Wk1] Wikipedia entry, Monad.

[Wk2] Wikipedia entry, Quantile Regression.

[Wk3] Wikipedia entry, Chebyshev polynomials.

[CN1] Brian S. Cade and Barry R. Noon, "A gentle introduction to quantile regression for ecologists", (2003). Frontiers in Ecology and the Environment. 1 (8): 412-420. doi:10.2307/3868138.

[RK1] Roger Koenker, "Quantile Regression in R: a vignette", (2018).

[RK2] Roger Koenker, Quantile Regression, Cambridge University Press, 2005.



antononcube/QRMon-R documentation built on July 26, 2021, 1:07 p.m.