madlib.arima: Wrapper for MADlib's ARIMA model fitting function
In PivotalR: A Fast, Easy-to-Use Tool for Manipulating Tables in Databases and a Wrapper of MADlib

Description Usage Arguments Details Value Author(s) References See Also Examples

Apply ARIM model fitting onto a table that contains time series data. The table must have two columns: one for the time series values, and the other for the time stamps. The time stamp can be anything that can be ordered. This is because the rows of a table does not have inherent order and thus needs to be ordered by the extra time stamp column.

## S4 method for signature 'db.Rquery,db.Rquery'
madlib.arima(x, ts, by = NULL,
order=c(1,1,1), seasonal = list(order = c(0,0,0), period = NA),
include.mean = TRUE, method = "CSS", optim.method = "LM",
optim.control = list(), ...)

## S4 method for signature 'formula,db.obj'
madlib.arima(x, ts, order=c(1,1,1),
seasonal = list(order = c(0,0,0), period = NA), include.mean = TRUE,
method = "CSS", optim.method = "LM", optim.control = list(), ...)

`x`	A formula with the format of `time series value` `~` `time stamp` `\|` `grouping col_1 + ... + grouping col_n`. Or a `db.Rquery` object, which is the `time series value`. Grouping is not implemented yet. Both time stamp and time series can be valid expressions. We must specify the time stamp because the table in database has no order of rows, and we have to order they according the given time stamps.
`ts`	If `x` is a formula object, this must be a `db.obj` object, which contains both the time series and time stamp columns. If `x` is a `db.Rquery` object, this must be another `db.Rquery` object, which is the `time stamp` and can be a valid expression.
`by`	A list of `db.Rquery`, the default is NULL. The grouping columns. Right now, this functionality is not implemented yet.
`order`	A vector of 3 integers, default is `c(1,1,1)`. The ARIMA orders `p, d, q` for AR, I and MA.
`seasonal`	A list of `order` and `perid`, default is `list(order = c(0,0,0), period = NA)`. The seasonal orders and period. Currently not implemented.
`include.mean`	A logical value, default is `TRUE`. Whether to estimate the mean value of the time series. If the integration order `d` (the second element of order) is not zero, `include.mean` is set to `FALSE` in the calculation.
`method`	A string, the fitting method. The default is "CSS", which uses conditional-sum-of-squares to fit the time series. Right now, only "CSS" is supported.
`optim.method`	A string, the optimization method. The default is "LM", the Levenberg-Marquardt algorithm. Right now, only "LM" is supported.
`optim.control`	A list, default is `list()`. The control parameters of the optimizer. For `optim.method="LM"`, it can have the following optional parameters: - max_iter: Maximum number of iterations to run learning algorithm (Default = 100) - tau: Computes the initial step size for gradient algorithm (Default = 0.001) - e1: Algorithm-specific threshold for convergence (Default = 1e-15) - e2: Algorithm-specific threshold for convergence (Default = 1e-15) - e3: Algorithm-specific threshold for convergence (Default = 1e-15) - hessian_delta: Delta parameter to compute a numerical approximation of the Hessian matrix (Default = 1e-6)
`...`	Other optional parameters. Not implemented.

Given a time series of data X, the Autoregressive Integrated Moving Average (ARIMA) model is a tool for understanding and, perhaps, predicting future values in the series. The model consists of three parts, an autoregressive (AR) part, a moving average (MA) part, and an integrated (I) part where an initial differencing step can be applied to remove any non-stationarity in the signal. The model is generally referred to as an ARIMA(p, d, q) model where parameters p, d, and q are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.

MADlib's ARIMA function implements a parallel version of the LM algorithm to maximize the conditional log-likelihood, which is suitable for big data.

Returns an arima.css.madlib object, which is a list that contains the following items:

`coef`	A vector of double values. The fitting coefficients of AR, MA and mean value (if `include.mean` is `TRUE`).
`s.e.`	A vector of double values. The standard errors of the fitting coefficients.
`series`	A string, the data source table or SQL query.
`time.stamp`	A string, the name of the time stamp column.
`time.series`	A string, the name of the time series column.
`sigma2`	the MLE of the innovations variance.
`loglik`	the maximized conditional log-likelihood (of the differenced data).
`iter.num`	An integer, how many iterations of the LM algorithm is used to fit the time series with ARIMA model.
`exec.time`	The time spent on the MADlib ARIMA fitting.
`residuals`	A `db.data.frame` object that points to the table that contains all the fitted innovations.
`model`	A `db.data.frame` object that points to the table that contains the coefficients and standard error. This table is needed by `predict.arima.css.madlib`.
`statistics`	A `db.data.frame` object that points to the table that contains information including log-likelihood, sigma^2 etc. This table is needed by `predict.arima.css.madlib`.
`call`	A language object. The matched function call.

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc. fmcquillan@pivotal.io

[1] Rob J Hyndman and George Athanasopoulos: Forecasting: principles and practice, https://otexts.com/fpp/

[2] Robert H. Shumway, David S. Stoffer: Time Series Analysis and Its Applications With R Examples, Third edition Springer Texts in Statistics, 2010

[3] Henri Gavin: The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems, 2011

madlib.lm, madlib.glm, madlib.summary are MADlib wrapper functions.

delete deletes the result of this function together with the model, residual and statistics tables.

print.arima.css.madlib, show.arima.css.madlib and summary.arima.css.madlib prints the result in a pretty format.

predict.arima.css.madlib makes forecast of the time series based upon the result of this function.

## Not run: 
library(PivotalR)


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

## use double values as the time stamp
## Any values that can be ordered will work
example_time_series <- data.frame(id =
                       seq(0,1000,length.out=length(ts)),
                       val = arima.sim(list(order=c(2,0,1), ar=c(0.7,
                             -0.3), ma=0.2), n=1000000) + 3.2)

x <- as.db.data.frame(example_time_series, field.types = list(id="double
     precision", val = "double precision"), conn.id = cid)

dim(x)

names(x)

## use formula
s <- madlib.arima(val ~ id, x, order = c(2,0,1))

s

## delete s and the 3 tables: model, residuals and statistics
delete(s)

s # s does not exist any more

## do not use formula
s <- madlib.arima(x$val, x$id, order = c(2,0,1))

s

lookat(sort(s$residuals, F, s$residuals$tstamp), 10)

lookat(s$model)

lookat(s$statistics)

## 10 forecasts
pred <- predict(s, n.ahead = 10)

lookat(sort(pred, F, pred$step_ahead), "all")

## Use expressions
s <- madlib.arima(val+2 ~ I(id + 1), x, order = c(2,0,1))

db.disconnect(cid, verbose = FALSE)

## End(Not run)