madlib.arima: Wrapper for MADlib's ARIMA model fitting function


Description

Apply ARIMA model fitting to a table that contains time series data. The table must have two columns: one for the time series values and one for the time stamps. The time stamp can be anything that can be ordered. It is required because the rows of a table have no inherent order and must therefore be ordered by the time stamp column.

Usage

## S4 method for signature 'db.Rquery,db.Rquery'
madlib.arima(x, ts, by = NULL,
order=c(1,1,1), seasonal = list(order = c(0,0,0), period = NA),
include.mean = TRUE, method = "CSS", optim.method = "LM",
optim.control = list(), ...)

## S4 method for signature 'formula,db.obj'
madlib.arima(x, ts, order=c(1,1,1),
seasonal = list(order = c(0,0,0), period = NA), include.mean = TRUE,
method = "CSS", optim.method = "LM", optim.control = list(), ...)

Arguments

x

A formula of the form time series value ~ time stamp | grouping col_1 + ... + grouping col_n, or a db.Rquery object that holds the time series values. Grouping is not implemented yet. Both the time stamp and the time series can be valid expressions.

The time stamp must be specified because the rows of a database table have no inherent order, so they have to be ordered by the given time stamps.

ts

If x is a formula object, this must be a db.obj object, which contains both the time series and time stamp columns. If x is a db.Rquery object, this must be another db.Rquery object, which is the time stamp and can be a valid expression.

by

A list of db.Rquery objects, default is NULL. The grouping columns. This functionality is not implemented yet.

order

A vector of 3 integers, default is c(1,1,1). The ARIMA orders p, d, q for AR, I and MA.

seasonal

A list of order and period, default is list(order = c(0,0,0), period = NA). The seasonal orders and period. Currently not implemented.

include.mean

A logical value, default is TRUE. Whether to estimate the mean value of the time series. If the integration order d (the second element of order) is not zero, include.mean is set to FALSE in the calculation.

method

A string, the fitting method. The default is "CSS", which uses conditional-sum-of-squares to fit the time series. Right now, only "CSS" is supported.

optim.method

A string, the optimization method. The default is "LM", the Levenberg-Marquardt algorithm. Right now, only "LM" is supported.

optim.control

A list, default is list(). The control parameters of the optimizer. For optim.method="LM", it can have the following optional parameters:

- max_iter: Maximum number of iterations to run learning algorithm (Default = 100)

- tau: Computes the initial step size for gradient algorithm (Default = 0.001)

- e1: Algorithm-specific threshold for convergence (Default = 1e-15)

- e2: Algorithm-specific threshold for convergence (Default = 1e-15)

- e3: Algorithm-specific threshold for convergence (Default = 1e-15)

- hessian_delta: Delta parameter to compute a numerical approximation of the Hessian matrix (Default = 1e-6)

...

Other optional parameters. Not implemented.
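As a rough sketch of how the optimizer settings above could be passed, the snippet below builds an optim.control list and wraps the call in a helper. The parameter names are the ones documented for optim.method = "LM"; the helper name fit_arima_lm, the value chosen for max_iter, and the column names val and id (borrowed from the Examples section) are assumptions made for illustration only.

```r
# Control parameters for optim.method = "LM". The names are taken from the
# optim.control description; max_iter = 200 is an illustrative override,
# the remaining values repeat the documented defaults.
lm_control <- list(max_iter = 200, tau = 0.001,
                   e1 = 1e-15, e2 = 1e-15, e3 = 1e-15,
                   hessian_delta = 1e-6)

# Hypothetical helper: 'data' is assumed to be a db.obj with columns
# 'val' (time series) and 'id' (time stamp), as in the Examples section.
# Nothing touches the database until the helper is actually called.
fit_arima_lm <- function(data, control = lm_control) {
  PivotalR::madlib.arima(val ~ id, data, order = c(2, 0, 1),
                         method = "CSS", optim.method = "LM",
                         optim.control = control)
}
```

With a live connection and a table x as created in the Examples section, fit_arima_lm(x) would run the fit with these settings.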

Details

Given a time series of data X, the Autoregressive Integrated Moving Average (ARIMA) model is a tool for understanding and, perhaps, predicting future values in the series. The model consists of three parts, an autoregressive (AR) part, a moving average (MA) part, and an integrated (I) part where an initial differencing step can be applied to remove any non-stationarity in the signal. The model is generally referred to as an ARIMA(p, d, q) model where parameters p, d, and q are non-negative integers that refer to the order of the autoregressive, integrated, and moving average parts of the model respectively.
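For reference, one common textbook way to write the ARIMA(p, d, q) model uses the backshift operator B, defined by B X_t = X_{t-1}. The symbols \phi_i (AR coefficients), \theta_j (MA coefficients), \mu (mean) and \varepsilon_t (innovations) are notation assumed here and do not come from the text above; sign conventions for the MA part vary between references, and MADlib's internal parameterization may differ.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^{i}\right) (1 - B)^{d} \,(X_t - \mu)
  = \left(1 + \sum_{j=1}^{q} \theta_j B^{j}\right) \varepsilon_t
```

Since (1 - B)^d applied to a constant is zero whenever d >= 1, the mean \mu drops out of the differenced model, which is consistent with include.mean being set to FALSE when d is not zero.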

MADlib's ARIMA function implements a parallel version of the LM algorithm to maximize the conditional log-likelihood. Because the computation runs in parallel inside the database, it is suitable for big data.
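To get a feel for conditional-sum-of-squares fitting without a database, the following sketch fits the same kind of model locally with base R's stats::arima. It is only an analogue: stats::arima uses its own optimizer rather than MADlib's parallel LM algorithm, and the seed and series length are arbitrary choices. Under Gaussian innovations, minimizing the conditional sum of squares is equivalent to maximizing the conditional log-likelihood.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulate an ARMA(2,1) series with mean 3.2, mirroring the Examples
# section but much shorter so that it runs quickly in memory.
y <- arima.sim(list(order = c(2, 0, 1), ar = c(0.7, -0.3), ma = 0.2),
               n = 5000) + 3.2

# Local CSS fit with base R. This is NOT MADlib: the optimizer differs.
local_fit <- arima(y, order = c(2, 0, 1), include.mean = TRUE,
                   method = "CSS")

coef(local_fit)     # AR, MA and mean ("intercept") estimates
local_fit$sigma2    # estimated innovations variance
local_fit$loglik    # conditional log-likelihood
```

The estimates from madlib.arima on the same data should be broadly comparable, although exact agreement is not expected because the optimizers and convergence thresholds differ.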

Value

Returns an arima.css.madlib object, which is a list that contains the following items:

coef

A vector of double values. The fitted coefficients of the AR and MA parts and the mean value (if include.mean is TRUE).

s.e.

A vector of double values. The standard errors of the fitted coefficients.

series

A string, the data source table or SQL query.

time.stamp

A string, the name of the time stamp column.

time.series

A string, the name of the time series column.

sigma2

The MLE of the innovations variance.

loglik

The maximized conditional log-likelihood (of the differenced data).

iter.num

An integer, the number of iterations of the LM algorithm used to fit the ARIMA model to the time series.

exec.time

The time spent on the MADlib ARIMA fitting.

residuals

A db.data.frame object that points to the table that contains all the fitted innovations.

model

A db.data.frame object that points to the table that contains the coefficients and standard errors. This table is needed by predict.arima.css.madlib.

statistics

A db.data.frame object that points to the table that contains information including log-likelihood, sigma^2 etc. This table is needed by predict.arima.css.madlib.

call

A language object. The matched function call.
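The coef and s.e. items can be combined into a small summary table. The helper below is a hypothetical convenience function, not part of PivotalR; it relies only on the two list items described above and makes no assumption about whether the vectors carry names.

```r
# Hypothetical helper (not part of PivotalR): tabulate estimates, standard
# errors and their ratio for any list that has 'coef' and 's.e.' items,
# such as the object returned by madlib.arima.
coef_table <- function(fit) {
  est <- as.numeric(fit$coef)
  se  <- as.numeric(fit$s.e.)
  stopifnot(length(est) == length(se))
  data.frame(estimate = est, std.error = se, z = est / se)
}
```

For a fitted object s from madlib.arima, coef_table(s) would return one row per coefficient.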

Author(s)

Author: Predictive Analytics Team at Pivotal Inc.

Maintainer: Frank McQuillan, Pivotal Inc.

References

[1] Rob J Hyndman and George Athanasopoulos: Forecasting: principles and practice, http://otexts.com/fpp/

[2] Robert H. Shumway, David S. Stoffer: Time Series Analysis and Its Applications With R Examples, Third edition Springer Texts in Statistics, 2010

[3] Henri Gavin: The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems, 2011

See Also

madlib.lm, madlib.glm, madlib.summary are MADlib wrapper functions.

delete deletes the result of this function together with the model, residual and statistics tables.

print.arima.css.madlib, show.arima.css.madlib and summary.arima.css.madlib print the result in a pretty format.

predict.arima.css.madlib forecasts the time series based on the result of this function.

Examples

## Not run: 
library(PivotalR)


## set up the database connection
## Assume that .port is port number and .dbname is the database name
cid <- db.connect(port = .port, dbname = .dbname, verbose = FALSE)

## use double values as the time stamp
## Any values that can be ordered will work
## simulate the series first so that id gets one value per observation
val <- arima.sim(list(order=c(2,0,1), ar=c(0.7, -0.3), ma=0.2),
                 n=1000000) + 3.2
example_time_series <- data.frame(id = seq(0, 1000, length.out=length(val)),
                                  val = val)

x <- as.db.data.frame(example_time_series,
     field.types = list(id = "double precision", val = "double precision"),
     conn.id = cid)

dim(x)

names(x)

## use formula
s <- madlib.arima(val ~ id, x, order = c(2,0,1))

s

## delete s and the 3 tables: model, residuals and statistics
delete(s)

s # s does not exist any more

## do not use formula
s <- madlib.arima(x$val, x$id, order = c(2,0,1))

s

lookat(sort(s$residuals, FALSE, s$residuals$tstamp), 10)

lookat(s$model)

lookat(s$statistics)

## 10 forecasts
pred <- predict(s, n.ahead = 10)

lookat(sort(pred, FALSE, pred$step_ahead), "all")

## Use expressions
s <- madlib.arima(val+2 ~ I(id + 1), x, order = c(2,0,1))

db.disconnect(cid, verbose = FALSE)

## End(Not run)

PivotalR documentation built on May 30, 2017, 8:18 a.m.