run_reg: Perform regressions on the streaming time series of...

View source: R/RcppExports.R

run_reg    R Documentation

Perform regressions on the streaming time series of response and predictor data, and calculate the regression coefficients, the residuals, and the forecasts, using online recursive formulas.

Description

Perform regressions on the streaming time series of response and predictor data, and calculate the regression coefficients, the residuals, and the forecasts, using online recursive formulas.

Usage

run_reg(respv, predm, lambda, controlv)

Arguments

respv

A single-column time series or a single-column matrix of response data.

predm

A time series or a matrix of predictor data.

lambda

A decay factor which multiplies past estimates.

controlv

A list of model parameters (see Details).

Details

The function run_reg() performs regressions on the streaming time series of response r_t and predictor p_t data:

r_t = \beta_t p_t + \epsilon_t

Where \beta_t are the trailing regression coefficients and \epsilon_t are the residuals.

It recursively updates the covariance matrix {cov}_t between the response and the predictor data, and the covariance matrix {cov}_{pt} between the predictors, using the decay factor \lambda:

{cov}_t = \lambda {cov}_{t-1} + (1-\lambda) r^T_t p_t

{cov}_{p t} = \lambda {cov}_{p (t-1)} + (1-\lambda) p^T_t p_t

It calculates the regression coefficients \beta_t as the covariance matrix between the response and the predictor data {cov}_t, multiplied by the inverse of the covariance matrix between the predictors {cov}_{pt}:

\beta_t = {cov}_t \, {cov}^{-1}_{p t}
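
For illustration, below is a minimal R sketch of these recursions on simulated data (the variable names are illustrative, and MASS::ginv() is used for the matrix inverse, as in the replication code in the Examples):

# Illustrative sketch of the covariance and beta recursions on simulated data
lambda <- 0.9
lambda1 <- (1-lambda)
nrows <- 100
respv <- matrix(rnorm(nrows), ncol=1)        # simulated response
predm <- cbind(rep(1, nrows), rnorm(nrows))  # simulated predictors with unit intercept
covrespred <- respv[1, ]*predm[1, ]          # covariance between response and predictors
covpred <- outer(predm[1, ], predm[1, ])     # covariance matrix between the predictors
betas <- matrix(0, nrow=nrows, ncol=NCOL(predm))
for (it in 2:nrows) {
  covrespred <- lambda*covrespred + lambda1*respv[it, ]*predm[it, ]
  covpred <- lambda*covpred + lambda1*outer(predm[it, ], predm[it, ])
  betas[it, ] <- covrespred %*% MASS::ginv(covpred)
} # end for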

It calculates the residuals \epsilon_t as the difference between the response r_t and the fitted values \beta_t p_t:

\epsilon_t = r_t - \beta_t p_t

And the trailing mean \bar{\epsilon}_t and variance \sigma^2_t of the residuals as:

\bar{\epsilon}_t = \lambda \bar{\epsilon}_{t-1} + (1-\lambda) \epsilon_t

\sigma^2_t = \lambda \sigma^2_{t-1} + (1-\lambda) (\epsilon_t - \bar{\epsilon}_t)^2
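
Continuing the illustrative sketch above, the residual recursions can be written as (residm and residv hold the trailing mean and variance of the residuals, initialized to zero before the loop):

resids <- numeric(nrows)
residm <- 0  # trailing mean of the residuals
residv <- 0  # trailing variance of the residuals
# Inside the recursion loop:
resids[it] <- respv[it, ] - (betas[it, ] %*% predm[it, ])
residm <- lambda*residm + lambda1*resids[it]
residv <- lambda*residv + lambda1*(resids[it] - residm)^2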

It then calculates the regression forecasts f_t as the past regression coefficients \beta_{t-1} times the current predictor data p_t:

f_t = \beta_{t-1} p_t

It finally calculates the forecast errors as the difference between the response and the regression forecasts: r_t - f_t.
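
In the same sketch, the forecasts and the forecast errors can be calculated as follows (fcasts and fcerrors are hypothetical names used only for this illustration):

fcasts <- numeric(nrows)    # regression forecasts
fcerrors <- numeric(nrows)  # forecast errors
# Inside the recursion loop, using the betas from the previous step:
fcasts[it] <- betas[it-1, ] %*% predm[it, ]
fcerrors[it] <- respv[it, ] - fcasts[it]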

The coefficient matrix \beta and the residuals \epsilon have the same number of rows as the predictor argument predm.

The function run_reg() accepts a list of regression model parameters through the argument controlv. The argument controlv contains the parameters regmod and residscale. Below is a description of how these parameters work. The list of model parameters can be created using the function param_reg().

The number of regression coefficients is equal to the number of columns of the predictor matrix n. If the predictor matrix contains a unit intercept column then the first regression coefficient is equal to the alpha value \alpha.

If regmod = "least_squares" (the default) then it performs the standard least squares regression. This is currently the only option.
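
For example, the list of model parameters used in the Examples below selects the default least squares regression and standardized residuals:

# Create a list of model parameters with standardized residuals
controlv <- HighFreq::param_reg(residscale="standardize")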

The residuals and the forecast errors may be scaled by their volatilities to obtain the z-scores. The default is residscale = "none" - no scaling. If the argument residscale = "scale" then the residuals \epsilon_t are divided by their volatilities \sigma_t without subtracting their means:

\epsilon_t = \frac{\epsilon_t}{\sigma_t}

If the argument residscale = "standardize" then the residual means \bar{\epsilon} are subtracted from the residuals, and then they are divided by their volatilities \sigma_t:

\epsilon_t = \frac{\epsilon_t - \bar{\epsilon}}{\sigma_t}

These standardized residuals are equal to the z-scores.

The forecast errors are scaled in the same way as the residuals, according to the argument residscale.
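
In terms of the trailing estimates residm and residv from the sketch above, the two scaling options correspond to (only one of them is applied, depending on residscale):

# residscale = "scale": divide the residual by its volatility
resids[it] <- resids[it]/sqrt(residv)
# residscale = "standardize": subtract the trailing mean, then divide by the volatility
resids[it] <- (resids[it] - residm)/sqrt(residv)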

The above online recursive formulas are convenient for processing live streaming data because they don't require maintaining a buffer of past data. The above recursive formulas are equivalent to a convolution with exponentially decaying weights, but they're much faster to calculate. Using exponentially decaying weights is more natural than using a sliding look-back interval, because it gradually "forgets" about the past data.

The value of the decay factor \lambda must be in the range between 0 and 1. If \lambda is close to 1 then the decay is weak and past values have a greater weight, so the trailing values have a greater dependence on past data. This is equivalent to a long look-back interval. If \lambda is much less than 1 then the decay is strong and past values have a smaller weight, so the trailing values have a weaker dependence on past data. This is equivalent to a short look-back interval.
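
For example, the recursive trailing mean of a single data series is equal (up to a negligible weight on the initial value) to a weighted sum of the past data with exponentially decaying weights:

# Compare the recursion with a weighted sum using exponentially decaying weights
lambda <- 0.9
lambda1 <- (1-lambda)
nrows <- 1000
datav <- rnorm(nrows)
# Recursive trailing mean
meanv <- numeric(nrows)
meanv[1] <- datav[1]
for (it in 2:nrows) {
  meanv[it] <- lambda*meanv[it-1] + lambda1*datav[it]
} # end for
# Weighted sum with exponentially decaying weights
weightv <- lambda1*lambda^((nrows-1):0)
all.equal(meanv[nrows], sum(weightv*datav))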

The function run_reg() returns multiple columns of data, with the same number of rows as the predictor matrix predm. If the predictor matrix predm has n columns then run_reg() returns a matrix with n+2 columns. The first n columns contain the regression coefficients (with the first column equal to the alpha value \alpha). The last 2 columns are the regression residuals and the forecast errors.
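
Following this column layout, the outputs can be extracted as follows (a sketch assuming regs is the matrix returned by run_reg() and predm is the predictor matrix):

# Extract the outputs according to the column layout above
ncols <- NCOL(predm)
betas <- regs[, 1:ncols]     # regression coefficients
resids <- regs[, ncols+1]    # regression residuals
fcerrors <- regs[, ncols+2]  # forecast errors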

Value

A matrix with the regression coefficients, the residuals, and the forecast errors (in that order - see Details), with the same number of rows as the predictor argument predm.

Examples

## Not run: 
# Calculate historical returns
retp <- na.omit(rutils::etfenv$returns[, c("XLF", "VTI", "IEF")])
# Response equals XLF returns
respv <- retp[, 1]
# Predictor matrix equals VTI and IEF returns
predm <- retp[, -1]
# Add unit intercept column to the predictor matrix
predm <- cbind(rep(1, NROW(predm)), predm)
# Define the decay factor
lambdaf <- 0.9
# Create a list of regression parameters
controlv <- HighFreq::param_reg(residscale="standardize")
# Calculate the trailing regressions
regs <- HighFreq::run_reg(respv=respv, predm=predm, lambda=lambdaf, controlv=controlv)
# Plot the trailing residuals
datav <- cbind(cumsum(respv), regs[, NCOL(regs)])
colnames(datav) <- c("XLF", "residuals")
colnamev <- colnames(datav)
dygraphs::dygraph(datav["2008/2009"], main="Residuals of XLF Versus VTI and IEF") %>%
  dyAxis("y", label=colnamev[1], independentTicks=TRUE) %>%
  dyAxis("y2", label=colnamev[2], independentTicks=TRUE) %>%
  dySeries(axis="y", strokeWidth=2, col="blue") %>%
  dySeries(axis="y2", strokeWidth=2, col="red") %>%
  dyLegend(show="always", width=300)

# Calculate the trailing regressions using R code
lambda1 <- (1-lambdaf)
respv <- zoo::coredata(respv)
predm <- zoo::coredata(predm)
nrows <- NROW(predm)
ncols <- NCOL(predm)
covrespred <- respv[1, ]*predm[1, ]
covpred <- outer(predm[1, ], predm[1, ])
betas <- matrix(numeric(nrows*ncols), nc=ncols)
betas[1, ] <- covrespred %*% MASS::ginv(covpred)
resids <- numeric(nrows)
residm <- 0
residv <- 0
for (it in 2:nrows) {
 covrespred <- lambdaf*covrespred + lambda1*respv[it, ]*predm[it, ]
 covpred <- lambdaf*covpred + lambda1*outer(predm[it, ], predm[it, ])
 betas[it, ] <- covrespred %*% MASS::ginv(covpred)
 resids[it] <- respv[it, ] - (betas[it, ] %*% predm[it, ])
 residm <- lambdaf*residm + lambda1*resids[it]
 residv <- lambdaf*residv + lambda1*(resids[it] - residm)^2
 resids[it] <- (resids[it] - residm)/sqrt(residv)
} # end for
# Compare values, excluding warmup period
all.equal(regs[-(1:1e3), ], cbind(betas, resids)[-(1:1e3), ], check.attributes=FALSE)
 

## End(Not run)
