R/data_aggregation.R

Defines functions .get_quote .hf_trades aggregate_trades classify_trades

Documented in aggregate_trades classify_trades

## - | FILE  HEADER |
##
## Script name:
##    data_aggregation.R
##
## Purpose of script:
##    Implement four classification algorithms used to aggregate
##    high frequency data into daily data.
##
## Author:
##    Montasser Ghachem
##
## Last updated:
##    2023-03-20
##
## License:
##    GPL 3
##
## Email:
##    montasser.ghachem@pinstimation.com
##
##
##
## Public functions:
## ++++++++++++++++++
##
## classify_trades():
##    Classify high-frequency trading data using different trade
##    classification algorithms, and time lags.
##
## aggregate_trades():
##    Aggregates high-frequency trading data into aggregated
##    data at the provided frequency using different trade
##    classification algorithms.
##
## ++++++++++++++++++
##
##
## --
## Package: PINstimation
## website: www.pinstimation.com
## Authors: Montasser Ghachem and Oguz Ersan


##       +++++++++++++++++++++++++
## ++++++| |  PUBLIC FUNCTIONS | |
##       +++++++++++++++++++++++++


#' @title Classification and aggregation of high-frequency data
#'
#' @description `classify_trades()` classifies high-frequency trading data into
#' buyer-initiated and seller-initiated trades using different algorithms, and
#' different time lags.
#' \cr `aggregate_trades()` aggregates high-frequency trading data into aggregated
#' data for provided frequency of aggregation. The aggregation is preceded by
#' a trade classification step which classifies trades using different trade
#' classification algorithms and time lags.
#'
#' @param data A dataframe with 4 variables in the following
#' order (`timestamp`, `price`, `bid`, `ask`).
#' @param algorithm A character string refers to the algorithm used
#' to determine the trade initiator, a buyer or a seller. It takes one of four
#' values (`"Tick"`, `"Quote"`, `"LR"`, `"EMO"`). The default value is
#' `"Tick"`. For more information about the different algorithms, check the
#' details section.
#'
#' @param timelag A number referring to the time lag in milliseconds
#' used to calculate the lagged midquote, bid and ask for the algorithms
#' \code{"Quote"}, \code{"EMO"} and \code{"LR"}.
#'
#' @param frequency The frequency used to aggregate intraday data. It takes one
#' of the following values: `"sec"`, `"min"`, `"hour"`, `"day"`, `"week"`,
#' `"month"`. The default value is `"day"`.
#'
#' @param  unit An integer referring to the size of the aggregation window
#' used to aggregate intraday data. The default value is `1`. For example, when
#' the parameter `frequency` is set to `"min"`, and the parameter `unit` is set
#' to 15, then the intraday data is aggregated every 15 minutes.
#'
#' @param ... Additional arguments passed on to the functions `classify_trades()`
#' `aggregate_trades()`. The recognized arguments are `fullreport`,
#' and `is_parallel`. Other arguments will be ignored.
#' \itemize{
#' \item `fullreport` is binary variable passed to `aggregate_trades()` that
#' specifies whether the variable `freq` is returned. The default value is
#' \code{FALSE}.
#' \item `is_parallel` is a logical variable passed to `classify_trades()` that
#' specifies whether the computation is performed using parallel or sequential
#' processing. #' The default value is `TRUE`. For more details, please refer to the
#' vignette 'Parallel processing' in the package, or
#' \href{https://pinstimation.com/articles/parallel_processing.html}{online}.
#' }
#' @param verbose A binary variable that determines whether detailed
#' information about the progress of the trade classification is displayed.
#' No output is produced when \code{verbose} is set to \code{FALSE}. The default
#' value is \code{TRUE}.
#'
#' @details
#' The argument `algorithm` takes one of four values:
#' \itemize{
#'    \item \code{"Tick"} refers to the tick algorithm: Trade is classified as a
#'    buy (sell) if the price of the trade to be classified
#'    is above (below) the closest different price of a previous trade.
#'    \item \code{"Quote"} refers to the quote algorithm: it classifies a
#'    trade as a buy (sell) if the trade price of the trade to be
#'    classified is above (below) the mid-point of the bid and ask spread.
#'    Trades executed at the mid-spread are not classified.
#'    \item \code{"LR"}  refers to `LR` algorithm as in
#'    \insertCite{LeeReady1991;textual}{PINstimation}. It classifies a trade
#'     as a buy (sell) if its price is above (below) the mid-spread (quote
#'     algorithm), and  uses the tick algorithm if the trade price is at
#'     the mid-spread.
#'     \item \code{"EMO"} refers to `EMO` algorithm as in
#'     \insertCite{Ellis2000;textual}{PINstimation}.
#'     It classifies trades at the bid (ask) as sells (buys) and uses the tick
#'     algorithm to classify trades within the then prevailing bid-ask spread.
#'  }
#'
#' `LR` recommend the use of mid-spread five-seconds earlier ('5-second'
#' rule) mitigating trade misclassifications for many of the \code{150}
#' NYSE stocks they analyze. On the other hand, in more recent studies such
#' as \insertCite{piwowar2006;textual}{PINstimation} and
#' \insertCite{Aktas2014;textual}{PINstimation}, the use of
#' 1-second lagged midquotes are shown to yield lower rates of
#' misclassifications. The default value is set to `0` seconds (no time-lag).
#' Considering the ultra-fast nature of today’s financial markets, time-lag
#' is in the unit of milliseconds. Shorter than 1-second lags can also be
#' implemented by entering values such as  `100` or `500`.
#'
#' @return
#' The function classify_trades() returns a dataframe of five variables. The
#' first four variables are obtained from the argument `data`: `timestamp`,
#' `price`, `bid`, `ask`. The fifth variable is `isbuy`, which takes the value
#' `TRUE`, when the trade is classified as a buyer-initiated trade, and `FALSE`
#' when the trade is classified as a seller-initiated trade.
#'
#' The function aggregate_trades() returns a dataframe of two
#' (or three) variables. If \code{fullreport} is set to \code{TRUE}, then
#' the returned dataframe has three variables `{freq, b, s}`. If
#' \code{fullreport} is set to \code{FALSE}, then the returned dataframe has
#' two variables `{b, s}`, and, therefore, can be #'directly used for the
#' estimation of the `PIN` and `MPIN` models.
#'
#' @references
#'
#' \insertAllCited
#'
#' @examples
#' # There is a preloaded dataset called 'hfdata' contained in the package.
#' # It is an artificially created high-frequency trading data. The dataset
#' # contains  100 000 trades and five variables 'timestamp', 'price',
#' # 'volume', 'bid', and 'ask'. For more information, type ?hfdata.
#'
#' xdata <- hfdata
#' xdata$volume <- NULL
#' \donttest{
#' # Use the EMO algorithm with a timelag of 500 milliseconds to classify
#' # high-frequency trades in the dataset 'xdata'
#'
#' ctrades <- classify_trades(xdata, algorithm = "EMO", timelag = 500, verbose = FALSE)
#'
#' # Use the LR algorithm with a timelag of 1 second to aggregate intraday data
#' # in the dataset 'xdata' at a frequency of 15 minutes.
#'
#'
#' lrtrades <- aggregate_trades(xdata, algorithm = "LR", timelag = 1000,
#' frequency = "min", unit = 15, verbose = FALSE)
#'
#' # Use the Quote algorithm with a timelag of 1 second to aggregate intraday data
#' # in the dataset 'xdata' at a daily frequency.
#'
#' qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000,
#' frequency = "day", unit = 1, verbose = FALSE)
#'
#' # Since the argument 'fullreport' is set to FALSE by default, then the
#' # output 'qtrades' can be used directly for the estimation of the PIN
#' # model, namely using pin_ea().
#'
#' estimate <- pin_ea(qtrades, verbose = FALSE)
#'
#' # Show the estimate
#'
#' show(estimate)
#' }
#' @name trade_classification
NULL


#' @rdname trade_classification
#' @export
classify_trades <- function(data,
                            algorithm = "Tick",
                            timelag = 0, ...,
                            verbose = TRUE) {

  return(.hf_trades(data, algorithm = algorithm, timelag = timelag, ...,
                    aggregate = FALSE, verbose = verbose))
}



#' @rdname trade_classification
#' @export
aggregate_trades <- function(data,
                             algorithm = "Tick",
                             timelag = 0,
                             frequency = "day",
                             unit = 1,
                             ...,
                             verbose = TRUE) {

  return(.hf_trades(data, algorithm = algorithm, timelag = timelag,
                    frequency = frequency, unit = unit,
                    ..., aggregate = TRUE, verbose = verbose))
}



##       +++++++++++++++++++++++++
## ++++++| | PRIVATE FUNCTIONS | |
##       +++++++++++++++++++++++++


.hf_trades <- function(data, algorithm = "Tick", timelag = 0, frequency = "day", unit = 1,
                       ..., verbose = TRUE) {

  # Check that all variables exist and do not refer to non-existent variables
  # --------------------------------------------------------------------------
  allvars <- names(formals())
  allvars <- allvars[-5]
  environment(.xcheck$existence) <- environment()
  .xcheck$existence(allvars, err = uierrors$aggregation()$fn)

  # Assign the dot-dot-dot arguments
  # --------------------------------------------------------------------------
  fullreport <- .default$fullreport
  is_parallel <- .default$aggregation_parallel
  vargs <- list(...)

  # check for unknown keys in the argument "..."
  unknown <- setdiff(names(vargs), c("fullreport", "is_parallel", "aggregate"))
  ux$stopnow(length(unknown) > 0, s = uierrors$aggregation()$fn,
             m = uierrors$arguments()$unknown(u = unknown))

  if (length(vargs) > 0 && "fullreport" %in% names(vargs))
    fullreport <- vargs$fullreport
  if (length(vargs) > 0 && "is_parallel" %in% names(vargs))
    is_parallel <- vargs$is_parallel
  if (length(vargs) > 0 && "aggregate" %in% names(vargs))
    aggregate <- vargs$aggregate
  vargs <- NULL

  # Check that all arguments are valid
  # -------------------------------------------------------------------------
  largs <- list(data, algorithm, timelag, frequency, unit, 0, verbose)
  names(largs) <- names(formals())
  largs[["..."]] <- NULL
  largs$is_parallel <- is_parallel
  largs$fullreport <- fullreport
  rst <- .xcheck$args(arglist = largs, fn = "aggregation")
  ux$stopnow(rst$off, m = rst$error, s = uierrors$aggregation()$fn)

  # Prepare the dataset
  # --------------------------------------------------------------------------
  # We rename the first four columns to "timestamp", "price", "bid", "ask"
  is_posixct <- function(x) inherits(x, "POSIXct")
  data <- as.data.frame(data[, 1:4])
  colnames(data) <- c("timestamp", "price", "bid", "ask")

  # We convert the columns price, bid and ask to numeric if they are not.
  # We also convert the timestamp variable to type 'PosixCT', if it does not
  # already have that type.

  if (!is_posixct(data$timestamp)) {
    stamps <- tryCatch({
      as.POSIXct(data$timestamp, format = "%Y-%m-%d %H:%M:%OS")
    }, error = function(err) {NA})

    if (sum(is.na(stamps)) == 0) {
      data$timestamp <- stamps
    } else {
      data$timestamp <- as.POSIXlt(
        as.character(strptime(data$timestamp, format = "%H:%M:%S")))
    }
  }

  if (!is.numeric(data$price)) data$price <- as.numeric(data$price)
  if (!is.numeric(data$bid))  data$bid <- as.numeric(data$bid)
  if (!is.numeric(data$ask))  data$ask <- as.numeric(data$ask)

  # Sort the dataset by timestamp
  data <- data[order(data$timestamp), 1:4]
  colnames(data) <- c("timestamp", "price", "bid", "ask")

  # Check for errors in the case of the algorithm, use the algorithm Tick if
  # the algorithm name is unrecognized.
  unrecognized <- !any(toupper(algorithm) %in% c("TICK", "LR", "QUOTE", "EMO"))
  if (missing(algorithm) | unrecognized) algorithm <- "Tick"
  aggregate_ms <- uix$classification(
    nrow(data), method = algorithm, timelag, "", isparallel = is_parallel)

  # Show messages

  ux$show(verbose, m = aggregate_ms$start)
  ux$show(verbose & unrecognized, m = aggregate_ms$unrecognized, warning = TRUE)
  ux$show(verbose, m = aggregate_ms$algorithm)
  ux$show(verbose, m = aggregate_ms$number)
  ux$show(verbose & algorithm != "Tick", m = aggregate_ms$lag)
  ux$show(verbose & algorithm != "Tick", m = aggregate_ms$computing)

  # The vector returned by .get_quote() takes the value 'TRUE' for
  # buys, and 'FALSE' for sells.
  isbuy <- day <- NULL
  data$isbuy <- .get_quote(data, timelag, algorithm, is_parallel, verbose)
  data <- data[!is.na(data$isbuy), ]

  # Get rid of null values
  notnull <- sapply(data$isbuy, function(x) !is.null(x))
  data <- data[notnull, ]


  if (aggregate == TRUE) {

    ux$show(verbose, m = aggregate_ms$aggregating)

    data$bid <- data$ask <- data$price <- NULL
    # Transform the timestamp variable into a variable 'freq' given the provided
    # frequency, then use it to aggregate the number of buys, and the number of
    # sells at this frequency

    # Assign timestamps to time groups - freq
    timefreq <- paste(unit, frequency)

    data$freq <- cut.POSIXt(x = data$timestamp, breaks = timefreq)

    data$b <- as.numeric(data$isbuy)
    data$s <- 1 - data$b
    data$timestamp <- data$isbuy <- NULL
    db <- aggregate(.~freq, data=data, sum)

    if (!fullreport) db$freq <- NULL
    data <- db
  }

  # Show complete message
  ux$show(verbose, m = aggregate_ms$complete)

  return(invisible(data))
}


.get_quote <- function(data, timelag, method, is_parallel, verbose) {
# computes the logical vector quote that takes the value 'TRUE' if buy, and
# FALSE if sell
#
# Args:
#   data    : a dataframe with 4 variables
#   ('timestamp', 'price', 'bid', 'ask')
#   method  : a character string refers to the method used to determine the
#   trade initiator, a buyer or a seller. It takes one of four values
#   (Tick, Quote, LR, EMO).
#   is_parallel : a logical variable specifying whether parallel computing is
#   used.
#   timelag : a timelag in milliseconds used to calculate the lagged midquote
#   verbose : if TRUE, shows the progress of classification process.
#
# Returns:
#   A binary vector identifying the buyer-initiated trades

  aggregate_ms <- uix$classification()

  .get_tick_vector <- function() {

    # Get the sign of the (initial) price difference
    data$diffprice <- sign(c(0, diff(data$price)))

    data$diffprice[is.na(data$diffprice)] <- 0

    # Replace the zeros by the previous non-zero value.
    data$diffprice <- Reduce(
      function(x, y) if (y == 0) x else y, data$diffprice,
                             accumulate = TRUE)

    # Using the (updated) price difference, get the tick value
    # The signs take the values -1, 0, and 1; to get 1, 2, 3; we
    # then add +2
    tradeclass <- function(diffprice)
      return(switch(diffprice + 2, FALSE, NA, TRUE))

    data$buy <- vapply(data$diffprice, tradeclass, logical(1))

    return(invisible(data$buy))
  }

  .get_lagged_value <- function(cindex) {

    if (!exists(".lwbound") || cindex == 1) .lwbound <- 1

    pasttimes <- data$timestamp[.lwbound:cindex]

    currenttime <- data$timestamp[cindex]

    threshold <- currenttime - (timelag / 1000)

    atorbelowthreshold <- .lwbound - 1 + findInterval(threshold, pasttimes)

    atorbelowthreshold <- max(atorbelowthreshold, 0)

    .lwbound <<- atorbelowthreshold

    if (verbose) setTxtProgressBar(pblagged, cindex)

    return(atorbelowthreshold)

  }


  if (method == "Tick") {

    return(.get_tick_vector())

  } else {

    # The method is either "EMO", "LR", or "Quote". They all rely on
    # lagged values if timelag > 0. If timelag == 0, then the lagged
    # values are the values themselves, so they have the same index.
    # If timelag > 0, then the lagged values are computed using
    # the lagged indices.

    laggedindices <- seq_len(nrow(data))
    xs <- seq_len(nrow(data))

    # If timelag > 0, update the lagged indices.
    if (timelag > 0) {

      if (verbose) {
        pblagged <- ux$progressbar(minvalue = 0, maxvalue = nrow(data))
        cat(aggregate_ms$progressbar)

      }

      time_on <- Sys.time()

      if (is_parallel) {

        oplan <- future::plan(multisession, gc = TRUE,
                              workers = .default$parallel_cores())

        on.exit(plan(oplan), add = TRUE)

        laggedindices <- furrr::future_map(xs, .get_lagged_value)

        laggedindices <- unlist(laggedindices)

      } else {

        laggedindices <- vapply(xs, .get_lagged_value, double(1))

      }

      time_off <- Sys.time()

      actualtime <- ux$showtime(ux$timediff(time_on, time_off), full = FALSE)

      aggregate_ms <- uix$classification(time = actualtime)

      if(verbose) cat("\n")
      ux$show(c = verbose, m = aggregate_ms$time)

    }

    zeros <- sum(laggedindices == 0)
    laggedindices <- laggedindices[laggedindices > 0]

    if (method == "EMO") { # EMO ALGORITHM

      data$lbid <- c(rep(NA, zeros), data$bid[laggedindices])
      data$bid <- NULL
      data$lask <- c(rep(NA, zeros), data$ask[laggedindices])
      data$ask <- NULL

      data$quote <- .get_tick_vector()

      at_laggedbid <- which(data$price == data$lbid)
      if (length(at_laggedbid) > 0) data[at_laggedbid, ]$quote <- FALSE
      at_laggedask <- which(data$price == data$lask)
      if (length(at_laggedask) > 0) data[at_laggedask, ]$quote <- TRUE

    } else { # QUOTE OR LR ALGORITHM ALGORITHM

      data$midquote <- (data$bid + data$ask) / 2
      data$bid <- data$ask <- NULL
      data$lmidquote <- c(rep(NA, zeros), data$midquote[laggedindices])

      data$priceminusmidquote <- data$price - data$lmidquote

      # Use tradeclass() to get the class of the trade "BUY", "SELL" or "NONE"
      tradeclass <- function(diffprice)
        return(switch(diffprice + 2, FALSE, NA, TRUE))

      data$quote <- lapply(sign(data$priceminusmidquote), tradeclass)

      # If the algorithm is quote, we stop here. If the algorithm is LR, then
      # there is an additional step

      if (method == "LR") { # LR ALGORITHM

        tick <- .get_tick_vector()

        # Replace the value of trade class by the value of the tick rule when
        # the price is at the midpoint i.e. when quote is equal to NA
        data$quote <- ifelse(data$priceminusmidquote == 0, tick, data$quote)

      }

    }

  }

  return(invisible(data$quote))

}

Try the PINstimation package in your browser

Any scripts or data that you put into this service are public.

PINstimation documentation built on March 31, 2023, 6:32 p.m.