trade_classification: Classification and aggregation of high-frequency data
In monty-se/PINstimation: Estimation of the Probability of Informed Trading

trade_classification

R Documentation

Description

classify_trades() classifies high-frequency trading data into buyer-initiated and seller-initiated trades, using one of several classification algorithms and a user-specified time lag.
aggregate_trades() aggregates high-frequency trading data at the requested frequency. The aggregation is preceded by a trade classification step, which classifies trades using the chosen algorithm and time lag.

Usage

classify_trades(data, algorithm = "Tick", timelag = 0, ..., verbose = TRUE)

aggregate_trades(
  data,
  algorithm = "Tick",
  timelag = 0,
  frequency = "day",
  unit = 1,
  ...,
  verbose = TRUE
)

Arguments

`data`	A dataframe with 4 variables in the following order (`timestamp`, `price`, `bid`, `ask`).
`algorithm`	A character string referring to the algorithm used to determine the trade initiator (a buyer or a seller). It takes one of four values (`"Tick"`, `"Quote"`, `"LR"`, `"EMO"`). The default value is `"Tick"`. For more information about the different algorithms, check the details section.
`timelag`	A number referring to the time lag in milliseconds used to calculate the lagged midquote, bid and ask for the algorithms `"Quote"`, `"EMO"` and `"LR"`. The default value is `0`.
`...`	Additional arguments passed on to the functions `classify_trades()` and `aggregate_trades()`. The recognized arguments are `fullreport` and `is_parallel`. Other arguments will be ignored. `fullreport` is a binary variable passed to `aggregate_trades()` that specifies whether the variable `freq` is returned. The default value is `FALSE`. `is_parallel` is a logical variable passed to `classify_trades()` that specifies whether the computation is performed using parallel or sequential processing. The default value is `TRUE`. For more details, please refer to the vignette 'Parallel processing' in the package, or online.
`verbose`	A binary variable that determines whether detailed information about the progress of the trade classification is displayed. No output is produced when `verbose` is set to `FALSE`. The default value is `TRUE`.
`frequency`	The frequency used to aggregate intraday data. It takes one of the following values: `"sec"`, `"min"`, `"hour"`, `"day"`, `"week"`, `"month"`. The default value is `"day"`.
`unit`	An integer referring to the size of the aggregation window used to aggregate intraday data. The default value is `1`. For example, when the parameter `frequency` is set to `"min"`, and the parameter `unit` is set to 15, then the intraday data is aggregated every 15 minutes.

Details

The argument algorithm takes one of four values:

"Tick" refers to the tick algorithm: a trade is classified as a buy (sell) if its price is above (below) the closest different price of a previous trade.
"Quote" refers to the quote algorithm: a trade is classified as a buy (sell) if its price is above (below) the midpoint of the bid-ask spread. Trades executed at the mid-spread are not classified.
"LR" refers to the LR algorithm as in Lee and Ready (1991). It classifies a trade as a buy (sell) if its price is above (below) the mid-spread (quote algorithm), and uses the tick algorithm if the trade price is at the mid-spread.
"EMO" refers to the EMO algorithm as in Ellis et al. (2000). It classifies trades at the bid (ask) as sells (buys) and uses the tick algorithm to classify trades within the then-prevailing bid-ask spread.
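
The four rules above can be sketched in a few lines of base R. This is an illustrative sketch only, not the code used by classify_trades(): the helper names (tick_rule, quote_rule, lr_rule, emo_rule) are hypothetical, no time lag is applied, and the EMO sketch assumes that every trade not executed exactly at the bid or the ask falls back to the tick rule.

```r
# Illustrative sketch only; helper names are hypothetical and this is not
# the PINstimation implementation.

# Tick rule: compare each price with the closest different previous price.
tick_rule <- function(price) {
  n <- length(price)
  isbuy <- rep(NA, n)               # NA = cannot be classified
  last_diff <- NA_real_             # closest previous price that differs
  for (i in seq_len(n)[-1]) {
    if (price[i - 1] != price[i]) last_diff <- price[i - 1]
    if (!is.na(last_diff)) isbuy[i] <- price[i] > last_diff
  }
  isbuy
}

# Quote rule: compare each price with the midpoint of bid and ask;
# trades at the midpoint stay unclassified (NA).
quote_rule <- function(price, bid, ask) {
  mid <- (bid + ask) / 2
  ifelse(price > mid, TRUE, ifelse(price < mid, FALSE, NA))
}

# LR rule: quote rule first, tick rule for trades at the midpoint.
lr_rule <- function(price, bid, ask) {
  q <- quote_rule(price, bid, ask)
  ifelse(is.na(q), tick_rule(price), q)
}

# EMO rule: trades at the ask are buys, trades at the bid are sells,
# and (assumption) all remaining trades use the tick rule.
emo_rule <- function(price, bid, ask) {
  ifelse(price == ask, TRUE, ifelse(price == bid, FALSE, tick_rule(price)))
}

# Toy data: constant quotes of 10 / 11, so the midpoint is 10.5.
price <- c(10.0, 10.5, 10.5, 10.0, 11.0)
bid   <- rep(10, 5)
ask   <- rep(11, 5)

tick  <- tick_rule(price)
quote <- quote_rule(price, bid, ask)
lr    <- lr_rule(price, bid, ask)
emo   <- emo_rule(price, bid, ask)
```

On this toy data the quote rule leaves the two trades at 10.5 unclassified, while the LR sketch fills them in from the tick rule.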

Lee and Ready (1991) recommend using the mid-spread from five seconds earlier (the '5-second' rule), which mitigates trade misclassification for many of the 150 NYSE stocks they analyze. More recent studies, such as Piwowar and Wei (2006) and Aktas and Kryzanowski (2014), show that 1-second lagged midquotes yield lower misclassification rates. The default time lag is 0 (no lag). Given the ultra-fast nature of today's financial markets, the time lag is expressed in milliseconds, so lags shorter than one second can be specified with values such as 100 or 500.
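
To make the role of the time lag concrete, the sketch below looks up, for each trade, the most recent quote observed at least a given number of milliseconds earlier. It is a minimal illustration under stated assumptions, not the package's internal logic: lagged_index is a hypothetical helper, timestamps are plain numbers in milliseconds sorted in increasing order, and trades with no sufficiently old quote are assumed to get NA.

```r
# Hypothetical helper: for each trade time, the index of the latest
# observation whose time is at or before (trade time - timelag).
# time_ms must be sorted in increasing order; 0 means "no such quote".
lagged_index <- function(time_ms, timelag_ms) {
  findInterval(time_ms - timelag_ms, time_ms)
}

# Toy data: times in milliseconds since the first trade.
time_ms <- c(0, 300, 900, 1600, 2000)
bid     <- c(10, 10.25, 10.5, 10.75, 11)
ask     <- bid + 0.5

idx <- lagged_index(time_ms, timelag_ms = 500)
idx[idx == 0] <- NA                  # assumption: no lagged quote -> NA

# Midquote prevailing 500 ms before each trade.
lagged_mid <- (bid[idx] + ask[idx]) / 2
```

With a lag of 0 the same lookup returns each trade's own row, which corresponds to the no-lag default described above.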

Value

The function classify_trades() returns a dataframe of five variables. The first four variables are taken from the argument data: timestamp, price, bid, ask. The fifth variable is isbuy, which takes the value TRUE when the trade is classified as buyer-initiated, and FALSE when it is classified as seller-initiated.

The function aggregate_trades() returns a dataframe of two (or three) variables. If fullreport is set to TRUE, the returned dataframe has three variables {freq, b, s}. If fullreport is set to FALSE, the returned dataframe has two variables {b, s} and can therefore be used directly for the estimation of the PIN and MPIN models.
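
The aggregation step can be pictured as counting classified trades per time window. The base R sketch below does this for 15-minute windows. It is an illustration rather than the package's code, and it rests on assumptions not confirmed by this page: that b and s are the numbers of buyer-initiated and seller-initiated trades per window, that freq holds the start of each window, and that windows are aligned to clock time.

```r
# Toy classified trades (the isbuy column mimics classify_trades() output).
classified <- data.frame(
  timestamp = as.POSIXct("2024-01-02 09:30:00", tz = "UTC") +
    c(60, 300, 1000, 1700, 2000),
  isbuy = c(TRUE, FALSE, TRUE, TRUE, FALSE)
)

width <- 15 * 60                     # window size in seconds (15 minutes)

# Window id: number of whole 15-minute windows since the epoch.
bucket <- floor(as.numeric(classified$timestamp) / width)

# Count buys and sells within each window.
b <- tapply(classified$isbuy, bucket, sum)
s <- tapply(!classified$isbuy, bucket, sum)

agg <- data.frame(
  # Assumption: 'freq' is represented here by the window start time.
  freq = as.POSIXct(as.numeric(names(b)) * width,
                    origin = "1970-01-01", tz = "UTC"),
  b = as.vector(b),
  s = as.vector(s)
)
```

Dropping the freq column from agg leaves the two-column {b, s} layout described above.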

References

\insertAllCited

Examples

# There is a preloaded dataset called 'hfdata' contained in the package.
# It contains artificially created high-frequency trading data. The dataset
# contains 100 000 trades and five variables 'timestamp', 'price',
# 'volume', 'bid', and 'ask'. For more information, type ?hfdata.

xdata <- hfdata
xdata$volume <- NULL

# Use the EMO algorithm with a timelag of 500 milliseconds to classify
# high-frequency trades in the dataset 'xdata'

ctrades <- classify_trades(xdata, algorithm = "EMO", timelag = 500, verbose = FALSE)

# Use the LR algorithm with a timelag of 1 second to aggregate intraday data
# in the dataset 'xdata' at a frequency of 15 minutes.

lrtrades <- aggregate_trades(xdata, algorithm = "LR", timelag = 1000,
                             frequency = "min", unit = 15, verbose = FALSE)

# Use the Quote algorithm with a timelag of 1 second to aggregate intraday data
# in the dataset 'xdata' at a daily frequency.

qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000,
                            frequency = "day", unit = 1, verbose = FALSE)

# Since the argument 'fullreport' is set to FALSE by default, the
# output 'qtrades' can be used directly for the estimation of the PIN
# model, for example using pin_ea().

estimate <- pin_ea(qtrades, verbose = FALSE)

# Show the estimate

show(estimate)

monty-se/PINstimation documentation built on Oct. 22, 2024, 8:04 p.m.