| trade_classification | R Documentation |
classify_trades() classifies high-frequency trading data into
buyer-initiated and seller-initiated trades using different algorithms, and
different time lags (or leads).
aggregate_trades() aggregates high-frequency trading data into
aggregated data for provided frequency of aggregation. The aggregation is
preceded by a trade classification step which classifies trades using
different trade classification algorithms and time lags (or leads).
classify_trades(data, algorithm = "Tick", timelag = 0, ..., verbose = TRUE)
aggregate_trades(
  data,
  algorithm = "Tick",
  timelag = 0,
  frequency = "day",
  unit = 1,
  ...,
  verbose = TRUE
)
data |
A dataframe with 4 variables in the following
order ( |
algorithm |
A character string referring to the algorithm used
to determine the trade initiator, a buyer or a seller. It takes one of four
values ( |
timelag |
Numeric scalar. Time offset in microseconds used to select
the quote matched to each trade for the Examples: |
... |
Additional arguments passed to the functions
|
verbose |
A binary variable that determines whether detailed
information about the progress of the trade classification is displayed.
No output is produced when |
frequency |
The frequency used to aggregate intraday data. It takes one
of the following values: |
unit |
An integer referring to the size of the aggregation window
used to aggregate intraday data. The default value is |
Trade classification algorithms
The argument algorithm takes one of four values:
"Tick" refers to the tick algorithm: Trade is classified as a
buy (sell) if the price of the trade to be classified
is above (below) the closest different price of a previous trade.
"Quote" refers to the quote algorithm: it classifies a
trade as a buy (sell) if the trade price of the trade to be
classified is above (below) the mid-point of the bid and ask spread.
Trades executed at the mid-spread are not classified.
"LR" refers to LR algorithm as in
Lee and Ready (1991). It classifies a trade
as a buy (sell) if its price is above (below) the mid-spread (quote
algorithm), and uses the tick algorithm if the trade price is at
the mid-spread.
"EMO" refers to EMO algorithm as in
Ellis et al. (2000).
It classifies trades at the bid (ask) as sells (buys) and uses the tick
algorithm to classify trades within the then prevailing bid-ask spread.
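The four rules above can be illustrated with a minimal base-R sketch. This is not the package's internal code: the function names (tick_rule, quote_rule, lr_rule, emo_rule) and the toy data are hypothetical, and the EMO sketch assumes that every trade not executed exactly at the bid or ask is left to the tick rule.

```r
# Illustrative sketch only; all function names are hypothetical.

# Tick rule: compare each price with the closest different previous price.
tick_rule <- function(price) {
  n <- length(price)
  isbuy <- rep(NA, n)
  last_diff <- NA_real_  # closest previous price that differs from the current one
  for (i in seq_len(n)) {
    if (i > 1 && price[i - 1] != price[i]) last_diff <- price[i - 1]
    if (!is.na(last_diff)) isbuy[i] <- price[i] > last_diff
  }
  isbuy
}

# Quote rule: compare the price with the midquote; NA at the midquote.
quote_rule <- function(price, bid, ask) {
  mid <- (bid + ask) / 2
  ifelse(price > mid, TRUE, ifelse(price < mid, FALSE, NA))
}

# LR: quote rule first, tick rule for trades exactly at the midquote.
lr_rule <- function(price, bid, ask) {
  out <- quote_rule(price, bid, ask)
  at_mid <- !is.na(bid + ask) & price == (bid + ask) / 2
  out[at_mid] <- tick_rule(price)[at_mid]
  out
}

# EMO: trades at the ask are buys, trades at the bid are sells,
# remaining trades keep the tick-rule result (assumption of this sketch).
emo_rule <- function(price, bid, ask) {
  out <- tick_rule(price)
  out[!is.na(ask) & price == ask] <- TRUE
  out[!is.na(bid) & price == bid] <- FALSE
  out
}

# Toy data (made up for illustration)
price <- c(10, 10, 11, 11, 10.5, 10.5, 12)
bid   <- c(10, 10, 10, 10.5, 10, 10.5, 11)
ask   <- c(11, 11, 11, 11.5, 11, 11.5, 12)

data.frame(price, bid, ask,
           tick  = tick_rule(price),
           quote = quote_rule(price, bid, ask),
           lr    = lr_rule(price, bid, ask),
           emo   = emo_rule(price, bid, ask))
```

In this toy sample the first two trades cannot be classified by the tick rule because no different earlier price exists, while the quote-based rules can still classify them.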
Time lags vs. leads (timelag)
For the "Quote", "LR" and "EMO" algorithms, classification relies on a
quote (bid, ask or midquote) matched to each trade. The argument timelag
controls when that quote is taken relative to the trade time:
Positive lags (timelag > 0): for a trade at time t, the
algorithm uses the quote corresponding to the last trade observed
at or before t - |timelag|/1e6 seconds. If no such past trade exists,
the trade has no matched quote.
Zero lag (timelag = 0): for a trade at time t, the algorithm
uses the quote attached to that trade itself, which in the data setup
corresponds to the bid–ask spread just before the trade is executed.
Negative lags / leads (timelag < 0): for a trade at time t,
the algorithm uses the quote corresponding to the last trade observed
at or before t + |timelag|/1e6 seconds (a future quote). If no such future
trade exists, the trade has no matched quote.
In all cases the time offset is interpreted in seconds as timelag/1e6.
For example, timelag = 500000 corresponds to 0.5
seconds lag, and timelag = -2000000 corresponds to a 2-second lead.
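The matching logic can be sketched in base R with findInterval(). The helper name match_quote_index and the toy timestamps are hypothetical, and this is not the package's implementation. The sketch assumes that trade times are sorted in increasing order and that a lead pointing past the last trade in the sample leaves the trade unmatched.

```r
# Hypothetical helper: for each trade, the index of the trade whose quote
# would be matched to it under a given timelag (in microseconds).
match_quote_index <- function(time_sec, timelag) {
  # time_sec: trade times in seconds, sorted in increasing order
  # timelag : > 0 for a lag, < 0 for a lead, 0 for the trade's own quote
  target <- time_sec - timelag / 1e6       # convert microseconds to seconds
  idx <- findInterval(target, time_sec)    # last trade at or before 'target'
  idx[idx == 0] <- NA                      # no earlier trade: no matched quote
  # Assumption of this sketch: a lead beyond the sample end is unmatched
  idx[target > max(time_sec)] <- NA
  idx
}

time_sec <- c(0, 0.2, 0.6, 1.0, 2.5)

match_quote_index(time_sec, 0)        # each trade uses its own quote
match_quote_index(time_sec, 500000)   # 0.5-second lag
match_quote_index(time_sec, -500000)  # 0.5-second lead
```

With the 0.5-second lag, the first two trades have no trade at or before t - 0.5 and therefore receive NA; with the 0.5-second lead, the last trade points past the end of the sample.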
Trades for which no suitable lagged/leading quote exists within the requested window are handled as follows:
For "Quote", the corresponding trades receive NA classifications.
For "LR", the quote-based classification is still used where
available; trades exactly at the (lagged/leading) midquote fall back to
the tick rule. When no midquote exists within the window, the result is
NA.
For "EMO", the bid/ask from the lagged/leading quote is used when
available. If no such quote exists, the EMO quote-based step is skipped
and the tick rule classification is retained.
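As a rough illustration of these fallbacks, the following base-R sketch applies the three quote-based rules to a toy sample in which one trade has no matched quote. All values are made up; tick is assumed to hold tick-rule results computed beforehand, and bid/ask are assumed to be the already matched (lagged or leading) quotes.

```r
# Toy inputs (hypothetical): trade 3 has no matched quote.
price <- c(10.2, 10.5, 10.8, 11.0)
tick  <- c(TRUE, FALSE, TRUE, FALSE)  # assumed tick-rule classifications
bid   <- c(10.0, 10.0, NA, 10.0)
ask   <- c(11.0, 11.0, NA, 11.0)
mid   <- (bid + ask) / 2

# "Quote": NA when no midquote exists or the trade is exactly at the midquote
quote_cls <- ifelse(price == mid, NA, price > mid)

# "LR": quote result where available, tick rule at the midquote,
# NA when no midquote exists
lr_cls <- ifelse(is.na(mid), NA, ifelse(price == mid, tick, price > mid))

# "EMO": keep the tick rule, override only where a matched bid/ask exists
emo_cls <- tick
emo_cls[!is.na(ask) & price == ask] <- TRUE
emo_cls[!is.na(bid) & price == bid] <- FALSE

data.frame(price, bid, ask, tick, quote_cls, lr_cls, emo_cls)
```

Trade 3 ends up as NA under "Quote" and "LR" but keeps its tick-rule value under "EMO", which mirrors the three cases described above.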
Lee and Ready (1991) recommend using the mid-spread from five seconds
earlier (the '5-second' rule), which mitigates trade misclassification
for many of the 150 NYSE stocks they analyze. More recent studies, such
as Piwowar and Wei (2006) and Aktas and Kryzanowski (2014), show that
1-second lagged midquotes yield lower misclassification rates. The
default value is 0 (no time lag). Given the ultra-fast nature of today's
financial markets, timelag is expressed in microseconds. Lags shorter
than one second can be implemented by entering values such as 100000 or
500000 (0.1 or 0.5 seconds).
The function classify_trades() returns a dataframe of five variables. The
first four variables are obtained from the argument data: timestamp,
price, bid, ask. The fifth variable is isbuy, which takes the value
TRUE, when the trade is classified as a buyer-initiated trade, and FALSE
when the trade is classified as a seller-initiated trade.
The function aggregate_trades() returns a dataframe of two
or three variables. If fullreport is set to TRUE, the
returned dataframe has three variables {freq, b, s}. If
fullreport is set to FALSE, the returned dataframe has
two variables {b, s} and can therefore be used directly for the
estimation of the PIN and MPIN models.
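To make the shape of the aggregated output concrete, here is a minimal base-R sketch that counts buys (b) and sells (s) per 15-minute window from a classified sample. The data and the windowing approach (flooring epoch seconds) are assumptions made for illustration, not the package's implementation; trades with an NA classification are simply not counted.

```r
# Hypothetical classified trades, shaped like part of a classify_trades() result
trades <- data.frame(
  timestamp = as.POSIXct("2024-01-02 09:30:00", tz = "UTC") +
    c(0, 300, 950, 1200, 1850, 2000),
  isbuy = c(TRUE, FALSE, TRUE, TRUE, NA, FALSE)
)

# 15-minute windows, analogous to frequency = "min" and unit = 15
window_sec <- 15 * 60
win <- floor(as.numeric(trades$timestamp) / window_sec)

# Count buys and sells per window, ignoring unclassified (NA) trades
b <- tapply(trades$isbuy,  win, function(x) sum(x, na.rm = TRUE))
s <- tapply(!trades$isbuy, win, function(x) sum(x, na.rm = TRUE))

agg <- data.frame(
  freq = as.POSIXct(as.numeric(names(b)) * window_sec,
                    origin = "1970-01-01", tz = "UTC"),
  b = as.integer(b),
  s = as.integer(s)
)

agg                  # three variables {freq, b, s}
agg[, c("b", "s")]   # two variables {b, s}
```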
# There is a preloaded dataset called 'hfdata' contained in the package.
# It is an artificially created high-frequency trading data. The dataset
# contains 100 000 trades and five variables 'timestamp', 'price',
# 'volume', 'bid', and 'ask'. For more information, type ?hfdata.
xdata <- hfdata
xdata$volume <- NULL
# Use the LR algorithm with a timelag of 0.5 seconds i.e. 500000
# microseconds to classify high-frequency trades in the dataset 'xdata'
lgtrades <- classify_trades(xdata, "LR", timelag = 500000, verbose = FALSE)
# LR algorithm with a 0.5-second lead (-500000 microseconds)
ldtrades <- classify_trades(xdata, "LR", timelag = -500000, verbose = FALSE)
# Compare the number of buyer- and seller-initiated trades between the
# lagged and leading LR classifications.
comparison_tbl <- rbind(
  transform(lgtrades, version = "lag of 0.5s"),
  transform(ldtrades, version = "lead of 0.5s")
)
comparison_tbl <- with(comparison_tbl,
  aggregate(list(Buys = as.logical(isbuy), Sells = !as.logical(isbuy)),
            by = list(version = version),
            FUN = sum, na.rm = TRUE)
)
show(comparison_tbl)
# Use the EMO algorithm with a timelag of 1 second, i.e. 1000000 microseconds
# to aggregate intraday data in 'xdata' at a frequency of 15 minutes.
emotrades <- aggregate_trades(xdata, algorithm = "EMO", timelag = 1000000,
                              frequency = "min", unit = 15, verbose = FALSE)
# Use the Quote algorithm with a timelag of 1 second to aggregate intraday
# data in the dataset 'xdata' at a daily frequency.
qtrades <- aggregate_trades(xdata, algorithm = "Quote", timelag = 1000000,
                            frequency = "day", unit = 1, verbose = FALSE)
# Since the argument 'fullreport' is set to FALSE by default, the
# output 'qtrades' can be used directly for the estimation of the PIN
# model, namely using pin_ea().
estimate <- pin_ea(qtrades, verbose = FALSE)
# Show the estimate
show(estimate)