quotesCleanup: Cleans quote data
In jonathancornelissen/highfrequency: Tools for Highfrequency Data Analysis

quotesCleanup

R Documentation

Cleans quote data

Description

This is a wrapper function for cleaning the quote data in the entire folder dataSource. The result is saved in the folder dataDestination.

In case you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns the cleaned quotes as xts or data.table object (see examples).

The following cleaning functions are applied sequentially: noZeroQuotes, exchangeHoursOnly, autoSelectExchangeQuotes or selectExchange, rmNegativeSpread, rmLargeSpread, mergeQuotesSameTimestamp, rmOutliersQuotes.
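The sequence above can be sketched by hand on the package's sample data. This is only an illustration: it assumes the individual cleaning functions are exported with the argument names shown (`exch`, `maxi`, `selection`, `window`, `type`), and the wrapper's internals may differ in detail, so the result is not guaranteed to match `quotesCleanup` exactly.

```r
library(highfrequency)
library(data.table)

# Work on a copy so the packaged sample data is left untouched
q <- copy(sampleQDataRaw)

q <- noZeroQuotes(q)                                   # drop zero bid/offer quotes
q <- exchangeHoursOnly(q, marketOpen = "09:30:00",
                       marketClose = "16:00:00")       # keep trading hours only
q <- selectExchange(q, exch = "N")                     # keep NYSE quotes
q <- rmNegativeSpread(q)                               # drop negative spreads
q <- rmLargeSpread(q, maxi = 50)                       # drop very wide spreads
q <- mergeQuotesSameTimestamp(q, selection = "median") # one quote per timestamp
q <- rmOutliersQuotes(q, maxi = 10, window = 50,
                      type = "standard")               # drop outlying quotes

dim(q)
```

Calling the steps one by one is mainly useful for inspecting how many observations each filter removes; for routine use the wrapper is more convenient.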

Usage

quotesCleanup(
  dataSource = NULL,
  dataDestination = NULL,
  exchanges = "auto",
  qDataRaw = NULL,
  report = TRUE,
  selection = "median",
  maxi = 50,
  window = 50,
  type = "standard",
  marketOpen = "09:30:00",
  marketClose = "16:00:00",
  rmoutliersmaxi = 10,
  printExchange = TRUE,
  saveAsXTS = FALSE,
  tz = NULL
)

Arguments

`dataSource`	character indicating the folder in which the original data is stored.
`dataDestination`	character indicating the folder in which the cleaned data is stored.
`exchanges`	vector of stock exchange symbols for all data in dataSource, e.g. `exchanges = c("T","N")` retrieves all stock market data from both NASDAQ and NYSE. The possible exchange symbols are: A: AMEX; N: NYSE; B: Boston; P: Arca; C: NSX; T/Q: NASDAQ; D: NASD ADF and TRF; X: Philadelphia; I: ISE; M: Chicago; W: CBOE; Z: BATS. The default value is `"auto"`, which selects the exchange for each stock and day independently using the `autoSelectExchangeQuotes` function.
`qDataRaw`	`xts` or `data.table` object containing raw quote data, possibly for multiple symbols over multiple days. This argument is `NULL` by default. Supplying it means the arguments `dataSource` and `dataDestination` are ignored. This is only advisable for small chunks of data.
`report`	boolean, `TRUE` by default. If `TRUE` and the on-disk functionality is not used, the function also returns a vector indicating how many quotes were deleted by each cleaning step.
`selection`	argument to be passed on to the cleaning routine `mergeQuotesSameTimestamp`. The default is `"median"`.
`maxi`	argument to be passed on to the cleaning routine `rmLargeSpread`. Quotes whose spread exceeds `maxi` times the median spread of the day are removed.
`window`	argument to be passed on to the cleaning routine `rmOutliersQuotes`.
`type`	argument to be passed on to the cleaning routine `rmOutliersQuotes`.
`marketOpen`	passed to `exchangeHoursOnly`. A character in the format of `"HH:MM:SS"`, specifying the starting hour, minute and second of an exchange.
`marketClose`	passed to `exchangeHoursOnly`. A character in the format of `"HH:MM:SS"`, specifying the closing hour, minute and second of an exchange.
`rmoutliersmaxi`	argument to be passed on to the cleaning routine `rmOutliersQuotes`.
`printExchange`	argument passed to `autoSelectExchangeQuotes`, indicating whether the chosen exchange is printed to the console. `TRUE` by default. This is only used when `exchanges` is `"auto"`.
`saveAsXTS`	indicates whether data should be saved in `xts` format instead of `data.table` when using on-disk functionality. `FALSE` by default, which means we save as `data.table`.
`tz`	fallback time zone used in case we are unable to identify the time zone of the data; `tz = NULL` by default. With the non-disk functionality, we attempt to extract the time zone from the DT column (or index) of the data, which may fail. In case of failure we use `tz` if specified, and `"UTC"` otherwise. In the on-disk functionality, if `tz` is not specified, the time zone used will be the system default.
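To make the `maxi` rule concrete, here is a minimal base R sketch on toy data. The data frame, its values, and the strict "greater than" comparison are assumptions for illustration; `BID` and `OFR` merely mirror TAQ-style column names, and this is not the package's implementation.

```r
# Toy quotes for a single day (hypothetical values)
quotes <- data.frame(
  BID = c(100.00, 100.01, 100.02, 100.00, 99.00),
  OFR = c(100.02, 100.03, 100.04, 100.02, 105.00)
)

maxi <- 50                                # same default as in the usage above
spread <- quotes$OFR - quotes$BID         # bid-ask spread per quote
threshold <- maxi * median(spread)        # maxi times the day's median spread

# Keep quotes whose spread does not exceed the threshold
cleaned <- quotes[spread <= threshold, ]
nrow(cleaned)
```

Here the median spread is about 0.02, so the threshold is about 1; the last quote, with a spread of 6, is the only one dropped.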

Details

Using the on-disk functionality with .csv.zip files, which is the standard format from the WRDS database, will write temporary files on your machine. We try to clean up afterwards, but cannot guarantee that no files slip through the cracks if the permission settings on your machine do not match ours.

If the input data.table does not contain a DT column but it does contain DATE and TIME_M columns, we create the DT column by REFERENCE, altering the data.table that may be in the user's environment!
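The by-reference behaviour can be illustrated with a small `data.table` sketch. The helper `addDT` below is hypothetical and only mimics the idea of adding a `DT` column with `:=`; it is not the package's code.

```r
library(data.table)

# Hypothetical helper: adds a DT column by reference using `:=`
addDT <- function(dt) {
  dt[, DT := as.POSIXct(paste(DATE, TIME_M), tz = "UTC")]
  invisible(dt)
}

raw <- data.table(DATE = "2018-01-02",
                  TIME_M = c("09:30:00", "09:30:01"))
backup <- copy(raw)        # deep copy, unaffected by later changes to raw

addDT(raw)

"DT" %in% names(raw)       # TRUE: the caller's object was altered
"DT" %in% names(backup)    # FALSE: the copy is unchanged
```

If the original object must stay unchanged, passing `copy(yourData)` as `qDataRaw` is one way to guard against this side effect.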

Value

The function converts every (compressed) csv (or rds) file in dataSource into multiple xts or data.table files.

In dataDestination, there will be one folder for each symbol containing .rds files with cleaned data stored either in data.table or xts format.

In case you supply the argument qDataRaw, the on-disk functionality is ignored and the function returns a list with the cleaned quotes as an xts or data.table object depending on input (see examples).

Author(s)

Jonathan Cornelissen, Kris Boudt, Onno Kleen, and Emil Sjoerup.

References

Barndorff-Nielsen, O. E., Hansen, P. R., Lunde, A., and Shephard, N. (2009). Realized kernels in practice: Trades and quotes. Econometrics Journal 12, C1-C32.

Brownlees, C. T. and Gallo, G. M. (2006). Financial econometric analysis at ultra-high frequency: Data handling concerns. Computational Statistics & Data Analysis 51, 2232-2245.

Falkenberry, T.N. (2002). High frequency data filtering. Unpublished technical report.

Examples

library(highfrequency)

# Suppose you have raw quote data for one stock over two days
head(sampleQDataRaw)
dim(sampleQDataRaw)
qDataAfterCleaning <- quotesCleanup(qDataRaw = sampleQDataRaw, exchanges = "N")
qDataAfterCleaning$report
dim(qDataAfterCleaning$qData)

# In case you have more data it is advised to use the on-disk functionality
# via "dataSource" and "dataDestination" arguments
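A minimal sketch of the on-disk workflow mentioned in the comment above. The folder names are hypothetical placeholders, and the call is guarded so that nothing runs unless the raw folder actually contains files.

```r
library(highfrequency)

# Hypothetical folders; replace with your own locations
rawDir   <- file.path(tempdir(), "raw_quotes")
cleanDir <- file.path(tempdir(), "clean_quotes")

# Only run when the raw folder actually holds files to clean
if (dir.exists(rawDir) && length(list.files(rawDir)) > 0) {
  dir.create(cleanDir, showWarnings = FALSE, recursive = TRUE)

  quotesCleanup(dataSource = rawDir, dataDestination = cleanDir,
                exchanges = "auto", saveAsXTS = FALSE)

  # Expect one sub-folder per symbol, each holding .rds files
  cleanedFiles <- list.files(cleanDir, pattern = "\\.rds$",
                             recursive = TRUE, full.names = TRUE)
  if (length(cleanedFiles) > 0) print(head(readRDS(cleanedFiles[1])))
}
```

With `saveAsXTS = FALSE` the stored objects are `data.table`s, so each file can be read back with `readRDS` and used directly.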

jonathancornelissen/highfrequency documentation built on Jan. 10, 2023, 7:29 p.m.