This paper provides a working example of an equity backtest conducted in R according to the principles of transparency and reproducibility in research. It provides working code for the full equity backtesting analysis chain in a way that is fully automated and modifiable. The companion git repository to this paper can be cloned and, with minimal modification, used to create new backtests that avoid common statistical biases. It is hoped that this work will make quality research into the cross section of equity returns more accessible to practitioners.
This project favours transparency and customizability over ease of use. For a sophisticated, easy-to-use backtesting environment, see Zipline or QSTrader.
This project aims to be:

- Totally transparent in the flow and transformation of data
- Low-level in terms of dependencies
- Highly customizable
- Easy to set up in any environment
This project should be useful to researchers and practitioners who work in R or Python.

This repository does not include any equity data, but it does include working scripts to automatically extract data from data vendors and to save that data in a well-formed way. It is assumed that the user will acquire data using the scripts included in this repository. This will only be possible if the user has access to the relevant data services, namely Bloomberg, DataStream or iNet.
The only files that change between different backtests are `algorithm.R`, which houses the portfolio weighting rules, and `parameters.R`, which houses the backtest parameters.
To replicate the results of a backtest using this codebase, simply clone this git repository to your computer and run the `trade.R` script.
To replicate the results of this paper on another equity index, change the index parameter in the `parameters.R` file.
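By way of illustration, `parameters.R` is assumed to contain settings along the following lines; the actual parameter names in the repository may differ.

```r
# Illustrative only: hypothetical contents of parameters.R.
index          <- "TOP40"                  # equity index whose constituents are traded
backtest_start <- as.Date("2005-01-01")    # first day of the backtest date range
backtest_end   <- as.Date("2017-12-31")    # last day of the backtest date range
starting_cash  <- 1e6                      # initial portfolio value
```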
To alter the algorithm, modify the `compute_weights` function in the `algorithm.R` script.
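As an example of an alternative weighting rule, a hedged sketch is shown below. The signature used here (a runtime dataset in, a table of tickers and target weights out) is an assumption for illustration; consult `algorithm.R` for the actual interface.

```r
# Illustrative only: an equal-weight rule in the style of compute_weights().
# The argument and return shapes are assumptions; the real definition lives
# in algorithm.R.
compute_weights <- function(runtime_dataset) {
  tickers <- unique(runtime_dataset$ticker)   # investable universe on this date
  data.frame(ticker = tickers,
             weight = rep(1 / length(tickers), length(tickers)))
}
```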
The scripts in this repository are written in a procedural manner: parameters are set prior to execution and the scripts are run non-interactively. The code logic flows down each script in a linear manner wherever possible. This style makes it easy to audit how data is manipulated as it flows through the code.
The data pipeline scripts can be run in a standalone fashion. This facilitates spreading scripts across multiple machines, which can be helpful when data vendor terminals are shared or on different machines. In such instances, the query scripts can be run on each terminal and the logs copied across to a single logfile for downstream processing. Logfiles are timestamped with Unix time, so logs from different machines will sort correctly.
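As an illustration of the naming scheme, a timestamp prefix can be generated as follows (illustrative code, not the repository's actual logging helper).

```r
# Illustrative only: a millisecond Unix timestamp prefix means that a plain
# lexical sort of filenames gathered from several machines is also a
# chronological sort.
unix_stamp <- sprintf("%.0f", as.numeric(Sys.time()) * 1000)
logfile    <- paste(unix_stamp, "BLOOMBERG", "ticker_data.csv", sep = "__")
logfile
```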
Because each data pipeline script can be run standalone (i.e. from a clean R environment), they are amenable to scheduling with `cron`, `anacron` or similar.
Effort has been made to keep the simulation scripts as simple as possible, to keep them true to the spirit of this project. Nevertheless, the event-driven nature of the simulation makes some additional complexity unavoidable.
Wherever possible, the code is kept in a single file. Code is split between several files when the contents of the files do not naturally run as the same execution batch. For instance, in an academic setting, data is often collected from vendors through shared terminals, often situated in a library or laboratory. It does not make sense to include the data collection code with the downstream processing code, because downstream processing can be done on another machine at a different time, freeing up the terminal for another user.
Researchers are often limited by their computing resources. Most of the time, research is conducted on consumer hardware with limited amounts of RAM. To accommodate this, workloads are chunked or run sequentially to limit RAM consumption, which slows down execution. The worst-offending code chunks have been parallelized to mitigate this, and data is stored on disk in `feather` format to speed up disk I/O.
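For instance, a chunk of query results can be persisted and reloaded roughly as follows (a minimal sketch using the `feather` package; the repository's own read/write wrappers may differ).

```r
library(feather)

# Illustrative only: write one processed chunk to disk and read it back.
# Column-oriented feather files are fast to write and reload, which keeps
# RAM usage low when chunks are processed sequentially.
chunk <- data.frame(ticker = c("ABC", "XYZ"),
                    date   = as.Date(c("2012-10-01", "2012-10-01")),
                    close  = c(101.2, 55.4))
write_feather(chunk, "datalog_chunk.feather")
reloaded <- read_feather("datalog_chunk.feather")
```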
The simulation loops over trading days, starting from the date range specified in `parameters.R`, and runs until the backtest date range ends (backtest mode) or never (live mode). On each iteration (a sketch of the loop follows the list below):
a. A runtime dataset is created, adjusting for survivorship and look-ahead bias.
b. The runtime dataset is passed to the `compute_weights` function in `algorithm.R`, and target portfolio weights are computed.
c. Transaction logs and trade history are loaded to compute current portfolio positions.
d. An order list is generated from the diff of target weights and existing weights.
e. Orders are submitted to the trading engine.
f. The trading engine reads market data for the tickers and submits trades.
g. If volume limits are not hit, the trade completes and is appended to the trade history log. Cash movements are appended to the transaction log.
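A compressed sketch of this loop is shown below. Apart from `compute_weights`, every function name in the sketch is a hypothetical placeholder invented for illustration; the actual loop lives in `trade.R` and carries considerably more detail.

```r
# Illustrative sketch only -- helper functions are hypothetical placeholders.
current_date <- backtest_start                        # set in parameters.R
while (current_date <= backtest_end) {                # live mode never exits
  runtime_data <- build_runtime_dataset(master_data, current_date)   # (a)
  target_w     <- compute_weights(runtime_data)                      # (b)
  current_w    <- current_positions(trade_log, transaction_log)      # (c)
  orders       <- diff_weights(target_w, current_w)                  # (d, e)
  fills        <- trading_engine(orders, market_data, current_date)  # (f)
  trade_log       <- rbind(trade_log, fills$trades)                  # (g)
  transaction_log <- rbind(transaction_log, fills$cash_movements)
  current_date <- next_trading_day(current_date)
}
```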
The data query scripts save four kinds of dataset to the `datalog` folder.
The files in the `datalog` directory can be identified by their filenames. Each filename is composed of substrings separated by two underscores (i.e. `__`).
A sample filename is `1029384859940__BLOOMBERG__constituent_list_data__20121001_TOP40.csv`. This naming convention allows the datalog to be searched by a combination of substrings, which the dataset builder leverages to transform the datalog into well-formed datasets.
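For example, the dataset builder could select all Bloomberg constituent-list logs with something along these lines (an illustrative sketch; the builder's actual matching code may differ).

```r
# Illustrative only: filter the datalog by a combination of filename substrings.
files <- list.files("datalog", full.names = TRUE)
hits  <- files[grepl("BLOOMBERG", files) & grepl("constituent_list_data", files)]
hits  <- sort(hits)   # Unix-time prefixes make lexical order chronological
```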
This project will continue to change. Users are advised to compare commits on GitHub to see how the code evolves over time. It is anticipated that this codebase will be altered significantly and rapidly until December 2018; thereafter development will slow. A good way to check the development status is to look at the GitHub commit log.
It is advised that a user of this codebase fork the repository in order to stabilize their codebase at a particular point in time. When sharing your work, you can facilitate peer review and replication by including access to your forked repository.
There is no guaranteed support for this project past December 2018, but users are free to fork and develop it on their own. Wherever possible, the code is written in R to ensure compatibility with generally available software for the foreseeable future.
This codebase is being built by Riaz Arbi, who can be contacted via riazarbi.github.io. I take no responsibility for any errors contained in this code; users should verify each line of the codebase to ensure that it operates as expected.
This is the intellectual property of Riaz Arbi. This code will be released under a strong copyleft license. A license is included in the repository.
| Feature | Planned | Implemented |
|:--------------------------------------------------------------------|:-------:|:-----------:|
| Specification: Log formats                                          | X       | X           |
| Pipeline: Bootstrap regeneration if data lost                       | X       | X           |
| Log Dataset: Keep query logs forever                                | X       | X           |
| Mining: Summary statistics of logfiles                              |         |             |
| Pipeline: Accept arbitrary data sources                             | X       | X           |
| Pipeline: Bloomberg query script                                    | X       | X           |
| Pipeline: Datastream query script                                   | X       |             |
| Pipeline: iNet query script                                         | X       |             |
| Pipeline: Random Ticker Dataset Generator                           | X       | X           |
| Pipeline: Data compaction                                           | X       | X           |
| Pipeline: Log to dataset                                            | X       | X           |
| Specification: Dataset formats                                      | X       | X           |
| Master Dataset: Data versioning                                     | X       | X           |
| Mining: Summary statistics of datasets                              |         |             |
| Pipeline: Join ticker, constituent and metadata into master dataset | X       | X           |
| In-memory Dataset: Automate loading                                 | X       | X           |
| Scoring Dataset: Compute ticker returns timeseries for scoring      |         |             |
| Scoring Dataset: Compute benchmark returns timeseries for scoring   |         |             |
| Feature | Planned | Implemented |
|:-------------------------------------------------------|:-------:|:--------------:|
| Easily switch between backtesting and live trading     | X       |                |
| Specify generic portfolio weighting rule               | X       | X              |
| Automatically load master dataset                       | X       | X              |
| Simulate market data stream                             | X       | X              |
| Build trading engine                                    | X       | X              |
| Model slippage and trading costs                        | X       | X - no slippage |
| Keep track of trades                                    | X       | X              |
| Keep track of cash balances                             | X       | X              |
| Build event-driven while-loop                           | X       | X              |
| Inside loop: dynamically refresh dataset                | X       | X              |
| Inside loop: Eliminate survivorship bias                | X       | X              |
| Inside loop: Eliminate look-ahead bias                  | X       | X              |
| Inside loop: Create target portfolio weights            | X       | X              |
| Inside loop: Track existing portfolio weights           | X       | X              |
| Inside loop: Compute order list                         | X       | X              |
| Inside loop: Submit orders to trading engine            | X       | X              |
| Inside loop: Save successful orders to transaction log  | X       | X              |
| Logging: Create backtest archive directory              | X       | X              |
| Logging: Save transactions                              | X       | X              |
| Logging: Save trades                                    | X       | X              |
| Logging: Save algorithm script                          | X       | X              |
| Logging: Save parameters script                         | X       | X              |
| Evaluation: Compute portfolio returns                   | X       | X              |
| Evaluation: Compare portfolio returns to risk free      | X       | X              |
| Evaluation: Generate backtest report                    | X       | X              |
| Reproducibility: Create Dockerfile and docker image     | X       | X              |
It is recommended that you have at least 4 GB of RAM for data transformation; large datasets will require significantly more.
Software
A Dockerfile and docker image will be made available to facilitate the setup of an environment.
Data
This project presupposes that the user has access to a data vendor subscription. It contains R scripts that can automatically extract relevant data from the Bloomberg BBCOM.exe server. It is anticipated that support will be added for similar automated R-based extraction from DataStream and iNet sources.
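As an indication of what such an extraction involves, a minimal Bloomberg query in R is sketched below using the `Rblpapi` package. This is an assumption for illustration; the repository's own query scripts may use a different package or wrap the calls differently.

```r
library(Rblpapi)

# Illustrative only: fetch daily closing prices from a local Bloomberg terminal.
# The Bloomberg communications process must be running on the same machine.
blpConnect()
prices <- bdh("XYZ SJ Equity",             # placeholder Bloomberg ticker
              "PX_LAST",
              start.date = as.Date("2012-10-01"),
              end.date   = as.Date("2012-12-31"))
head(prices)
```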
Maintenance
Data is currently read from and written to `raw` directories, but a production system should probably talk directly to a native data vendor API.