```r
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)

## Set options
options(scipen = 999,             # prevent scientific notation on large numbers
        stringsAsFactors = FALSE) # prevent R from turning everything into factors
```
```r
working_directory <- here::here()
load(file.path(working_directory, "temp/data_report.RData"))

library(tidyverse)
library(lubridate)
library(xts)
library(factoextra)
library(cluster)
library(dygraphs)
library(ggExtra)
```
This report covers the `r data_source` data source and the `r constituent_index` constituent index, over the backtest window from `r start_backtest` to `r end_backtest`.
```r
knitr::kable(metrics_of_interest <- unique(unlist(lapply(ticker_data, function(x) colnames(x)))),
             caption = "Metrics that Appear in Ticker Data",
             col.names = c("Metrics"))
```
```r
knitr::kable(unique(c(volume_data, price_related_data, last_price_field)),
             caption = "Required for Scoring",
             col.names = c("Metrics"))
```
```r
# Get a list of all data types
knitr::kable(unique(datalog$data_type),
             caption = "Data Types in Datalog",
             col.names = c("Data Type"))
```
By design, all data that enters this environment lands in the `datalog` directory. From there, it is deduplicated and placed in a directory structure under the `datasets` directory.
At present, the `datalog` contains `r number_datalog_files` files, of which `r number_feather_files` are `feather` files and `r number_csv_files` are `csv` files. These numbers should be equal if the `csv`-to-`feather` conversion is working as intended.
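That equality can be checked directly. The sketch below counts files by extension in base R; the `datalog` path and the helper are illustrative assumptions, not the backtester's actual configuration:

```r
# Hypothetical sanity check: compare csv and feather file counts in the datalog.
count_by_ext <- function(dir, ext) {
  length(list.files(dir, pattern = paste0("\\.", ext, "$"), recursive = TRUE))
}

datalog_dir <- file.path(here::here(), "datalog")  # assumed location
n_csv       <- count_by_ext(datalog_dir, "csv")
n_feather   <- count_by_ext(datalog_dir, "feather")
if (n_csv != n_feather) {
  warning("csv/feather mismatch: ", n_csv, " csv vs ", n_feather, " feather")
}
```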
The following table provides ticker counts from both the `datalog` and `dataset` directories. These numbers should be identical in each directory. Note that there may be differing numbers of market and fundamental tickers, since some market tickers share an ISIN and, therefore, fundamental data.
|Metric              |Datalog                                      |Dataset                                 |
|--------------------|---------------------------------------------|----------------------------------------|
|Market tickers      |`r nrow(datalog_unique_market_tickers)`      |`r length(dataset_market_tickers)`      |
|Fundamental tickers |`r nrow(datalog_unique_fundamental_tickers)` |`r length(dataset_fundamental_tickers)` |
We now explore the `dataset` directory. To keep this relevant to the current backtest, we look only at `r constituent_index` results from the `r data_source` data source. The loaded dataset is limited to the fields specified by the metrics computed parameter at the bottom of the `parameters` file. If no `fundamental_metrics` or `market_metrics` are specified, all metrics found in the dataset are retained by default.
The following plot shows the percentage of entries in the loaded dataset that are `NA`, broken down by field and month. This gives us a high-level view of the completeness of the dataset and helps us determine a good start and end date for our backtests. We want to include as much blue in our backtest window as possible.

Note that this plot excludes fundamental data. Fundamental data has much lower periodicity, so most of its entries will be `NA`. In practice, these values are backfilled with the last known value after imposing a lag. Backfilling for the purposes of this analysis would not provide any useful information, because a ticker can have a single entry at the beginning of the dataset that fills all subsequent dates; that entry will be stale yet still affect the backtest. Therefore, fundamental dataset quality cannot be assessed by looking at `NA` values each month.
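The lag-then-backfill mechanic can be sketched in a few lines of base R. The helper and the 2-day lag below are illustrative, not the backtester's actual implementation:

```r
# Illustrative only: impose a reporting lag, then carry the last known value forward.
na_locf <- function(x) {            # minimal last-observation-carried-forward helper
  for (i in seq_along(x)[-1]) if (is.na(x[i])) x[i] <- x[i - 1]
  x
}

lag_days <- 2                                         # assumed lag from the parameters file
raw      <- c(10, NA, NA, NA, 20, NA, NA)             # sparse fundamental releases
shifted  <- c(rep(NA, lag_days), head(raw, -lag_days))# releases become visible 2 days later
filled   <- na_locf(shifted)                          # backfill with last known value
# filled: NA NA 10 10 10 10 20
```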
```r
dygraph(monthly_nas_averages * 100, main = "Market Data Monthly NA Averages", ylab = "% NA") %>%
  dyLegend(width = 600, show = "follow") %>%
  dyOptions(maxNumberWidth = 20, stackedGraph = FALSE) %>%
  dyRangeSelector() %>%
  dyHighlight(highlightSeriesOpts = list(strokeWidth = 3))
```
As mentioned above, `NA` values are backfilled at runtime, so it is possible to run a backtest free of look-ahead bias even with spotty data. However, the accuracy of the results may suffer. If the `NA` proportion is unacceptably high, a tester has two options.
There are `r nrow(constituent_tickers)` unique tickers in the index, and `r nrow(metadata_tickers)` metadata tickers in the index. A mismatch between these two figures indicates that some tickers may be dropped during the course of the backtest because of insufficient metadata. To run a quality backtest, we need metadata, market data and fundamental data available for as many constituents as possible. Each rebalancing round, constituents without enough data will be dropped.

A ticker without any metadata will be dropped from the backtest. A ticker with incomplete metadata may be dropped if the missing metadata fields are relevant (e.g., a sector-specific algorithm). A ticker without fundamental data may be dropped from the backtest if fundamental data is used in the backtest.
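The membership checks behind the following tables amount to simple set differences. A toy illustration (the ticker names are made up):

```r
# Toy example of the "constituents without X" checks; names are hypothetical.
constituents  <- c("AAA", "BBB", "CCC", "DDD")  # index members
with_metadata <- c("AAA", "CCC")                # tickers that have metadata

constituents_missing_metadata <- setdiff(constituents, with_metadata)
# constituents_missing_metadata: "BBB" "DDD"
```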
```r
# Elements in constituent tickers but not in metadata
# Note: metadata doesn't have awareness of indexes
knitr::kable(constituents_without_metadata,
             caption = "Constituents without Metadata",
             col.names = c("Ticker"))

knitr::kable(constituents_without_marketdata,
             caption = "Constituents without Market Data",
             col.names = c("Ticker"))

knitr::kable(constituents_without_fundamentals,
             caption = "Constituents without Fundamental Data",
             col.names = c("Ticker"))
```
Some data sources adjust fundamental data for look-ahead bias; others do not. Sometimes adjustment is a parameter set in the user's account, which can change over time on a shared terminal. We therefore try to identify whether fundamental data has been adjusted for look-ahead bias or not.
This situation is complicated by the fact that some data sources will lag some data but not others. Ideally, all daily data that does not need lag adjustment should reside in the `marketdata` datasets, but sometimes this is not possible. For example, Bloomberg market data is tied to the exchange on which a ticker trades, while fundamental data is tied to the ticker's fundamental ticker, which could be on another exchange. Querying daily fundamental data on the market data ticker may return `NA`, even if the data is not lagged. Querying these metrics on the fundamental ticker will yield meaningful data, but at the cost that it will be mixed in with slow-moving financial statement data that requires adjustment.
Plotting the number of occurrences of each metric can tell us if there is a mixed dataset. A mixed dataset will have two obviously different populations.
The plot below shows the silhouette width of a k-means clustering algorithm run over a range of assumed cluster counts. The x-axis value corresponding to the maximum y-axis value is the probable number of clusters.
```r
if (exists("fundamental_metrics_silhouette_plot")) {
  fundamental_metrics_silhouette_plot
}
```
The different populations of metrics are auto-labeled using k-means clustering on observation counts for a recent sample. This backtester assumes there is either one cluster or two. If there is one cluster, the lag operation specified in the `parameters` file is applied to the entire dataset. If there are two clusters, the lag operation is applied only to those metrics in the cluster with the lower observation count.
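A minimal sketch of that decision, using base R's `kmeans` on made-up observation counts (daily metrics report thousands of times in a sample; quarterly metrics only dozens):

```r
# Toy data: observation counts per metric over a recent sample (names are made up).
set.seed(1)
counts <- c(px_last = 2500, volume = 2480, cur_mkt_cap = 2510,   # daily metrics
            eps = 40, revenue = 38, book_val = 41)               # quarterly metrics

km <- kmeans(counts, centers = 2)
low_cluster    <- which.min(km$centers)              # cluster with the lower mean count
metrics_to_lag <- names(counts)[km$cluster == low_cluster]
# metrics_to_lag: "eps" "revenue" "book_val"
```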
```r
if (exists("fundamental_metric_clusters_plot")) {
  fundamental_metric_clusters_plot +
    ylab("Cumulative Number of Occurrences") +
    xlab("Metric") +
    removeGrid()
}
```
The lag adjustment specified in the `parameters` file is applied to the metrics identified above as belonging to the least-occurring cluster. This lag parameter is reckoned in days.
Financial years tend to follow a strong pattern of official releases at the end of the quarter, half-year, or year. We can plot counts of releases of financial information to see if they cluster around these dates. If so, a lag should be imposed. The duration of the lag requires a judgment call from the tester; this lag is set in the `parameters` file.
The most frequently occurring dates in the lag metrics dataset are shown below. These figures are derived from the raw data and will not adjust to the lag parameter in the `parameters` file.
```r
if (exists("most_frequent_lag_dates")) {
  knitr::kable(most_frequent_lag_dates,
               caption = "Top 7 Frequency Counts per Date Among Lag Metrics",
               col.names = c("Date", "Occurrences"))
}
```
If there are any lag-adjusted metrics in the metric-filtered dataset, the most frequently occurring dates among the lag-adjusted metrics are shown below. There should be a direct relationship between these dates and the dates above (i.e., each date above should be `r fundamental_data_lag_adjustment` days ahead of the dates below).
```r
if (exists("lag_adjusted_date_counts")) {
  knitr::kable(lag_adjusted_date_counts,
               caption = "Top 7 Frequency Counts per Date Among Lag-Adjusted Metrics, 10% Sample",
               col.names = c("Date", "Occurrences"))
}
```
Overall completeness checks are a good first pass, but they do not tell us how healthy each ticker's data is. Tickers can exist in the dataset but have such poor information that they are excluded at runtime because an appropriate weight (or price) cannot be assigned to them.
To assess dataset quality, we investigate what percentage of the entries in a dataset are missing. That is, in a dataset of dimensions `m x n`, we compute what percentage of the `m x n` entries are `NA`.
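For a concrete example of the measure on a toy matrix (base R):

```r
# 2 x 3 toy dataset with two missing entries.
m <- matrix(c(1, NA, 3, NA, 5, 6), nrow = 2)
pct_missing <- mean(is.na(m)) * 100   # share of the m x n entries that are NA
# pct_missing: 33.33 (2 of 6 entries)
```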
Approximately `r percent_missing_metadata`% of the metadata dataset is missing. Tickers missing from the metadata dataset are detailed above. The fields that contain `NA` values are as follows.
```r
knitr::kable(colnames(incomplete_metadata),
             col.names = c("Fields containing NA"))
```
Missing metadata is not necessarily a problem, as it is sometimes not needed for a backtest. However, if a critical field such as the `fundamental_identifier` is missing, then tickers missing that data will be dropped from the backtest.
We need to check the health of each ticker's dataset and create a score. Then we can compare across stocks to get a view of our total dataset health.
Things we are concerned with:
```r
knitr::kable(round(all_quantiles, 3),
             caption = "NA Quantiles for all Market Related Metrics",
             col.names = "Proportion")
```
75% of the tickers in the dataset have about 1 `NA` per `r 1/as.numeric(all_quantiles[4])` entries in each field group. The worst ticker has `r as.numeric(all_quantiles[5])*100`% of its entries missing for a field group.
```r
knitr::kable(na_scores %>% select(-ticker) %>% colMeans() %>% round(., 3),
             caption = "Proportion of Metrics that are NA",
             col.names = "Proportion")
```
We plot histograms of `NA` occurrences per ticker per group. To aid readability, we clip these to the lower 90% of the ticker population.
```r
library(gridExtra)
library(grid)

na_clip      <- quantile(as.matrix(na_scores %>% select(-ticker)), c(0.90))
hist_scale   <- round(na_clip, 3)
score_metric <- colnames(na_scores)[-length(colnames(na_scores))]

h1 <- na_scores %>% ggplot() + aes(`Overall Market Fields`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))
h2 <- na_scores %>% ggplot() + aes(`Algo Market Fields`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))
h3 <- na_scores %>% ggplot() + aes(`Price Related Fields`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))
h4 <- na_scores %>% ggplot() + aes(`Overall Market Fields`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))
h5 <- na_scores %>% ggplot() + aes(`Volume`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))
h6 <- na_scores %>% ggplot() + aes(`Last Price`) +
  geom_histogram(breaks = seq(0, hist_scale, by = hist_scale / 20))

grid.arrange(h1, h2, ncol = 2,
             top = textGrob("Histogram of NA proportion per ticker, clipped to 90th Percentile",
                            gp = gpar(fontsize = 14)))
grid.arrange(h3, h4, ncol = 2,
             top = textGrob("Histogram of NA proportion per ticker, clipped to 90th Percentile",
                            gp = gpar(fontsize = 14)))
grid.arrange(h5, h6, ncol = 2,
             top = textGrob("Histogram of NA proportion per ticker, clipped to 90th Percentile",
                            gp = gpar(fontsize = 14)))

knitr::kable(bottom_10_stats %>% round(., 3),
             caption = "Proportion of Metrics that are NA (Worst 10%)",
             col.names = "Proportion")
```
Note that these statistics do not correct for the fact that a ticker may be delisted. If a ticker delists, its entries will return `NA`. Computing `NA` scores adjusted for period-on-period exclusion requires looping these computations over each date in the constituent lists.
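A sketch of what that loop could look like; the data layout and names here are assumptions, not the report's actual objects. The idea is to score each rebalance date using only that date's constituents:

```r
# Hypothetical membership-adjusted NA score: for each rebalance date, look only
# at the tickers that were actually in the index on that date.
score_by_date <- function(mat, constituent_lists) {
  sapply(names(constituent_lists), function(d) {
    members <- constituent_lists[[d]]
    mean(is.na(mat[d, members, drop = FALSE]))
  })
}

# Toy data: rows are rebalance dates, columns are tickers.
mat <- matrix(c(1, NA, 3, 4), nrow = 2,
              dimnames = list(c("2020-01-31", "2020-02-29"), c("AAA", "BBB")))
members <- list("2020-01-31" = c("AAA", "BBB"),  # both in the index
                "2020-02-29" = c("AAA"))         # BBB dropped out
score_by_date(mat, members)
# 2020-01-31: 0 (no NAs among members); 2020-02-29: 1 (AAA is NA)
```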
The price-related, volume and last-price fields can be imputed to some extent. But the algo market fields cannot, and this will result in tickers being dropped from the portfolio computations.
Raw data may not be adjusted for splits and other corporate actions.
Fundamental data is only released on a periodic basis, so we cannot expect to have data entries every day. Fundamental data that is put into a daily `dataframe` structure will have many `NA` entries. Our dataset creation algorithm backfills these `NA` values with the last known value when the data is loaded into memory.
```r
complete_market_field_list
complete_fundamental_field_list
```