
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

gefcom17Model

This package contains the code I used to compete in GEFCom2017, a hierarchical probabilistic load forecasting contest. More details concerning the contest may be found at:

https://en.wikipedia.org/wiki/GEFCom#GEFCom_2017

http://www.drhongtao.com/gefcom/2017

Competing under the nom de plume 'ch46', I finished 10th out of about 70 academic and company teams.

The requirement of this contest was to produce estimated quantiles of electricity demand for the eight load zones in ISO New England (ISONE), as well as two aggregated zones: Massachusetts, and the entire ISO.

Load and weather data came from ISONE, although competitors had the option to supplement them. I chose not to, and competed in the defined-data track.

The evaluation metric was the so-called pinball loss, averaged over all quantiles. It is defined below, where $\tau$ is the target quantile:

$\begin{aligned} F_{\tau}(y,\hat{y}) & = (y - \hat{y})\,\tau & \textrm{if } y \geq \hat{y} \\ & = (\hat{y} - y)(1 - \tau) & \textrm{if } \hat{y} > y \end{aligned}$
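
As a minimal sketch of the definition above (not code taken from this package), the loss can be written in a few lines of base R. The function names `pinball_loss` and `mean_pinball_loss` and the toy numbers are made up for illustration.

```r
# Pinball loss for one target quantile tau, vectorised over y and y_hat.
pinball_loss <- function(y, y_hat, tau) {
  ifelse(y >= y_hat, (y - y_hat) * tau, (y_hat - y) * (1 - tau))
}

# Average the loss over all observations and all quantiles.
# q_hat: matrix with one row per observation and one column per entry of taus.
mean_pinball_loss <- function(y, q_hat, taus) {
  losses <- sapply(seq_along(taus),
                   function(j) pinball_loss(y, q_hat[, j], taus[j]))
  mean(losses)
}

# Toy example (illustrative values only): two observations, three quantiles.
y     <- c(100, 120)
taus  <- c(0.1, 0.5, 0.9)
q_hat <- rbind(c(90, 100, 110),
               c(110, 125, 140))
mean_pinball_loss(y, q_hat, taus)
```

Under-forecasting a high quantile is penalised by $\tau$ and over-forecasting by $1 - \tau$, so a forecast that is too wide or too narrow both raise the average.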

Accurate temperature estimates are a requirement for accurate load estimates. I estimated temperature quantiles by estimating the mean and variance of temperature, where the variance was taken to be the squared residual of the mean prediction:

$\sigma_{temp}^{2} = (temp - \hat{temp})^{2}$

From those predictions I estimated quantiles of temperature, e.g., for the 0.90 quantile:

$q90_{temp} = \hat{\mu}_{temp} + \mathrm{qnorm}(0.90) \cdot \hat{\sigma}_{temp}$
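
The three steps above (mean, squared residuals, normal quantiles) can be sketched in base R roughly as follows. This is an illustration under stated assumptions, not the package's code: `lm` stands in for the boosted models, the data are simulated, and fitting a second model to the squared residuals is one plausible reading of how the variance estimate was obtained.

```r
set.seed(1)

# Simulated hourly temperatures (made-up data, for illustration only).
n    <- 500
hour <- runif(n, 0, 24)
temp <- 50 + 10 * sin(2 * pi * hour / 24) + rnorm(n, sd = 3)
df   <- data.frame(hour = hour, temp = temp)

# Step 1: model the conditional mean (lm is a stand-in for a boosted model).
mean_fit <- lm(temp ~ sin(2 * pi * hour / 24) + cos(2 * pi * hour / 24), data = df)
mu_hat   <- predict(mean_fit, df)

# Step 2: use the squared residuals as the target of a variance model.
df$sq_resid <- (df$temp - mu_hat)^2
var_fit     <- lm(sq_resid ~ sin(2 * pi * hour / 24) + cos(2 * pi * hour / 24), data = df)
# Guard against non-positive variance predictions before taking the root.
sigma_hat   <- sqrt(pmax(predict(var_fit, df), 1e-6))

# Step 3: normal-theory quantiles, mu + qnorm(tau) * sigma.
taus   <- c(0.1, 0.5, 0.9)
temp_q <- sapply(taus, function(tau) mu_hat + qnorm(tau) * sigma_hat)
colnames(temp_q) <- c("q10", "q50", "q90")
head(temp_q)
```

Because `qnorm(0.5)` is zero, the median column equals the mean prediction, which is exactly the mean-equals-median assumption built into this approach.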

This methodology assumes that the conditional mean and the conditional median are the same. More generally, it assumes that temperature is normally distributed. I did not check these assumptions; I was primarily concerned with getting some measure of dispersion. These assumptions may be valid for temperature; they almost certainly aren't for load.

I modeled temperature using xgboost and gamboost. For a still-unknown reason, I got slightly better validation results using gamboost.

To model load I tried a few different algorithms; please see the source code for details. What worked best was to model mean load with xgboost, then feed that model the estimated temperature quantiles to generate load quantiles. That model is specified as 'gt' in the source code.
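
The idea of feeding temperature quantiles through a mean-load model can be sketched as below. This is only an illustration, not the 'gt' model itself: `lm` stands in for xgboost, the training data are simulated, and the `temp_q` values are hypothetical temperature quantiles for three forecast hours.

```r
set.seed(2)

# Simulated warm-season training data (made-up values, for illustration only).
n     <- 400
temp  <- runif(n, 60, 95)
load  <- 1000 + 25 * (temp - 60) + rnorm(n, sd = 40)
train <- data.frame(temp = temp, load = load)

# Mean-load model (lm is a stand-in for the boosted model).
load_fit <- lm(load ~ temp, data = train)

# Hypothetical temperature quantiles for three forecast hours.
temp_q <- data.frame(q10 = c(70, 78, 85),
                     q50 = c(74, 82, 89),
                     q90 = c(78, 86, 93))

# Run the same mean model once per temperature quantile column;
# each column of the result is treated as the matching load quantile.
load_q <- sapply(temp_q, function(tq) {
  predict(load_fit, newdata = data.frame(temp = tq))
})
load_q
```

In this sketch the spread of the load quantiles comes entirely from the spread of the temperature quantiles. If the load response were not monotone in temperature (for example, across heating and cooling seasons), the columns could cross, and sorting each row would be one possible guard.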

I spent many hours trying to find explanatory variables, mostly interaction terms, that would enhance performance. In the end both the temperature and load models contained a modest number of variables. Those specifications may be found in the yaml configuration files in extdata (inst/extdata).

Installation

devtools::install_github('tvwallace/gefcom17model')

Example

library(gefcom17Model)

# Assuming you have a directory structure like mine
gen_cfg = '~/gefcom17/config/gen_and_dates_cfg.yaml'
wx_cfg = '~/gefcom17/config/wx_cfg.yaml'
ld_cfg = '~/gefcom17/config/load_cfg.yaml'

# one time
get_isone_smd_data(2017, 2020, '~/gefcom17/smd_data')
combine_isone_data(gen_cfg)
clean_isone_data(gen_cfg)

# lather, rinse, repeat
model_temperature(wx_cfg, gen_cfg)
model_load(ld_cfg, gen_cfg)