race_data: Horse Race Data

Description

Three weeks of horse race data from tracks worldwide.

Usage

data(race_data)

Format

A data.frame object with 36,418 observations and 19 columns.

The columns are defined as follows:

EventId

An integer ID denoting the event (race). These range from 1 to 4486.

TrackId

An integer ID number of the track. There are 64 different tracks represented.

Type

The type of event, one of “Thoroughbred” or “Harness”.

RaceNum

The integer race number within a group of races at a track on a given date.

CorrectedPostTime

The ‘corrected’ post time of the race, in the form %Y-%m-%d %H:%M:%S, presumably in the PDT time zone. Has values like “2019-03-05 02:30:00”.

Yards

The length of the race, in yards.

SurfaceText

A string, one of “Turf”, “Dirt”, “All-Weather” or NA.

HorseName

The string name of the horse.

HorseId

A unique integer ID for each horse. As different horses can have the same name, this ID is constructed from the names of the horse, its sire and its dam.

Age

The age of the horse, in integer years, at the time of the event. Typically less than 10.

Sex

A single character denoting the sex of the horse. I believe the codes are “M” for “Mare” (female four years or older), “G” for “Gelding”, “F” for “Filly” (female under four years of age), “C” for “Colt” (male under four years of age), “H” for “Horse” (male four years of age and up), “R” for “Rig” (hard to explain), “A” for “???”. There are some NA values as well.

Weight_lbs

The weight in integer pounds of the jockey and any equipment. Typically around 120.

PostPosition

The integer starting position of the horse. Typically there is a slight advantage to starting at the first or second post position.

Medication

One of several codes indicating any medication the horse may be taking at the time of the race. I believe “L” stands for “Lasix”, a common medication for lung conditions that is thought to give horses a slight boost in speed.

MorningLine

A double indicating the “morning betting line” for win bets on the horse. It is not clear how to interpret this value; perhaps it is the return on a dollar. Values range from 0.40 to 80.

WN_pool

The total combined pool in win bets, in dollars, on this horse at post time.

PL_pool

The total combined pool in place bets, in dollars, on this horse at post time.

SH_pool

The total combined pool in show bets, in dollars, on this horse at post time.

Finish

The integer finishing position of the horse. A 1 means first place. We only collect values of 1, 2, and 3, while the remaining finishing places are unknown and left as NA.
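
A few of the columns above need light preparation before analysis. The following is a minimal sketch in base R on a toy data frame, not on race_data itself: the rows and values are invented for illustration, the derived column names (PostTime, Furlongs, Won, WinShare) are hypothetical, and the time zone is an assumption based on the "presumably PDT" remark for CorrectedPostTime.

```r
# Toy rows that mimic a few race_data columns; all values are invented.
toy <- data.frame(
  EventId = c(1L, 1L, 1L, 2L, 2L),
  CorrectedPostTime = c(rep("2019-03-05 02:30:00", 3),
                        rep("2019-03-05 03:05:00", 2)),
  Yards = c(1320, 1320, 1320, 1760, 1760),
  WN_pool = c(500, 300, 200, 900, 100),
  Finish = c(1L, NA, 2L, NA, 1L),
  stringsAsFactors = FALSE
)

# Parse the post time string. The time zone is an assumption, not a
# documented property of the data.
toy$PostTime <- as.POSIXct(toy$CorrectedPostTime,
                           format = "%Y-%m-%d %H:%M:%S",
                           tz = "America/Los_Angeles")

# Convert race length to furlongs (one furlong is 220 yards).
toy$Furlongs <- toy$Yards / 220

# Finish is NA outside the top three, so treat NA as "did not win".
toy$Won <- !is.na(toy$Finish) & toy$Finish == 1L

# Share of each event's win pool that was bet on each horse.
toy$WinShare <- toy$WN_pool / ave(toy$WN_pool, toy$EventId, FUN = sum)
```

Treating a missing Finish as a non-win is safe for win bets, because first place is always recorded. The same shortcut would not work for questions about fourth place or worse, since those positions are not in the data.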

Note

The author makes no guarantees regarding correctness of this data.

Author(s)

Steven E. Pav shabbychef@gmail.com

Source

Data were sourced from the web. Don't ask.

Examples

library(dplyr)
data(race_data)

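# compute win bet efficiency: compare the share of each event's win pool
# bet on a horse with how often horses in that bucket actually won
efficiency <- race_data %>%
  group_by(EventId) %>%
    # fraction of the event's win pool bet on this horse
    mutate(ImpliedOdds=WN_pool / sum(WN_pool,na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(OddsBucket=cut(ImpliedOdds,c(0,0.05,seq(0.1,1,by=0.10)),include.lowest=TRUE)) %>%
  group_by(OddsBucket) %>%
    # Finish is NA outside the top three, so NA counts as a non-win
    summarize(PropWin=mean(as.numeric(coalesce(Finish==1,FALSE)),na.rm=TRUE),
              MedImpl=median(ImpliedOdds,na.rm=TRUE),
              nObs=n()) %>%
  ungroup()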


if (require('ggplot2') && require('scales')) {
  efficiency %>%
    ggplot(aes(MedImpl,PropWin,size=nObs)) + 
    geom_point() + 
    scale_x_sqrt(labels=percent) + 
    scale_y_sqrt(labels=percent) + 
    geom_abline(slope=1,intercept=0,linetype=2,alpha=0.6) + 
    labs(title='actual win probability versus implied win probability',
         size='# horses',
         x='implied win probability',
         y='observed win probability')
}

ohenery documentation built on Oct. 30, 2019, 9:53 a.m.