race_data | R Documentation
Three weeks of horse race data from tracks worldwide.
data(race_data)
A data.frame object with 36,418 observations and 19 columns.
The columns are defined as follows:
EventId
An integer ID denoting the event (race). These range from 1 to 4486.
TrackId
An integer ID number of the track. There are 64 different tracks represented.
Type
The type of event, one of “Thoroughbred” or “Harness”.
RaceNum
The integer race number within a group of races at a track on a given date.
CorrectedPostTime
The ‘corrected’ post time of the race, in the form %Y-%m-%d %H:%M:%S, presumably in the PDT time zone. Has values like “2019-03-05 02:30:00”.
Yards
The length of the race, in yards.
SurfaceText
A string, one of “Turf”, “Dirt”, “All-Weather” or NA.
HorseName
The string name of the horse.
HorseId
A unique integer ID for each horse. As different horses can have the same name, this ID is constructed from the name of the Horse, the Sire and the Dam.
Age
The age of the horse, in integer years, at the time of the event. Typically less than 10.
Sex
A single character denoting the sex of the horse. I believe the codes are
“M” for “Mare” (female four years or older),
“G” for “Gelding”,
“F” for “Filly” (female under four years of age),
“C” for “Colt” (male under four years of age),
“H” for “Horse” (male four years of age and up),
“R” for “Rig” (a male with an undescended testicle, or one that was incompletely gelded),
“A” for “???”. There are some NA values as well.
Weight_lbs
The weight in integer pounds of the jockey and any equipment. Typically around 120.
PostPosition
The integer starting position of the horse. Typically there is a slight advantage to starting at the first or second post position.
Medication
One of several codes indicating any medication the horse may be taking at the time of the race. I believe “L” stands for “Lasix”, a common medication for lung conditions that is thought to give horses a slight boost in speed.
MorningLine
A double indicating the “morning betting line” for win bets on the horse. It is not clear how to interpret this value; perhaps it is the return on a dollar. Values range from 0.40 to 80.
WN_pool
The total combined pool in win bets, in dollars, on this horse at post time.
PL_pool
The total combined pool in place bets, in dollars, on this horse at post time.
SH_pool
The total combined pool in show bets, in dollars, on this horse at post time.
Finish
The integer finishing position of the horse. A 1 means first place. We only collect values of 1, 2, and 3, while the remaining finishing places are unknown and left as NA.
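As a rough illustration of how a few of these columns might be handled, here is a minimal sketch on a toy data frame. The rows are invented for illustration and are not taken from race_data; the time zone passed to as.POSIXct is an assumption, since the documentation only presumes PDT; and the sex labels simply follow the tentative code list above.

```r
# Toy rows mimicking a few race_data columns (values invented for illustration).
toy <- data.frame(
  EventId = c(1L, 1L, 1L),
  CorrectedPostTime = rep("2019-03-05 02:30:00", 3),
  Yards = c(1760L, 1760L, 1760L),
  Sex = c("M", "G", NA),
  Finish = c(1L, NA, 2L),
  stringsAsFactors = FALSE
)

# Parse the post time. The time zone is an assumption, not a documented fact.
toy$PostTime <- as.POSIXct(toy$CorrectedPostTime,
                           format = "%Y-%m-%d %H:%M:%S",
                           tz = "America/Los_Angeles")

# Convert the race length from yards to furlongs (1 furlong = 220 yards).
toy$Furlongs <- toy$Yards / 220

# Map the single-character sex codes to the labels listed above.
# Codes missing from the lookup (such as "A") and NA values come out as NA.
sex_labels <- c(M = "Mare", G = "Gelding", F = "Filly",
                C = "Colt", H = "Horse", R = "Rig")
toy$SexLabel <- unname(sex_labels[toy$Sex])

# Finish is NA outside the top three, so treat NA as "did not win".
toy$Won <- !is.na(toy$Finish) & toy$Finish == 1L

print(toy[, c("PostTime", "Furlongs", "SexLabel", "Won")])
```

Treating an NA in Finish as a non-win mirrors the coalesce(Finish==1,FALSE) idiom used in the example for this data set.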
The author makes no guarantees regarding correctness of this data.
Steven E. Pav shabbychef@gmail.com
Data were sourced from the web. Don't ask.
library(dplyr)
data(race_data)
# compute win bet efficiency
efficiency <- race_data %>%
  group_by(EventId) %>%
    mutate(ImpliedOdds=WN_pool / sum(WN_pool,na.rm=TRUE)) %>%
  ungroup() %>%
  mutate(OddsBucket=cut(ImpliedOdds,c(0,0.05,seq(0.1,1,by=0.10)),include.lowest=TRUE)) %>%
  group_by(OddsBucket) %>%
    summarize(PropWin=mean(as.numeric(coalesce(Finish==1,FALSE)),na.rm=TRUE),
              MedImpl=median(ImpliedOdds,na.rm=TRUE),
              nObs=n()) %>%
  ungroup()
if (require('ggplot2') && require('scales')) {
  efficiency %>%
    ggplot(aes(MedImpl,PropWin,size=nObs)) +
    geom_point() +
    scale_x_sqrt(labels=percent) +
    scale_y_sqrt(labels=percent) +
    geom_abline(slope=1,intercept=0,linetype=2,alpha=0.6) +
    labs(title='actual win probability versus implied win probability',
         size='# horses',
         x='implied win probability',
         y='observed win probability')
}
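The quantity called ImpliedOdds in the example is each horse's share of the win pool within its event, which the plot labels as an implied win probability. The following base R sketch shows that normalization on hypothetical pool sizes (the dollar amounts are invented, and the column name ImpliedProb is only illustrative).

```r
# Hypothetical win pools for two events (dollar amounts invented for illustration).
pools <- data.frame(
  EventId = c(1L, 1L, 1L, 2L, 2L),
  WN_pool = c(500, 300, 200, 750, 250)
)

# Each horse's share of its event's win pool; ave() returns the per-event
# total aligned with every row, much like a grouped mutate() would.
pools$ImpliedProb <- pools$WN_pool / ave(pools$WN_pool, pools$EventId, FUN = sum)

# The shares within each event sum to one.
event_totals <- tapply(pools$ImpliedProb, pools$EventId, sum)

print(pools)
print(event_totals)
```

Because the shares sum to one within an event, they can be compared directly against observed win frequencies, which is what the bucketed summary in the example does.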