library(RcappeR)
This vignette walks through the use of zipf_init, which initialises a handicap using a collection of races, in this case the gulfstream dataset (also used in the Data Preparation vignette). Certain steps are necessary before using zipf_init. These are covered in the Data Preparation vignette, but are repeated here.
Load the dataset:
data(gulfstream)
The gulfstream dataset contains `r length(unique(paste(gulfstream$date, gulfstream$race, sep = "_")))` unique races. This is not a huge number; the more races the better. A look at the structure of gulfstream:
str(gulfstream)
In order to use some of the more complex functions (zipf_init, zipf_hcp) a certain amount of preparation is required. There are a number of variables needed for handicapping, these are:
The variables above should be pretty common in a racing dataset that you wish to calculate ratings from. The gulfstream dataset has almost all of the above. Individual final times for horses might be a hurdle, but lengths beaten is a much more common variable and, as covered in the Data Cleaning vignette, the conv_margins function can convert lengths beaten into final times.
The gulfstream dataset lacks a unique race id, but one can be created by concatenating the date and race variables. If a dataset contains races at more than one racecourse, it would be wise to include the track as well: two races cannot be run at the same track, on the same day, at the same time. Let's create a variable called date_race:
gulfstream$date_race <- paste(gulfstream$date, gulfstream$race, sep = "_")
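If a dataset covered several racecourses, the same idea extends by adding a track identifier to the id. The sketch below uses a small made-up data frame; the track column and its values are hypothetical and are not part of the gulfstream dataset.

```r
# Toy data: two tracks each running a "race 1" on the same date.
# The `track` column and all values are made up for illustration.
races <- data.frame(
    track = c("GP", "GP", "AQU", "AQU"),
    date  = c("01/01/13", "01/01/13", "01/01/13", "01/01/13"),
    race  = c(1, 1, 1, 1),
    stringsAsFactors = FALSE
)

# date and race alone cannot separate the two tracks
date_race <- paste(races$date, races$race, sep = "_")

# adding the track makes each race id unique again
track_date_race <- paste(races$track, races$date, races$race, sep = "_")

n_ids_without_track <- length(unique(date_race))
n_ids_with_track <- length(unique(track_date_race))
```

Comparing the two counts is a quick check that an id really does separate every race before it is passed on as race_id.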
The date_race variable was the only one missing from the above list, but before handicapping can begin we need to calculate margins between horses that take into account the following:
The best (imo) way to do this is to use the dplyr package, together with the %>% pipe from magrittr, to calculate the necessary variables. The code below processes the gulfstream dataset, creating those variables. It is explained in more detail below the code. The functions used from RcappeR are btn_sec, lbs_per_sec and diff_at_wgts:
library(dplyr)

# work race by race, then build the margin variables needed by zipf_init
new_gulfstream <- gulfstream %>%
    group_by(date_race) %>%
    mutate(btn_sec = btn_sec(fintime),
           scale = lbs_per_sec(dist = dist, surf = "dirt"),
           btn_lbs = scale * btn_sec,
           diff_wgts = diff_at_wgts(btn_lbs = btn_lbs, wgt_carried = wgt))
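To make the steps in the pipeline more concrete, here is a base R sketch of the same sequence for a single made-up race. It is not the RcappeR implementation: the scale of 15 lbs per second is an arbitrary stand-in for whatever lbs_per_sec returns, and the weight adjustment (relative to the weight carried by the winner) is an assumption about how a difference at the weights could be computed.

```r
# One made-up race: final times (seconds) and weights carried (lbs),
# listed in finishing order
fintime <- c(70.0, 70.4, 71.0)
wgt     <- c(120, 118, 122)

# beaten seconds: time behind the fastest horse in the race
beaten_sec <- fintime - min(fintime)

# ASSUMPTION: an arbitrary lbs-per-second scale, for illustration only
assumed_scale <- 15
beaten_lbs <- beaten_sec * assumed_scale

# ASSUMPTION: adjust margins for weight carried relative to the winner,
# so a horse beaten while carrying less weight is rated further behind
winner_wgt <- wgt[which.min(fintime)]
diff_at_weights <- beaten_lbs + (winner_wgt - wgt)
```

In this toy race the runner-up finished 6 lbs behind on the assumed scale but carried 2 lbs less than the winner, so the adjusted margin grows to 8 lbs; the third horse carried 2 lbs more, so its margin shrinks from 15 to 13 lbs.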
At this stage, the processed gulfstream dataset can be entered into zipf_init. First, a word about the methodology for initialising the handicap.
The handicapping methodology uses a version of race standardisation first explained by Simon Rowlands, Head of Research at Timeform, specifically using Zipf's Law (hence the names of this family of functions, see also ?zipf_race and ?zipf_hcp).
Race standardisation looks at races of similar class/type and assesses the performance of one winner by comparing it with the performances of winners in different, but similar, races. A more detailed explanation can be found in the Zipf Race vignette, which walks through a simple example using the zipf_race function, which is called by zipf_init (and zipf_hcp).
Race standardisation uses past ratings from similar types/classes of race to assess a new race, but when initialising a handicap there are no past ratings. So the zipf_init function groups races together and assesses performances using margins between horses: the diff_wgts variable created above. This process builds a skeleton handicap, from which further handicapping can, and should, be undertaken.
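As a rough illustration of what Zipf's Law contributes, the sketch below weights each runner by the reciprocal of its finishing position, so the winner counts fully, the second half as much, and so on. This is an assumption about the general weighting idea, not a copy of what zipf_race or zipf_init do internally, and the margins are made up.

```r
# Finishing positions and made-up margins (in lbs) behind the winner
pos     <- 1:5
margins <- c(0, 3, 5, 9, 14)

# Zipf-style weights: 1, 1/2, 1/3, ... by finishing position
zipf_weights <- 1 / pos

# A weighted average leans towards the horses nearer the front
weighted_margin   <- weighted.mean(margins, w = zipf_weights)
unweighted_margin <- mean(margins)
```

Because the weights fall away with finishing position, well-beaten horses have less influence on the result than they would in a plain average.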
Below is a simple table explaining the various inputs to zipf_init:
param | details | example input
------|---------|---------
races | a dataframe of races | new_gulfstream
group_by | name(s) of variables to group races by | "race_type" (could also include value)
race_id | name of variable to identify the unique races in the races dataframe | "date_race"
btn_var | name of variable containing margins between horses in races dataframe | "diff_wgts"
.progress | plyr's progress bar, useful when using on large datasets (>20k rows) as the function takes time to run | "text"
So:
# time the call so the run time can be reported below
start.time <- Sys.time()
our_hcp <- zipf_init(races = new_gulfstream, group_by = "race_type", race_id = "date_race", btn_var = "diff_wgts", .progress = "text")
end.time <- Sys.time()
time.taken <- end.time - start.time
This small example, handicapping `r length(unique(new_gulfstream$date_race))` races, split into `r length(unique(new_gulfstream$race_type))` different race types (`r unique(new_gulfstream$race_type)`), took `r time.taken` seconds.
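One caveat when timing a call this way: subtracting two Sys.time() values returns a difftime whose units are chosen automatically, so a longer run could be reported in minutes rather than seconds. A small sketch that forces seconds, with Sys.sleep standing in for the call to zipf_init:

```r
start_time <- Sys.time()
Sys.sleep(0.1)  # stand-in for a longer-running call
end_time <- Sys.time()

# ask for seconds explicitly rather than relying on automatic units
time_taken <- difftime(end_time, start_time, units = "secs")
elapsed_secs <- as.numeric(time_taken)
```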
The output from zipf_init is a list (of class "rcapper_zipf_init"). There are print and summary methods for this class of object (though both do the same):
our_hcp
summary(our_hcp)
There is also a plot method, perhaps the most useful, which plots the distribution of ratings for each group. As the plot below shows, the small samples in a couple of the race types show the need for more races, or at least for making sure groups are of a decent size.
plot(our_hcp)
The plot shows the distribution of ratings (in lbs) for the winners of the `r length(unique(new_gulfstream$date_race))` races in the new_gulfstream dataset. The mean will always be around 0, for all race types. The next step is to assign a standard rating for a winner of each type/class of race. These standards should reflect the difference in ability (in lbs) between the different race types, so the standard rating for a Grade 1 winner is going to be far greater than that for a Maiden winner. What these differences are is unknown; I am working on a solution to help find them.
One possible solution is to use ratings from other handicappers to guide this process, for example Timeform (including Timeform US) or Beyer class pars.
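Once standards are chosen, applying them amounts to shifting each zero-centred rating by the standard for its race type. The sketch below shows that step on made-up data; the race type labels, the ratings and the standards of 70 and 100 lbs are all hypothetical, and real standards would have to come from a source such as those mentioned above.

```r
# Zero-centred winner ratings per race type (made-up values)
winners <- data.frame(
    race_type = c("mdn", "mdn", "stk", "stk"),
    zipf_rtg  = c(-4, 4, -2, 2),
    stringsAsFactors = FALSE
)

# HYPOTHETICAL standards, in lbs, one per race type
standards <- c(mdn = 70, stk = 100)

# shift each winner's rating by the standard for its race type
winners$rating <- winners$zipf_rtg + unname(standards[winners$race_type])

# the mean rating per race type now sits at the chosen standard
type_means <- tapply(winners$rating, winners$race_type, mean)
```

The spread of ratings within each race type is unchanged; only the level moves, which is why the choice of standards matters so much.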
Finally, the merge_zipf_init function merges the ratings from zipf_init with the dataset used to calculate them. The code below then prints the first 15 rows, showing the variables created in this vignette and the zipf_rtg for runners:
initial_hcp <- merge_zipf_init(zipf_list = our_hcp, races = new_gulfstream, btn_var = "diff_wgts")

# Let's have a look at the first few rows of our skeleton handicap
initial_hcp %>%
    select(race_type, date_race, pos, fintime, btn_sec:zipf_rtg) %>%
    head(15)