In joranE/fispdfparsr: Parse FIS Race Result PDFs

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README/README-fig",
  cache = FALSE,
  cache.path = "README/README-cache-"
)

fispdfparsr

fispdfparsr is a collection of utilities for parsing PDF cross-country race results from FIS. The focus initially is results from World Cups from recent seasons.

It has seen extremely limited testing and will surely break on some PDFs, since FIS likes to monkey with their formatting pretty regularly. If you find a PDF where it breaks, please file an issue and include a link to the offending PDF.

For some reason, some PDFs are just parsed as garbage, and I haven't quite figured out why.

Installation

fispdfparsr uses the tabulizer package to do the acutal PDF parsing, which is currently available only on GitHub, not CRAN. Follow the installation instructions there first, which may involved installing some Java dependencies.

Once you have installed tabulizer, you can install fispdfparsr via:

install.packages("devtools")
devtools::install_github("joranE/fispdfparsr")

Usage

Once the package and its dependencies are installed, we can use it as follows:

library(fispdfparsr)
#Some example PDFs included in fispdfparsr
dst_pth <- system.file("example_pdfs/dst_example1.pdf",package = "fispdfparsr")
spr_pth <- system.file("example_pdfs/spr_example1.pdf",package = "fispdfparsr")

Distance

Note that for distance races we need to explicitly provide the race distance, since it isn't easily retreivable from the PDF. Be aware that I've found the parser to be somewhat slow, so be patient.

dst <- parse_dst_pdf(file = dst_pth,race_distance = 15)
dst

The result is a data.frame (specifically a tbl_df) with one row for each athlete at each split. This means that the values in the first 6 columns are repeated for each athlete for each split. The columns split_time, split_rank and split_time_back are specific to each split for each athlete.

The package includes a function to produce some simple plots. Note that choosing type = "time" or type = "percent" will likely lead to lots of overplotting of athlete's names. There is only so much space on a plot.

require(ggplot2)
p <- dst_split_plot(data = dst,type = "percent",nation_col = "SWE")
print(p)

require(ggplot2)
p <- dst_split_plot(data = dst,type = "rank",name_col = c("HOFFMAN Noah","HARVEY Alex"))
print(p)

Sprint

The sprint race plots focus on each round of heats. Parsing the sprint PDFs is a bit trickier, since the Java PDF parser we're relying on will silently omit blank table columns. This means that if no one in the final came from QF3, that column will be blank and the parser will simply drop it.

parse_spr_pdf tries to detect when this happens and prompts you to supply the index for where to insert a blank column. It should only be an issue for the final and semifinal sections of the PDF. However, this does mean that this function generally can only be run interactively and with some supervision.

spr_example1 <- parse_spr_pdf(file = spr_pth)

The result should look like this:

data(spr_example1)
spr_example1

...a data.frame with one row for each athlete for each heat. The heats are labeled with abbreviations like qf1 or sf2.

The sprint plots are a little different and focus on the progression through each round of heats, grouped by Qualification, Quarterfinal, Seminfinal and Final. Times and ranks in each group are plotted across heats (i.e. all quarter final times are compared to each other) but they are annotated so you can tell the difference between QF1, QF2, etc.

require(ggplot2)
p <- spr_heat_plot(data = spr_example1,type = "centered",nation_col = "SWE")
print(p)

require(ggplot2)
p <- spr_heat_plot(data = spr_example1,
                   type = "rank",
                   name_col = c("FALK Hanna","HAGA Ragnhild"))
print(p)