knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(SwimmeR) library(rvest) library(dplyr) library(ggplot2) library(scales)
SwimmeR
was developed to work with results from swimming competitions. Results are often shared as webpages (.html) or PDF documents, which are nice to read, but make data difficult to access.
SwimmeR
solves this problem by importing & cleaning .html and .pdf files containing swimming results, and returns a tidy dataframe.
Importing is performed by Read_Results
which takes as an argument a file path as file
and a node
(for .html only).
SwimmeR
includes Texas-Florida-Indiana.pdf, results from a tri-meet between the three schools. It can be read in as such:
file_path <- system.file("extdata", "Texas-Florida-Indiana.pdf", package = "SwimmeR") file_read <- Read_Results(file = file_path)
file_read[294:303]
Here we see a subsection of the meet - the top three finishers in the Women's 100 Yard Breaststroke featuring Olympic gold medalist Lilly King.
The next step is to convert this data to a dataframe using Swim_Parse
. Because Swim_Parse
works on text strings it is very sensitive to typos and/or nonstandard naming conventions. "Texas-Florida-Indiana.pdf" has two examples of these potential problems.
The first is that Indiana University
is sometimes entered as Indiana University
, with two spaces between Indiana
and University
. This is a problem because Swim_Parse
will interpret two spaces as a column separator, and will not properly capture Indiana University
(two spaces) as a team name.
The second issue is that Texas
and Florida
are styled as Texas, University of
and Florida, University of
which will cause Swim_Parse
to interpret them as Lastname, Firstname
.
Both of these issues can be fixed with the typo
and replacement
arguments to Swim_Parse
. Elements of typo
will be replaced by the element of replacement
with which they share an index, so all instances of the first element of typo
will be replaced by the first element of replacement
etc. etc. Not specifying typo
or replacement
will not produce an error, but might negatively impact the results. If your results look strange, or are missing values, look for typos related to those swims.
There is a third argument to Swim_Parse
called avoid
which will be addressed in the section on reading in html results below.
df <- Swim_Parse(file = file_read, typo = c("\n", "Indiana University", ", University of"), replacement = c("\n", "Indiana University", ""))
Here are those same Women's 100 Breaststroke results, as a dataframe in tidy format:
df[67:69,]
Please note that SwimmeR
does not capture split times.
Reading html results is very similar to reading pdf results, but a value must be specified to node
, containing which CSS node the Read_Results
should look in for results. Here results from the New York State 2003 Girls Championship meet will be read in, from the "pre" node.
url <- "http://www.nyhsswim.com/Results/Girls/2003/NYS/Single.htm" url_read <- Read_Results(file = url, node = "pre")
url_read[587:598]
Looking at the raw results above one will see that line 2 is a header and contains NY State Rcd:
, showing the New York State record. Lines of this type are a common feature in swimming results, but because they contain a recognizable swimming time, without being a result per say, they can cause problems for Swim_Parse
. Like typos these will not cause an error, but might produce nonsense rows in the resulting dataframe. Swim_Parse
deals with strings that should not be included in results with the avoid
argument. By default avoid
contains a lot of common formulations of these header items under avoid_default
. You can create your own list of strings as pass it to avoid
, or add to avoid_default
via avoid_new <- c(avoid_default, "your string here")
. Avoid
should also include "r\\:"
if your results have reaction times (avoid_default
already includes "r\\:"
).
df_1 <- Swim_Parse(file = url_read, avoid = c("NY State Rcd:"))
df_1[313:315,]
Once results are captured in R as tidy dataframes the real fun can begin - but there's another problem. Times in swimming are recorded as minutes:seconds.hundredth. This is fine when a time is less than a minute, because 59.99
can be of class numeric
in R, but times greater than or equal to a minute 1:00.00
are stuck as class character
. SwimmeR
provides two functions, sec_format
and mmss_format
to convert between times as seconds (for doing math), and times as minutes:seconds.hundredth, for swimming-specific display.
data(King200Breast)
King200Breast
Included in SwimmeR
is King200Breast
, containing all Lilly King's 200 Breaststroke times for her NCAA career. Times recorded as character values, in standard minutes:seconds.hundredth format. We can use sec_format
to format them as seconds, and mmss_format
to go back to minutes:seconds.hundredth. Both functions work well with the tidyverse
packages.
King200Breast <- King200Breast %>% mutate(Time_sec = sec_format(Time), Time_swim_2 = mmss_format(Time_sec)) King200Breast
This is useful for comparing times, or plotting
King200Breast %>% ggplot(aes(x = Date, y = Time_sec)) + geom_point() + scale_y_continuous(labels = scales::trans_format("identity", mmss_format)) + theme_classic() + labs(y= "Time", title = "Lilly King NCAA 200 Breaststroke")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.