library(nzcrash) library(dplyr) library(tidyr) library(magrittr) library(stringr) library(ggplot2) library(scales) library(lubridate)
This package redistributes crash statistics already available from the New Zealand Transport Agency, but in a more convenient form.
It's a large package (over 20 megabytes, compressed).
The crashes dataset describes most facts about a crash. The datasets causes,
vehicles, and objects_struck describe facts that are in a many-to-one
relationship with crashes. They can be joined to the crashes dataset by the
common id column. The causes dataset can additionally be joined to the
vehicles dataset by the combination of the id and vehicle_id columns.
This is most useful when the resulting table is also joined to the crashes
dataset.
The data was last scraped from the NZTA website on r Sys.Date(). At
that time, the NZTA had published data up to r max(crashes$date).
dim(crashes) dim(causes) dim(vehicles) dim(objects_struck)
The NZTA, doesn't agree with itself about recent annual road tolls, and this dataset gives a third opinion.
crashes %>% filter(severity == "fatal") %>% group_by(year = year(date)) %>% summarize(fatalities = sum(fatalities))
Crashes categorised as "fatal", "serious", "minor" or "non-injury", based on the casualties. If there are any fatalities, then the crash is a "fatal" crash, otherwise if there are any 'severe' injuries, the crash is a "serious" crash.
The definition of a 'severe' injury is not clear.
Minor and non-injury crashes are likely to be under-recorded since they often do not involve the police, who write most of the crash reports upon which these datasets are based.
A common mistake is to confuse the number of fatal crashes with the number of fatalities.
crashes %>% filter(severity == "fatal") %>% nrow sum(crashes$fatalities)
Three columns of the crashes dataset describe the date and time of the crash
in the NZST time zone (Pacific/Auckland).
date gives the date without the timetime gives the time where this is available, and NA otherwise. Times are
stored as date-times on the first of January, 1970.datetime gives the date and time in one value when both are available, and
NA otherwise. date is always available, however time is not.When aggregating by some function of the date, e.g. by year, then always start
from the date column unless you also need the time. This ensures against
accidentally discounting crashes where a time is not recorded.
crashes %>% filter(is.na(time)) %>% count(year = year(date)) %>% ggplot(aes(year, n)) + geom_line() + ggtitle("Crashes missing\ntime-of-day information") crashes %>% filter(is.na(time)) %>% count(year = year(date)) %>% mutate(percent = n/sum(n)) %>% ggplot(aes(year, percent)) + geom_line() + scale_y_continuous(labels = percent) + ggtitle("Percent of crashes missing\ntime-of-day information")
r percent(nrow(filter(crashes, !is.na(easting)))/nrow(crashes)) of
crashes have coordinates. These have been converted from the NZTM projection to
the WGS84 projection for convenience with packages like ggmap.
Because New Zealand is tall and skinny, you can easily spot the main population centres with a simple histogram.
crashes %>% ggplot(aes(northing)) + geom_histogram(binwidth = .1)
There can be many vehicles in one crash, so vehicles are recorded in a separate
vehicles dataset that can be joined to crashes by the common id column.
crashes %>% inner_join(vehicles, by = "id") %>% count(vehicle) %>% arrange(desc(n))
There can be many objects struck in one crash, so these are recorded in a separate
objects_struck dataset that can be joined to crashes by the common id column.
Q: What are more fatal, trees or lamp posts?
crashes %>% inner_join(objects_struck, by = "id") %>% filter(object %in% c("Trees, shrubbery of a substantial nature" , "Utility pole, includes lighting columns") , severity != "non-injury") %>% # non-injury crashes are poorly recorded count(object, severity) %>% group_by(object) %>% mutate(percent = n/sum(n)) %>% select(-n) %>% spread(severity, percent)
A: Trees (Don't worry, I know it's harder than that.)
Causes can be joined either to the crashes dataset (by the common id
column), or to the vehicles dataset (by both of the commont id and
vehicle_id) columns.
The main cause groups are given in the causes_category column.
crashes %>% inner_join(causes, by = "id") %>% group_by(cause_category, id) %>% tally %>% group_by(cause_category) %>% summarize(n = n()) %>% arrange(desc(n)) %>% mutate(cause_category = factor(cause_category, levels = cause_category)) %>% ggplot(aes(cause_category, n)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5))
That's odd -- where are speed, alcohol, and restraints? They're given in cause_subcategory.
causes %>% filter(cause_subcategory == "Too fast for conditions") %>% count(cause) %>% arrange(desc(n))
There's nothing there about speed limit violations, because it's impossible to tell what speed a vehicle was going at when it crashed.
More worryingly, how is "Alcohol test below limit" a cause for a crash? Hopefully they filter those out when making policy decisions.
levels(causes$cause) <- # Wrap facet labels str_wrap(levels(causes$cause), 13) crashes %>% inner_join(causes, by = "id") %>% filter(cause_subcategory %in% c("Alcohol or drugs")) %>% group_by(cause, id) %>% tally %>% group_by(cause) %>% summarize(n = n()) %>% # This extra step deals with many causes per crash arrange(desc(n)) %>% mutate(cause= factor(cause, levels = cause)) %>% ggplot(aes(cause, n)) + geom_bar(stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) rm(causes) # Because we messed up the factor levels
This time, join causes to both vehicles and crashes to assess the
drunken cyclist menace.
crashes %>% filter(severity == "fatal") %>% select(id) %>% inner_join(vehicles, by = "id") %>% filter(vehicle == "Bicycle") %>% inner_join(causes, by = c("id", "vehicle_id")) %>% count(cause) %>% arrange(desc(n))
I think we all know what "Wandering or wobbling" means.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.