Darko Bergant August 2017
pre code { font-size: 0.8rem; } code { line-height: 1.1; }This tutorial is an overview of
trafficaccidents
data package.
The package contains information about traffic accidents in Slovenia examined by
the police from 2005 to 2016.
Original files^[The original data is published on
http://www.policija.si/]
were converted to R data frames without changes, except for:
The tutorial was created using R language [@R-base] and several R extensions. See the list of all used packages at the end of the document.
Note that this is not a thorough analysis of traffic accidents. The only goal is to bring the data and visualization tools to a larger group of users.
Information about accidents is stored in three data frames:
event
contains accident events with date, time, position and other
accident attributes. party
contains attributes of the parties involved.calendar
contains all dates from
2005 to 2016.Data structure
The majority of accidents
(93%)
are geocoded. Use pos_x
and pos_y
columns for geographic location.^[
Columns pos_x
and pos_y
are geocoded in D48/GK coordinate reference system.
Use columns lon and lat for WGS84]
For example, we can plot all accidents locations as points, coloured by
road category (road_type
):
library(dplyr)
library(ggplot2)
library(viridis)
library(trafficaccidents)
dat_roads <-
event %>%
filter(!is.na(road_type), !is.na(pos_x)) %>%
arrange(desc(road_type))
ggplot(data = dat_roads, aes(pos_y, pos_x, color = road_type) ) +
geom_point(alpha = 0.03, size = 0.2) +
coord_fixed() +
scale_color_viridis(discrete = TRUE, end = 0.7, option = "A") +
guides(colour = guide_legend(
override.aes = list(alpha = 1, size = 2), title = "Road category")) +
theme_void() +
theme(legend.position = c(0.95,0.3))
Traffic accidents locations by road category
Working days have different time distribution than weekends and holidays.
days_count <-
calendar %>%
group_by(year, working_day) %>%
summarise(days = n())
accidents_per_hour <-
event %>%
left_join(calendar, by = "date") %>%
group_by(year, working_day, hour) %>%
count %>%
left_join(days_count, by = c("year", "working_day")) %>%
mutate(p = n / days)
ggplot(accidents_per_hour, aes(x = hour, y = p)) +
geom_point(alpha = 0.2, size = 0.2, position = position_jitter(0.1)) +
geom_smooth(method = "loess", span = 1/3, color = "gray") +
scale_x_continuous(breaks = 1:7*3, name = "Hour", minor_breaks = NULL) +
scale_y_continuous(limits = c(0, NA), breaks = 0:6 * 2,
name = "Average number of accidents") +
coord_cartesian(y=c(0, 9)) +
facet_wrap(~ working_day, ncol = 1, scales = "fixed") +
theme_minimal()
With "joy plot" from ggjoy
package [@R-ggjoy] we can
discover the differences of time distributions between road categories.
The peaks at rush hour are more visible on roads than on city streets.
Time distribution of accidents on working day
library(ggjoy)
dat_rt_wd <-
event %>%
left_join(calendar, by = "date") %>%
filter(
working_day == "Working day",
!is.na(road_type)
)
ggplot(dat_rt_wd) +
geom_joy(aes(x = hour, y = road_type, fill = road_type), alpha = 0.5 ) +
scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL,
limits = c(0,23), name = "Hour") +
scale_fill_viridis(discrete = TRUE, end = 0.7, option = "A")+
scale_y_discrete(limits = rev(levels(dat_rt_wd$road_type)), name = NULL)+
theme_minimal() +
theme(legend.position = "none")
With animation it is possible to see accident locations by hour. The following animation was created with gganimate package [@R-gganimate].
event_wd <-
event %>%
filter(!is.na(pos_x)) %>%
mutate(alpha = 0.2) %>%
left_join(calendar, by = "date") %>%
filter(working_day == "Working day")
shift_hour <- function(x, n, a) {
mutate(x, alpha = a, hour = (hour + n) %% 24)
}
dat_wd_shift <-
bind_rows(
event_wd, shift_hour(event_wd, 1, 0.1), shift_hour(event_wd, 2, 0.07)
) %>%
arrange(hour, desc(road_type))
clock <- list(x = 378938 + 30000, y = 191590 - 10000, width = 8500)
p1 <-
ggplot(data = dat_wd_shift, aes(frame = hour)) +
geom_point(aes(pos_y, pos_x, alpha = alpha, color = road_type), size = 0.4) +
scale_alpha_identity() +
scale_color_viridis(discrete = TRUE, end = 0.7, option = "A", guide = "none",
limits = levels(dat_wd_shift$road_type)) +
coord_fixed() +
geom_text(aes(label = sprintf("%02d", hour)), color = "gray",
x = clock$x, y = clock$y, size = 9) +
geom_segment(size = 2, aes(
x = clock$x + clock$width * sin(hour/6 * pi),
y = clock$y + clock$width * cos(hour/6 * pi),
xend = clock$x + clock$width*1.4 * sin(hour/6 * pi),
yend = clock$y + clock$width*1.4 * cos(hour/6 * pi),
color = levels(dat_wd_shift$road_type)[as.integer(hour >= 12)*8+1]
)) +
theme_void()
library(gganimate)
library(animation)
magickPath <- shortPathName("magick.exe")
ani.options(convert=magickPath)
#ani.options(ani.width = 1024, ani.height = 640+55)
ani.options(ani.width = 1024*1.3, ani.height = (640+55)*1.3)
ani.options(interval = 0.1)
gganimate(p1, "docs/animation/accidents_animation.gif", title_frame = FALSE )
All accidents are marked with injury severity (column injury
). The plot
below compares time distributions of accidents for different levels of
severities.
Distribution of accidents over time
dat <-
event %>%
left_join(calendar, by = "date")
ggplot(dat) +
geom_joy(aes(x = hour, y = injury, alpha = injury), fill = "darkred") +
scale_x_continuous(breaks = 1:7*3, minor_breaks = NULL,
limits = c(0,23), name = "Hour") +
scale_y_discrete(limits = rev(levels(dat$injury)), name = NULL)+
scale_alpha_discrete(range = c(0.06, 0.7), guide = "none") +
theme_minimal() +
facet_wrap(~working_day, ncol = 1)
Police classifies each accident by main cause. The plot below represent the number of accidents as a circle size and injury severity as color opacity for different accident causes:
dat_type <-
event %>%
group_by(cause, injury) %>%
count
ggplot(dat_type, aes(x = 1, y=1, size = sqrt(n) / 7, alpha = injury)) +
scale_size_identity() +
geom_point(color = "darkred") +
scale_y_discrete(limits = rev(levels(dat_type$cause))) +
coord_fixed(0.7) +
scale_alpha_discrete(range = c(0.06, 0.7)) +
guides(alpha = guide_legend(override.aes = list(size = 6))) +
theme_void(base_size = 12) +
#theme(strip.text = element_text(size = 10)) +
labs(x = NULL, y = NULL, alpha = "Severity") +
facet_wrap(~cause, ncol = 3)
The accident_type
column defines what happened (e.g. side collision or vehicle rollover).
The plot below explores its relation to accident cause.
dat_cause_type <-
event %>%
group_by(cause, event_type, injury) %>%
count
ggplot(dat_cause_type,
aes(x = event_type, y=cause, size = sqrt(n) / 7, alpha = injury)) +
geom_point(color = "darkred") +
scale_size_identity() +
scale_alpha_discrete(range = c(0.06, 0.7)) +
guides(alpha = guide_legend(override.aes = list(size = 10))) +
scale_x_discrete(position = "top") +
scale_y_discrete(limits = rev(levels(dat_cause_type$cause))) +
coord_fixed() +
theme_minimal() +
theme(
axis.text.x = element_text(angle=90,hjust = 0, vjust = 0.3, size = 12),
axis.text.y = element_text(size = 12),
legend.position = "bottom"
) +
labs(x = "Accident type", y = "Cause", alpha = "Severity")
There can be several parties involved in same accident. The data
includes also some cases where there are several parties at fault for the
same accident.
Here the events_party
data frame is created with single
event and some attributes from one party at fault.
party_at_fault <-
party %>%
filter(accident_role == "At fault")
other_party <-
party %>%
filter(accident_role != "At fault")
events_party <-
party_at_fault %>%
group_by(src_file, event_id) %>%
summarise(
age = first(age),
gender = first(gender),
alco = first(alcotest) > 0,
experience = first(experience_y) + first(experience_m)/12,
traffic_role = first(traffic_role)
)
events_party %>%
group_by(traffic_role) %>%
summarise(accidents = n()) %>%
arrange(desc(accidents)) %>%
head(8) %>%
kable
traffic_role accidents
Car driver 192340 Truck driver 24174 Cyclist 7862 Motorcycle driver 5290 Motorized bicycle driver 3344 Bus driver 1704 Pedestrian 1637 Tractor driver 1492
Most frequent parties at fault are car drivers. Is it true, that there are more accidents involving young drivers, driving at night?
hour_intervals <- setNames(
c(0, 6, 22),
c("Night (22h - 06h)", "Day (6h-21h)", "Night (22h - 06h)")
)
events_party_car_drivers <-
events_party %>%
filter(traffic_role %in% c("Car driver")) %>%
filter(age >= 18) %>%
inner_join(event, by = c("src_file", "event_id")) %>%
inner_join(calendar, by = "date") %>%
mutate(
hour_interval = findInterval(hour, hour_intervals),
hour_interval = names(hour_intervals)[hour_interval],
age = round(age)
)
events_party_car_drivers %>%
filter(date >= as.Date("2011-01-01")) %>%
group_by(age, hour_interval, src_file) %>%
count %>%
ggplot(aes(x = age, y = n)) +
#geom_point(alpha = 0.7) +
geom_point(alpha = 0.5, mapping = aes(color = src_file)) +
geom_smooth(alpha = 0.2, method = "loess") +
facet_wrap(~hour_interval, scales = "free_y") +
labs(x = "Age", y = "Number of accidents", color = NULL) +
scale_x_continuous(breaks = 1:10*10, limits = c(18, 85)) +
scale_color_brewer(palette = "Blues") +
theme_bw()
Number of accidents per year by car driver's age
Alcohol involved accidents, number of accidents by age at day/night:
events_party_car_drivers %>%
filter(date >= as.Date("2011-01-01"), alco) %>%
group_by(age, hour_interval, src_file) %>%
count %>%
ggplot(aes(x = age, y = n)) +
#geom_point(alpha = 0.7) +
geom_point(alpha = 0.3, mapping = aes(color = src_file)) +
geom_smooth(method = "loess", color = "orange", span = 0.6) +
facet_wrap(~hour_interval) +
labs(x = "Age", y = "Number of accidents", color = NULL) +
scale_x_continuous(breaks = 1:10*10, limits = c(18, 80)) +
scale_color_brewer(palette = "Oranges") +
theme_bw()+
theme(legend.position = "none")
Number of alcohol induced accidents per year, by car driver age
knitr 1.16
Xie Y (2017). knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.16, http://yihui.name/knitr/.
rmarkdown 1.5
Allaire J, Cheng J, Xie Y, McPherson J, Chang W, Allen J, Wickham H, Atkins A, Hyndman R and Arslan R (2017). rmarkdown: Dynamic Documents for R. R package version 1.5, https://CRAN.R-project.org/package=rmarkdown.
tidyr 0.6.2
Wickham H (2017). tidyr: Easily Tidy Data with 'spread()' and 'gather()' Functions. R package version 0.6.2, https://CRAN.R-project.org/package=tidyr.
dplyr 0.5.0
Wickham H and Francois R (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0, https://CRAN.R-project.org/package=dplyr.
ggjoy 0.2.0
Wilke C (2017). ggjoy: Joyplots in 'ggplot2'. R package version 0.2.0, https://CRAN.R-project.org/package=ggjoy.
viridis 0.4.0
Garnier S (2017). viridis: Default Color Maps from 'matplotlib'. R package version 0.4.0, https://CRAN.R-project.org/package=viridis.
viridisLite 0.2.0
Garnier S (2017). viridisLite: Default Color Maps from 'matplotlib' (Lite Version). R package version 0.2.0, https://CRAN.R-project.org/package=viridisLite.
ggplot2 2.2.1
Wickham H (2009). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-0-387-98140-6, http://ggplot2.org.
datamodelr 0.2.1.9001
Bergant D (2017). datamodelr: Define and Plot Data Model Diagrams. R package version 0.2.1.9001, https://github.com/bergant/datamodelr.
trafficaccidents 0.1.0
Bergant D (2018). trafficaccidents: Traffic Accidents in Slovenia. R package version 0.1.0, http://github.com/bergant/trafficaccidents.
tufte 0.2
Xie Y and Allaire J (2016). tufte: Tufte's Styles for R Markdown Documents. R package version 0.2, https://CRAN.R-project.org/package=tufte.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.