knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(gtfstools)
GTFS feeds are often used to describe very large and complex public transport networks. These large files can become quite cumbersome to use, manipulate and move around, and it's not uncommon that one wants to analyze only a specific subset of the data.
gtfstools includes a few functions to filter GTFS feeds, thus allowing for faster and more convenient data processing. The filtering functions currently available are:
filter_by_agency_id()
filter_by_route_id()
filter_by_service_id()
filter_by_shape_id()
filter_by_stop_id()
filter_by_trip_id()
filter_by_route_type()
filter_by_weekday()
filter_by_time_of_day()
filter_by_sf()
This vignette will introduce you to these functions and will cover their usage in detail.
We will start by loading the packages required by this demonstration into the current R session:
library(gtfstools)
library(ggplot2)
Filtering by agency_id, route_id, service_id, shape_id, stop_id, trip_id or route_type:
The first seven functions in the list work in a very similar fashion. You specify a vector of identifiers, and the function keeps (or drops, as we'll see soon) all the entries that are in any way related to these identifiers. Let's see how that works using filter_by_trip_id():
path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)
utils::object.size(gtfs)

head(gtfs$trips[, .(trip_id, trip_headsign, shape_id)])

# keeping trips CPTM L07-0 and CPTM L07-1
smaller_gtfs <- filter_by_trip_id(gtfs, c("CPTM L07-0", "CPTM L07-1"))
utils::object.size(smaller_gtfs)

head(smaller_gtfs$trips[, .(trip_id, trip_headsign, shape_id)])

unique(smaller_gtfs$shapes$shape_id)
We can see from the code snippet above that the function not only filters the trips table, but all other tables that contain a key that can be identified via its relation to trip_id. For example, since the trips CPTM L07-0 and CPTM L07-1 are described by the shapes 17846 and 17847, respectively, these are the only shapes kept in smaller_gtfs.
The function also supports the opposite behaviour: instead of keeping the entries related to the specified identifiers, you can drop them. To do that, set the keep argument to FALSE:
# dropping trips CPTM L07-0 and CPTM L07-1
smaller_gtfs <- filter_by_trip_id(
  gtfs,
  c("CPTM L07-0", "CPTM L07-1"),
  keep = FALSE
)
utils::object.size(smaller_gtfs)

head(smaller_gtfs$trips[, .(trip_id, trip_headsign, shape_id)])

head(unique(smaller_gtfs$shapes$shape_id))
And the specified trips (and their respective shapes as well) are nowhere to be seen. Please note that, since we are keeping many more entries in the second case, the resulting GTFS object, though smaller than the original, is much larger than in the first case.
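The size difference between the two filtered feeds can also be checked table by table. The snippet below is a minimal sketch, not part of the package: count_rows() is a hypothetical helper introduced only for this illustration, and the feed is read again so that the example runs on its own.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

ids <- c("CPTM L07-0", "CPTM L07-1")
kept <- filter_by_trip_id(gtfs, ids)
dropped <- filter_by_trip_id(gtfs, ids, keep = FALSE)

# hypothetical helper (not a gtfstools function): number of rows of each
# table in a feed, NA for elements that are not tables
count_rows <- function(feed) {
  vapply(
    feed,
    function(tbl) if (is.data.frame(tbl)) nrow(tbl) else NA_integer_,
    integer(1)
  )
}

count_rows(gtfs)
count_rows(kept)
count_rows(dropped)

# each trip ends up in exactly one of the two filtered feeds, so the trips
# tables of both results should add up to the original one
nrow(kept$trips) + nrow(dropped$trips) == nrow(gtfs$trips)
```

Comparing the three vectors shows which tables shrink along with trips and by how much.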
The same logic demonstrated with filter_by_trip_id() applies to the functions that filter feeds by agency_id, route_id, service_id, shape_id, stop_id and route_type.
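As a short illustration of that claim, here is a sketch using filter_by_route_id() and filter_by_route_type(). To avoid assuming which identifiers exist in the sample feed, the values are taken from the routes table itself; the variable names are chosen only for this example.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

# pick identifiers from the feed itself instead of hard-coding them
some_routes <- head(unique(gtfs$routes$route_id), 2)

# keeping only the entries related to these two routes
by_route <- filter_by_route_id(gtfs, some_routes)
unique(by_route$trips$route_id)

# same idea with route types: dropping every route of the first type listed
one_type <- unique(gtfs$routes$route_type)[1]
without_type <- filter_by_route_type(gtfs, one_type, keep = FALSE)
unique(without_type$routes$route_type)
```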
Frequently enough, one wants to analyze service levels on certain days of the week or during different times of the day. The functions filter_by_weekday() and filter_by_time_of_day() can be used for this purpose.
The first one takes the days of the week you want to keep/drop and also includes a combine argument that controls how multi-day filters work. Let's see how that works with a few examples:
# keeping entries related to services that run on saturdays AND sundays
smaller_gtfs <- filter_by_weekday(
  gtfs,
  weekday = c("saturday", "sunday"),
  combine = "and"
)
smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")]

# keeping entries related to services that run EITHER on saturdays OR on
# sundays
smaller_gtfs <- filter_by_weekday(
  gtfs,
  weekday = c("sunday", "saturday"),
  combine = "or"
)
smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")]

# dropping entries related to services that run on saturdays AND sundays
smaller_gtfs <- filter_by_weekday(
  gtfs,
  weekday = c("saturday", "sunday"),
  combine = "and",
  keep = FALSE
)
smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")]

# dropping entries related to services that run EITHER on saturdays OR on
# sundays
smaller_gtfs <- filter_by_weekday(
  gtfs,
  weekday = c("sunday", "saturday"),
  combine = "or",
  keep = FALSE
)
smaller_gtfs$calendar[, c("service_id", "sunday", "saturday")]
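One way to convince yourself of what combine does is to compare the services kept by each option. The sketch below assumes nothing beyond the arguments shown above. The object names (weekend, both, either) are chosen only for this example.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

weekend <- c("saturday", "sunday")
both <- filter_by_weekday(gtfs, weekday = weekend, combine = "and")
either <- filter_by_weekday(gtfs, weekday = weekend, combine = "or")

# services that run on both days also run on at least one of them, so the
# "and" result should be a subset of the "or" result
all(both$calendar$service_id %in% either$calendar$service_id)

# services kept only by the "or" filter run on just one of the two days
setdiff(either$calendar$service_id, both$calendar$service_id)
```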
Meanwhile, filter_by_time_of_day() takes the beginning and the end of a time block (the from and to arguments, respectively) and keeps the entries related to trips that run within the specified block. Please note that the function works a bit differently depending on whether a trip's behaviour is described using the frequencies and the stop_times tables together or using the stop_times table alone: the stop_times entries of trips described in frequencies should not be filtered, because they are just "templates" that describe how long it takes from one stop to another (i.e. the departure and arrival times listed there should not be considered "as is"). Let's see what that means with an example:
smaller_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00")

head(smaller_gtfs$frequencies)

# stop_times entries are preserved because they should be interpreted as
# "templates"
head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])

# had the feed not had a frequencies table, the stop_times table would be
# adjusted
frequencies <- gtfs$frequencies
gtfs$frequencies <- NULL
smaller_gtfs <- filter_by_time_of_day(gtfs, from = "05:00:00", to = "06:00:00")

head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])
When filtering the stop_times table, we have two options. We either keep entire trips that cross the specified time block, or we keep only the trip segments within this block (default behaviour). To control this behaviour you can use the full_trips parameter:
smaller_gtfs <- filter_by_time_of_day(
  gtfs,
  "05:00:00",
  "06:00:00",
  full_trips = TRUE
)

# CPTM L07-0 trip is kept intact because it crosses the time block
head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])

# dropping entries related to trips that cross the specified time block
smaller_gtfs <- filter_by_time_of_day(
  gtfs,
  "05:00:00",
  "06:00:00",
  full_trips = TRUE,
  keep = FALSE
)

# CPTM L07-0 trip is gone
head(smaller_gtfs$stop_times[, c("trip_id", "departure_time", "arrival_time")])
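To get a feel for how much the two options differ, the number of stop_times rows kept by each can be compared. This is only a sketch: the feed is read again and its frequencies table removed, as in the text above, so that stop_times is actually filtered. The names segments and whole are chosen for this example.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

# without a frequencies table, stop_times is filtered by the time block
gtfs$frequencies <- NULL

# default behaviour: only the trip segments inside the time block
segments <- filter_by_time_of_day(gtfs, "05:00:00", "06:00:00")

# entire trips that cross the time block
whole <- filter_by_time_of_day(
  gtfs,
  "05:00:00",
  "06:00:00",
  full_trips = TRUE
)

nrow(gtfs$stop_times)
nrow(segments$stop_times)
nrow(whole$stop_times)
```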
filter_by_time_of_day() also includes an update_frequencies argument, used to control whether the frequencies table should have its start_time and end_time fields updated to fit inside/outside the specified time of day. Please read the function documentation to understand how this argument interacts with the exact_times field.
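A simple way to see the effect of this argument is to run the same filter with and without it and compare the resulting frequencies tables. The sketch below makes no claim about the values you will see, since they depend on the feed and on its exact_times field. The names untouched and updated are chosen only for this example, and the feed is read again so that its frequencies table is present.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

# same time block, with and without updating the frequencies table
untouched <- filter_by_time_of_day(gtfs, "05:00:00", "06:00:00")
updated <- filter_by_time_of_day(
  gtfs,
  "05:00:00",
  "06:00:00",
  update_frequencies = TRUE
)

# compare the time windows listed in each result
head(untouched$frequencies[, c("trip_id", "start_time", "end_time")])
head(updated$frequencies[, c("trip_id", "start_time", "end_time")])
```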
It's not uncommon that one wants to analyze only the transit services of a smaller region contained inside a feed. The filter_by_sf() function allows you to filter GTFS data using a given spatial extent. This function takes a spatial sf/sfc object (or its bounding box) and keeps/drops the entries related to shapes and trips selected via a specified spatial operation. It may sound a bit complicated, but it's fairly easy to understand when shown. Let's create an auxiliary function to save us some typing:
plotter <- function(gtfs,
                    geom,
                    spatial_operation = sf::st_intersects,
                    keep = TRUE,
                    do_filter = TRUE) {
  if (do_filter) {
    gtfs <- filter_by_sf(gtfs, geom, spatial_operation, keep)
  }

  shapes <- convert_shapes_to_sf(gtfs)
  trips <- get_trip_geometry(gtfs, file = "stop_times")
  geom <- sf::st_as_sfc(geom)

  ggplot() +
    geom_sf(data = trips) +
    geom_sf(data = shapes) +
    geom_sf(data = geom, fill = NA)
}
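This function optionally filters the feed with filter_by_sf(), converts the shapes and the trips of the resulting feed to sf objects, and plots them together with the geometry used in the filter (geom), to show the effect of each filter_by_sf() argument in the final result.
Also, please note that our plotter() function takes the same arguments as filter_by_sf() (with the exception of do_filter, which is used to show the unfiltered data), as well as the same defaults.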
Let's say that we want to filter GTFS data using the bounding box of the shape 68962. Here's what the unfiltered data looks like, with the bounding box placed on top of it.
bbox <- sf::st_bbox(convert_shapes_to_sf(gtfs, shape_id = "68962"))

plotter(gtfs, bbox, do_filter = FALSE)
By default, filter_by_sf() (and plotter(), consequently) keeps all the data related to the trips and shapes that intersect the given geometry. Here's what it looks like:
plotter(gtfs, bbox)
Alternatively you can also drop such data:
plotter(gtfs, bbox, keep = FALSE)
You can also control which spatial operation you want to use to filter the data. This is how you'd keep the data that is contained inside the given geometry:
plotter(gtfs, bbox, spatial_operation = sf::st_contains)
And, simultaneously using spatial_operation and keep, this is how you'd drop the data contained inside the geometry:
plotter(gtfs, bbox, spatial_operation = sf::st_contains, keep = FALSE)
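The plots above go through the plotter() wrapper, but the same filters can be applied by calling filter_by_sf() directly. The sketch below counts how many shapes survive each spatial operation instead of plotting them. It assumes the sf package is installed; the names intersecting and contained are chosen only for this example.

```r
library(gtfstools)

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")
gtfs <- read_gtfs(path)

# bounding box of shape 68962, as in the examples above
bbox <- sf::st_bbox(convert_shapes_to_sf(gtfs, shape_id = "68962"))

# default spatial operation: keep what intersects the bounding box
intersecting <- filter_by_sf(gtfs, bbox)

# stricter operation: keep what is contained inside the bounding box
contained <- filter_by_sf(gtfs, bbox, spatial_operation = sf::st_contains)

# number of distinct shapes in the original feed and in each filtered feed
length(unique(gtfs$shapes$shape_id))
length(unique(intersecting$shapes$shape_id))
length(unique(contained$shapes$shape_id))
```

Working with counts like these can be handy when a visual check is not practical, for instance inside a script that processes several feeds.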
All filtering functions return a GTFS object readily available to be manipulated and analyzed using the rest of gtfstools' toolkit. For more information on how to use other functions made available by the package, please see the introductory vignette.