# Necessary for CRAN to avoid CPU / elapsed time ratios being too high nthr <- data.table::setDTthreads (1)
gtfsrouter
includes a function,
gtfs_traveltimes()
,
to efficiently calculate travel times from a nominated station to all other
stations within a GTFS feed. The function takes only two main parameters, the
first specifying a departure station, and the second specifying a pair of
start_time_limits
determining the earliest and latest possible departure
times from that station. The function will then return the fastest connections
to all possible end stations for services departing within the specified
start_time_limits
.
For example, travel times to all stations for services leaving a nominated
station between 12:00 and 13:00 can be extracted by specifying
start_time_limits = c (12, 13) * 3600
. The following code uses the internal
Berlin data to demonstrate, starting by reading the GTFS data and constructing
a timetable for a specified day. This second step, achieved by calling the
gtfs_timetable()
function,
is necessary prior to calculating travel times.
library (gtfsrouter) berlin_gtfs_to_zip () f <- file.path (tempdir (), "vbb.zip") gtfs <- extract_gtfs (f, quiet = TRUE) gtfs <- gtfs_timetable (gtfs, day = 3)
The
gtfs_traveltimes()
,
function may then be called by specifying a start station (from
), and
start_time_limits
, as in the following code:
from <- "Alexanderplatz" start_time_limits <- c (12, 13) * 3600 tt <- gtfs_traveltimes ( gtfs, from = from, start_time_limits = start_time_limits ) head (tt)
The function returns a data.frame
in which each row details the fastest
connection to each station, in terms of the start time and duration of that
trip, as well as the number of transfers necessary.
The previous code returns a data.frame
detailing traveltimes for the following stations:
nrow (tt)
This relatively low number is because the berlin_gtfs_to_zip()
function
creates only a small sample portion of a full feed, extending over only one
hour and containing a small number of stations. The total number of stations
reachable from the r from
station during these times are:
nrow (gtfs$stops)
The
gtfs_traveltimes()
,
returns travel times to only a subset of all potentially reachable stations
(r nrow (tt)
instead of r nrow (gtfs$stops)
) for two main reasons. First,
not all stations may actually be reachable from a given station, and travel
times will only be returned for those that are. Secondly, the
gtfs_traveltimes()
,
function has an additional parameter, max_traveltime
, and only returns travel
times to stations able to be reached within this upper limit. The default value
is 3600, or one hour. For the token feed used here, the maximum travel times
returned are:
library (hms) hms (as.integer (max (tt$duration)))
In this case that value actually reflects the restricted data included in the sample data set, but applying the function to an actual GTFS feed will by default record traveltimes to all stations reachable within one hour. The following shows a more realistic example using a full version of the Berlin GTFS data:
gtfs <- extract_gtfs ("/<path>/<to>/vbb.zip") gtfs <- gtfs_timetable (gtfs, day = 3) tt <- gtfs_traveltimes ( gtfs, from = from, start_time_limits = start_time_limits ) nrow (tt) hms::hms (as.integer (max (tt$duration))) ## [1] 8556 ## 01:00:00
The number of stations reached in the full feed is far more, and the maximum
trip duration is indeed precisely one hour. This parameter of max_traveltime
defaults to the relatively short value of one hour because calculation times
increase non-linearly with numbers of stations reached. The following graphs
show calculation times as a function both of max_traveltime
, and numbers of
stations reached.
Click one the arrow to the left to see code used to generate these results
maxt <- 3600 + 0:10 * 1800 # 1-6 hours in half-hour intervals
dat <- vapply (
maxt, function (i) {
st <- system.time (
res <- gtfs_traveltimes (
gtfs,
from = from,
start_time_limits = start_time_limits,
max_traveltime = i
)
)
return (c (st [3], nrow (res))) },
numeric (2)
)
dat <- data.frame (
max_time = maxt / 3600, # in hours
calc_time = dat [1, ],
n_stns = dat [2, ] / 1000
)
par (mfrow = c (1, 2))
plot (dat$max_time, dat$calc_time,
pch = 19, col = "gray",
xlab = "Max Traveltime (hours)",
ylab = "Calculation Time (seconds)"
)
lines (dat$max_time, dat$calc_time)
plot (dat$n_stns, dat$calc_time,
pch = 19, col = "gray",
xlab = "Thousands of Stations Reached",
ylab = "Calculation Time (seconds)"
)
lines (dat$n_stns, dat$calc_time)
maxt <- 3600 + 0:10 * 1800 calc_time <- c ( 0.914, 1.246, 1.911, 2.156, 2.333, 2.513, 2.889, 3.344, 3.784, 4.200, 4.661 ) n_stns <- c ( 8556, 12530, 15989, 19364, 21752, 23628, 24628, 25004, 25191, 25295, 25352 ) dat <- data.frame ( max_time = maxt / 3600, # in hours calc_time = calc_time, n_stns = n_stns / 1000 ) par (mfrow = c (1, 2)) plot (dat$max_time, dat$calc_time, pch = 19, col = "gray", xlab = "Max Traveltime (hours)", ylab = "Calculation Time (seconds)" ) lines (dat$max_time, dat$calc_time) plot (dat$n_stns, dat$calc_time, pch = 19, col = "gray", xlab = "Thousands of Stations Reached", ylab = "Calculation Time (seconds)" ) lines (dat$n_stns, dat$calc_time)
Those graphs show that increasing the max_traveltime
parameter leads to
approximately linear increases in the calculation times required for
gtfs_traveltimes()
,
to execute, while relationships with numbers of stations actually reached are
highly non-linear. The panel on the right side shows that increases in
max_traveltime
eventually have little effect on increasing numbers of
stations reached, yet increases computation times considerably. Values for
max_traveltime
should accordingly only be adjusted after first determining
appropriate values. In particular, note that although the Berlin GTFS has
a large number of distinct stops:
nrow (gtfs$stops) ## [1] 41577
Many of these reflect stop_id
values for multiple platforms at a single station. The number of distinct stop names is in fact much less:
length (unique (gtfs$stops$stop_name)) ## [1] 13090
A rule-of-thumb is to increase the max_traveltime
parameter until the number
of stations reached exceeds by some small amount the total number of actual
stops in a system. In the case of these Berlin data, a maximum traveltime of
2 hours is sufficient.
nrow (gtfs_traveltimes ( gtfs, from = from, start_time_limits = start_time_limits, max_traveltime = 7200 )) ## [1] 15989
These additional stations reached naturally tend to be stations on the
periphery of the system, while doubling the max_traveltime
parameter
approximately doubles the calculation time (in this case, according to the
graphs shown above). Appropriate values will depend on the nature of desired
results. Analyses focussed on more central parts of a system can use smaller
values, with shorter corresponding calculation times. Analyses for which
peripheral stations are important may need to use larger values of
max_traveltimes
, yet should do so carefully to avoid undue increases in
calculation times.
The
gtfs_traveltimes()
,
function has an additional parameter, minimise_transfers
, which can be used
to return either the fastest possible connections from a start station to all
other reachable stations within a system, or the fastest connection which has
the least possible number of transfers. Connections with the least transfers
may be slower than fastest possible connections, as shown by the following
analysis.
tt_fastest <- gtfs_traveltimes ( gtfs, from = from, start_time_limits = start_time_limits ) tt_min_tr <- gtfs_traveltimes ( gtfs, from = from, start_time_limits = start_time_limits, minimise_transfers = TRUE ) # non-dplyr join: tt_fastest <- tt_fastest [tt_fastest$stop_id %in% tt_min_tr$stop_id, ] tt_min_tr <- tt_min_tr [tt_min_tr$stop_id %in% tt_fastest$stop_id, ] dat <- data.frame ( stop_id = tt_fastest$stop_id, fastest_dur = as.numeric (tt_fastest$duration / 3600), # hours fastest_ntr = tt_fastest$ntransfers, min_tr_dur = as.numeric (tt_min_tr$duration / 3600), min_tr_ntr = tt_min_tr$ntransfers ) 60 * mean (dat$min_tr_dur - dat$fastest_dur) # in minutes ## [1] 3.957052
mean (dat$fastest_ntr - dat$min_tr_ntr) ## [1] 0.2818428
Minimal-transfer trips take on average just under 4 minutes longer, while requiring 0.28 fewer trips. Most fastest trips are nevertheless also minimal-transfer trips, as can be seen by the number of identical values returned by these two queries.
length (which (dat$min_tr_ntr == dat$fastest_ntr)) / nrow (dat) ## [1] 0.6875221
So almost 70% of faster trips are also trips with fewest number of transfers, with around 30% of fastest trips involving more than the minimal possible number of transfers.
The
gtfs_traveltimes()
,
is based on a new algorithm specifically developed for this package, and which
will be described here in further detail in subsequent releases of the package.
data.table::setDTthreads (nthr)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.