The analysis of mobility data often begins with the processing of data capturing the time and location of individuals. When preparing these data for analysis, the continuous timestamp data can be particularly difficult to normalize since they require the data scientist to integrate model. For example, we may want to build sequences of checkins for a given device over locations and then count the total number of transitions between locations over all individuals. Problems like these are common in mobility research, they require careful consideration based on the goals of an analysis, and software tools implementing these types of computations will provide benefits in terms of time savings and data integrity. To address these challenges we provide the checkin
package, which provides a standard set of functions for appropriately descretizing spatio-timestamp data for aggregate analysis for the R programming environment
You can install the the development version of {checkin} from GitHub with:
# install.packages("devtools")
devtools::install_github("kaneplusplus/checkin")
Consider the checkins
data set from the checkin
library shown below. The data consists of 3 columns corresponding to the device (id), time, (timestamp), and location identifier (location). There are a total of
1000 unique people/devices rangin in time from 2020-04-19 00:01:13 EST to 2020-05-08 18:58:29 EST.
library(checkin)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
data(checkins)
checkins
#> # A tibble: 14,609 × 3
#> id timestamp location
#> <int> <dttm> <int>
#> 1 442 2020-04-19 17:23:44 36496
#> 2 442 2020-04-19 17:23:35 36496
#> 3 166 2020-04-19 12:55:44 37461
#> 4 476 2020-04-19 06:23:47 33476
#> 5 456 2020-04-19 05:39:09 33468
#> 6 458 2020-04-19 10:04:20 36500
#> 7 458 2020-04-19 15:15:25 36876
#> 8 651 2020-04-19 13:31:19 37391
#> 9 652 2020-04-19 05:27:54 37469
#> 10 653 2020-04-19 05:28:59 37389
#> # … with 14,599 more rows
#> # ℹ Use `print(n = ...)` to see more rows
checkins |>
group_by(id) |>
summarize(total_ids = n())
#> # A tibble: 1,000 × 2
#> id total_ids
#> <int> <int>
#> 1 1 1
#> 2 2 2
#> 3 3 1
#> 4 4 5
#> 5 5 1
#> 6 6 3
#> 7 7 1
#> 8 8 1
#> 9 9 2
#> 10 10 2
#> # … with 990 more rows
#> # ℹ Use `print(n = ...)` to see more rows
Now suppose we would like to examine the checkins of indvidual/device number 335. We will extract all rows with id 335 and then specify an interval starting at the beginning of the hour of the first checkin to one hour later using the checkins_in_interval()
function. The code and output are below and there are two things to note. First, the first row location is NA
. This is because the individual/device does not appear before 2020-04-19 00:41:46 and a location cannot be determined. Second the original data set did not include an entry for id 335 at 2020-04-19 00:59:59. This value was carried forward from the previous location.
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
x <- checkins %>%
filter(id == 335) %>%
arrange(timestamp)
start <- x$timestamp[1]
minute(start) <- 0
second(start) <- 0
end <- start + hours(1) - seconds(1)
checkins_in_interval(x, "timestamp", start, end)
#> # A tibble: 3 × 3
#> id timestamp location
#> <int> <dttm> <int>
#> 1 NA 2020-04-19 00:00:00 NA
#> 2 335 2020-04-19 00:41:46 32576
#> 3 335 2020-04-19 00:59:59 32576
Finally, suppose we would like to get the locations of individuals/devices at the beginning of a time interval, the location at the end of the interval, and the total amount of time the individual/devices was checked into the beginning location. This operation would be performed in two steps. First, a function, from_to()
is constructed, which takes the rows corresponding to a single id
for a given interval. This function finds, the initial location (from
), the end location(to
), the timestamp at the beginning of the interval, and the duration of the initial location. In the second step, we group the checkin
data by id
and pass the result to the map_hourly_interval_dfr()
function, which applies to from_to()
function to each id
over each hourly interval. Other intervals are included in the package and the documentation includes a information on how to construct similar functions over custom intervals.
from_to <- function(it) {
it$duration <- c(diff(it$timestamp), 0)
units(it$duration) <- "secs"
it$duration <- as.numeric(it$duration)
from_duration <- sum(it$duration[it$location == it$location[1]])
tibble(from=it$location[1],
to = it$location[nrow(it)],
timestamp = it$timestamp[1],
from_duration = from_duration)
}
checkins |>
group_by(id) |>
map_hourly_interval_dfr(from_to, time = "timestamp")
#> # A tibble: 319,867 × 5
#> id from to timestamp from_duration
#> <chr> <int> <int> <dttm> <dbl>
#> 1 1 33175 33175 2020-04-22 05:00:00 3599
#> 2 2 36523 36523 2020-04-20 01:00:00 3599
#> 3 2 36523 36523 2020-04-20 02:00:00 3599
#> 4 2 36523 36523 2020-04-20 03:00:00 3599
#> 5 2 36523 36523 2020-04-20 04:00:00 3599
#> 6 2 36523 36523 2020-04-20 05:00:00 3599
#> 7 2 36523 36523 2020-04-20 06:00:00 3599
#> 8 2 36523 36523 2020-04-20 07:00:00 3599
#> 9 2 36523 36523 2020-04-20 08:00:00 3599
#> 10 2 36523 36523 2020-04-20 09:00:00 3599
#> # … with 319,857 more rows
#> # ℹ Use `print(n = ...)` to see more rows
This work was partially supported by the National Science Foundation (NSF) Grant Human Networks and Data Science - Infrastructure (HNDS-I), award numbers 2024335 and 2024233.
Please note that the checkin project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.