Handle implicit missingness with tsibble

knitr::opts_chunk$set(
  warning = FALSE, message = FALSE, echo = TRUE, collapse = TRUE,
  fig.width = 7, fig.height = 6, fig.align = 'center',
  comment = "#>"
)
options(tibble.print_min = 5)

Once you have set up a tsibble, you are ready to analyse it in its temporal context. Many time operations, such as lag and lead, assume an intact input vector ordered in time, so implicit gaps can introduce errors that go unnoticed. To keep such errors out of the analysis, it is good practice to inspect the tsibble for implicit missingness first. A handful of tools are provided to understand and tackle missing values: (1) has_gaps() checks whether implicit missingness exists; (2) scan_gaps() reports all implicit missing entries; (3) count_gaps() summarises the time ranges that are absent from the data; (4) fill_gaps() turns them into explicit ones, optionally imputing by values or functions. These functions share a common argument, .full. If FALSE (the default), each function looks into the time period of each key; if TRUE, it uses the full-length time span of the data.

The pedestrian data contains hourly tallies of pedestrians at four counting sensors in inner Melbourne in 2015 and 2016.

library(dplyr)
library(tsibble)
pedestrian

Indeed, each sensor has gaps in time.

has_gaps(pedestrian, .full = TRUE)

So where are these gaps? scan_gaps() gives a detailed report of the missing observations, while count_gaps() presents a summary table of each individual time gap, alongside its number of missing observations, for each key. It is clear that Birrarung Marr exhibits many chunks of missingness, while Bourke Street Mall (North) has as many as 1128 consecutive missing observations. With .full = FALSE, the opening gap for Bourke Street Mall (North) would not be recognised.

ped_gaps <- pedestrian %>% 
  count_gaps(.full = TRUE)
ped_gaps
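
As a minimal sketch of the two remarks above, scan_gaps() can be called in the same way as count_gaps(), and rerunning count_gaps() with .full = FALSE should show the opening gap for Bourke Street Mall (North) dropping out. The object names ped_missing and ped_gaps_own are illustrative only and are not used elsewhere in this vignette.

```r
library(dplyr)
library(tsibble)

# Row-level report: one row per implicitly missing Sensor/Date_Time pair
ped_missing <- pedestrian %>%
  scan_gaps(.full = TRUE)
ped_missing

# Same kind of summary as ped_gaps, but each sensor is only checked
# within its own observed time span (illustrative name)
ped_gaps_own <- pedestrian %>%
  count_gaps(.full = FALSE)
ped_gaps_own
```

Comparing ped_gaps_own with the summary obtained under .full = TRUE is a quick way to tell gaps inside a series apart from a series that simply starts late or ends early.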

These gaps can be presented visually with ggplot2 as follows. Two points stand out at all sensors, probably because the system skipped the extra hour when switching from daylight saving time to standard time.

library(ggplot2)
ggplot(ped_gaps, aes(x = Sensor, colour = Sensor)) +
  geom_linerange(aes(ymin = .from, ymax = .to)) +
  geom_point(aes(y = .from)) +
  geom_point(aes(y = .to)) +
  coord_flip() +
  theme(legend.position = "bottom")

Now that we know whether there are missing periods and where they are, we can enlarge the tsibble to include explicit NA in place of the implicit missingness. The function fill_gaps() takes care of filling in the key and index, and leaves the other variables filled with the default NA. The pedestrian data initially contains NROW(pedestrian) observations and is augmented to NROW(fill_gaps(pedestrian, .full = TRUE)) observations.

ped_full <- pedestrian %>% 
  fill_gaps(.full = TRUE)
ped_full
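
The row counts can be reconciled with the gap report. The following is a sketch of that sanity check, under the assumption that each entry reported by scan_gaps() becomes exactly one new row in the filled tsibble. The names n_original, n_missing and n_filled are illustrative.

```r
library(dplyr)
library(tsibble)

n_original <- NROW(pedestrian)
n_missing  <- NROW(scan_gaps(pedestrian, .full = TRUE))
n_filled   <- NROW(fill_gaps(pedestrian, .full = TRUE))

# Original rows plus implicit gaps should account for every filled row
n_original + n_missing == n_filled

# Once filled, no implicit gaps should remain for any sensor
pedestrian %>%
  fill_gaps(.full = TRUE) %>%
  has_gaps(.full = TRUE)
```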

Besides the default NA, fill_gaps() accepts a set of name-value pairs, so that the newly created entries are imputed with the values or functions you supply.

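# Replace the filled entries of Count with a constant
pedestrian %>% 
  fill_gaps(Count = 0L, .full = TRUE)
# Replace the filled entries of Count with each sensor's mean
pedestrian %>% 
  group_by_key() %>% 
  fill_gaps(Count = mean(Count), .full = TRUE)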
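
To illustrate why lag and lead need an intact input, here is a small sketch on hypothetical toy data (the toy tsibble and its columns year and value are made up for this example and are not part of the package data). The year 2012 is implicitly missing, so a plain lag pairs 2013 with the 2011 value, whereas filling the gap first yields NA for that row.

```r
library(dplyr)
library(tsibble)

# Hypothetical yearly series with 2012 implicitly missing
toy <- tsibble(
  year = c(2010, 2011, 2013),
  value = c(10, 20, 40),
  index = year
)

# Without filling, the "previous" value for 2013 comes from 2011
lag_raw <- toy %>%
  mutate(previous = dplyr::lag(value))
lag_raw

# After filling, 2012 is an explicit NA row, so 2013 lags to NA
lag_filled <- toy %>%
  fill_gaps() %>%
  mutate(previous = dplyr::lag(value))
lag_filled
```

The first result looks plausible but is off by a year, which is the kind of error that inspecting and filling gaps is meant to prevent.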



tsibble documentation built on Oct. 9, 2022, 9:05 a.m.