In vegawidget/vegawidget: 'Htmlwidget' for 'Vega' and 'Vega-Lite'

library("vegawidget")
library("conflicted")
library("lubridate")
library("readr")
library("dplyr")

Dates and times can be tricky, even within the familiar confines of R. Vega and Vega-Lite are run in JavaScript, which has its own philosophy for dates and times. In R, we have access to the entire time-zone database; in JavaScript, using the Date object, there are only two time zones available: the local time zone used by the client browser and UTC.

If you are working with time-based data, this mismatch of capabilities introduces a lot of "opportunities" to create Vega(-Lite) visualizations that do not behave as intended. The purposes of this article are

to walk through the pit-falls of translating dates and times from R to JavaScript.
to describe the in-and-outs of Vega-Lite time-units.
to show some (hopefully) useful practices for staying out of trouble.

The documentation for the Altair Python package includes an article on times and dates with very clear explanations; it is hoped that this article can be similarly clear.

Time zones

str_moon_landing <- "1969-07-21T02:56:15Z"
moon_landing <- parse_datetime(str_moon_landing)

Things happen at fixed instants in time; we use time zones to describe those fixed instants using a local context. Consider that Neil Armstrong's first step on the moon happened at r str_moon_landing, represented here as an ISO-8601 string.

All ISO-8601 strings refer unambiguously to UTC, the globally-recognized standard reference-frame; it is the standard way to communicate time-stamps. However, it lacks context for someone who may remember an instant from Chicago, or another who may remember the same instant from Paris.

In the context of Chicago, the first-step happened at r with_tz(moon_landing, "America/Chicago"); in the context of Paris, it happened at r with_tz(moon_landing, "Europe/Paris"). We use a system of time zones to provide this context. The standard representation of time zones is IANA (also known as Olson) time-zones. These are each associated with a given geographic region, such as "America/Chicago" or "Europe/Paris"; a time zone is used to determine the offset to UTC, given an instant expressed in UTC.

High level-programming languages, such as R, have access to the full IANA time-zone database. However, the JavaScript Date object has access to only two time zones: the timezone of the local browser, and UTC. JavaScript-based visualization systems, including d3, Plotly, and Vega, are constrained by this limitation because they depend on the JavaScript Date object.

Vega-Lite examples

Let's look at an example using one of the datasets included in this package: data_seattle_hourly, adapted from vega-datasets:

glimpse(data_seattle_hourly)

This dataset contains hourly observations of Seattle temperature (°F) for the year 2010. Let's inspect the time zone of the date variable:

tz(data_seattle_hourly$date)

We see that the date variable uses the "America/Los_Angeles" time zone, which is what we would expect for Seattle. This means that the values for date are displayed above as local time. To make the following demonstrations a little lighter, we will use an shortened version of this dataset, one that uses only 2010-01-01:

data_seattle_hourly_short <-
  data_seattle_hourly %>%
  dplyr::filter(
    floor_date(date, "day") == ymd("2010-01-01", tz = "America/Los_Angeles")
  ) %>%
  glimpse()

spec_local <- 
  list(
    `$schema` = vega_schema(),
    width = 600,
    height = 75,
    data = list(values = data_seattle_hourly_short),
    mark = "line",
    encoding = list(
      x = list(
        field = "date", 
        type = "temporal",
        axis = list(format = "%H:%M")
      ),
      y = list(
        field = "temp", 
        type = "quantitative", 
        scale = list(zero = FALSE)
      )
    )
  ) %>%
  as_vegaspec()

spec_local

Our goal is to show the times in the chart using the local time-zone; in this case, we have succeeded. Good news, but let's have a closer look at what's going on.

Here are the first couple of observations in the data frame, rendered as JSON (using the same settings for datetimes as we use when render the entire vegaspec to JSON):

data_seattle_hourly_short %>% head(2) %>% jsonlite::toJSON()

We see that date-times are not formatted using ISO-8601 formatting - the times are serialized to JSON using the local context. When Vega-Lite parses this data, it recognizes that this is a datetime and that the format is not ISO-8601. As such, these times will be parsed, interpreted, and displayed as local times:

1) Times are parsed as UTC time if the date strings are in ISO format. Note that in JavaScript date strings without time are interpreted as UTC but but date strings with time and without timezone as local. 2) If that is not the case, by default, times are assumed to be local.

This is what we intended to demonstrate, but you should be aware of some "gotchas" due to daylight-saving time:

In Seattle in 2010, the local time represented by "2010-03-14 02:00:00" does not exist; the local time represented by "2010-11-07 01:00:00" occurs twice. This is the reason that the dataset has 8759 observations rather than our expectation of 8760: the "repeated" hour has only one observation. This is one of the fragilities of storing datetimes using local time rather than UTC.
In 2010, in Europe, the daylight-saving time transitions happened on different dates from the USA - in Europe this happened on "2010-03-28" and "2010-10-31". This means that there are local times represented in this dataset that do not exist in local time in Europe, so they will not be parsed, interpreted, or displayed properly.

This is why it is a best-practice in R (and elsewhere) to serialize datetimes using the ISO-8601 format, then to treat the local time-zone, in this case "America/Los_Angeles", as metadata. However, for the reasons outlined in at the beginning of this article, we cannot get this to work using Vega(-Lite). What follows are two different ways that Vega-Lite works with UTC times, neither of which is likely what you want.

This package has a function, vw_serialize_data(), to serialize dates and times in a data frame. It has an argument to specify a format for dates, iso_date, and an option to specify a format for datetimes, iso_dttm. We set the iso_dttm argument to TRUE to serialize our datetimes using ISO-8601 format:

data_seattle_hourly_short_iso <- 
  data_seattle_hourly_short %>%
  vw_serialize_data(iso_dttm = TRUE) %>%
  glimpse()

You can see that we have changed the date variable from a datetime to a character string. Its values show date:

in the ISO-8601 format (according to the iso_dttm argument).
ISO-8601 implies using UTC (at this time, 8 hours ahead of Seattle).
to millisecond precision (matching JavaScript's time-resolution).

Let's see what our chart looks like, using this data:

spec_iso_local <- spec_local

spec_iso_local$data$values <- data_seattle_hourly_short_iso

spec_iso_local

If the browser you are using to view this page is in the "America/Los_Angeles" time zone, this chart will appear identical to the first chart. However, if your browser is not "in" this time-zone, the axis labels for date are now not what we might expect. Vega-Lite parses and interprets the date values as UTC, but displays the date using the browser's local time-zone. I (Ian) am in the "America/Chicago" time zone, so the axis, for me, begins at 02:00. This is consistent with my time zone being two hours ahead of "America/Los_Angeles".

As noted above, the JavaScript Date object knows two time-zones: the browser's and UTC. Our other option is to direct Vega-Lite to display the time as UTC:

spec_iso_utc <- spec_iso_local

spec_iso_utc$encoding$x$scale <- list(type = "utc") 

spec_iso_utc

Here, the date axis begins at 08:00, which is the UTC time when it was midnight in Seattle.

Datetime compromise

Because of the time-zone limitations of the JavaScript Date object, we are forced to make a compromise.

If we serialize datetimes using the local (to the data) time-zone, using a non-ISO-8601 format:

Vega-Lite parses, interprets, and displays the datetimes as if they are local times in the browser's time zone.
Different application of daylight-saving time across different time zones is a potential source of error in the visualization.
The data, as sent to the browser, is incomplete because we have "lost" the link to the actual instants-in-time that the data represents.

If we serialize datetime data using the ISO-8601 format, the instants are parsed and interpreted correctly. However, we have the option to display using only the time zone of the browser or UTC. It is entirely likely that the context (time zone) of our data is different from the context of our browser - in which case we are unable to make an effective visualization.

Faced with this choice, it seems the first option -- serializing our data using the data's local time-zone -- is the least-bad option, as it gives us the opportunity to present the data using its own context. That being said, we need to be mindful of the daylight-saving pitfalls, and warn our users accordingly. Further, we have to recognize (and possibly note to our users) that the data inside the visualization is compromised.

In other situations, the best-practice for serializing datetimes is to use ISO-8601 formatting and to keep the time zone as metadata. This allows us to refer, unambiguously, to the actual instants-in-time. It would be great if this were "baked into" Vega(-Lite), but it seems the limitation is fundamentally in the JavaScript Date object, suggesting that it would be a heavy lift to implement a comprehensive solution.

Notes on dates

In R, we have different types for date (Date) vs. datetimes (POSIXct). This can be a useful distinction, as using dates makes it easier for us to compare a day in New York with a day in Brisbane.

In JavaScript, we have a single type, Date, to represent both dates and datetimes; the convention is to treat dates as the datetimes at the start of that day. For a given situation, it remains to determine which time zone to use: UTC or local time.

In my opinion, this choice would depend on the larger context of the data. If working with a dataset where a date is used in association with datetimes in the same time-zone context, e.g. all associated with a single factory, then it might make more sense to think of both the date and datetimes in the same time-zone context. However, if working with a dataset with dates associated with locations in different time-zones, e.g. New York and Brisbane, it might make more sense to parse these dates using a neutral context, like UTC.

In Vega, the parsing default for datetimes is that if the data is formatted using ISO-8601, then it is parsed as UTC. To be clear, what an R user might think of as a date string: "2001-01-01", is in ISO-8601 format; accordingly, by default, it is parsed as a JavaScript Date object in the UTC context.

Let's consider the dataset data_seattle_daily, where we have daily observations of Seattle weather over the course of four years:

glimpse(data_seattle_daily)

The default JSON-serializer for dates uses the ISO-8601 format for dates. If we wanted to use a non-ISO-8601 format, we could use vw_serialize_data() using iso_date = FALSE:

data_seattle_daily %>%
  vw_serialize_data(iso_date = FALSE) %>%
  glimpse()

For the purpose of making some examples of how to work with UTC dates, let's filter the original dataset, with ISO-8601 formatting, keeping the first month:

data_seattle_daily_short <- 
  data_seattle_daily %>%
  dplyr::filter(floor_date(date, "month") == as.Date("2012-01-01")) %>%
  glimpse()

Let's look at a line-plot of the daily maximum-temperatures:

spec_tempmax_local <- 
  list(
    `$schema` = vega_schema(),
    width = 600,
    height = 75,
    data = list(values = data_seattle_daily_short),
    mark = "line",
    encoding = list(
      x = list(
        field = "date",
        type = "temporal"
      ),
      y = list(
        field = "temp_max",
        type = "quantitative"
      ),
      tooltip = list(
        list(field = "date", type = "temporal", format = "%Y-%m-%d %H:%M:%S")
      )
    )
  ) %>%
  as_vegaspec()

spec_tempmax_local

If you are in a timezone that is different from UTC (sorry UK), you will see that the points that make up the line are not aligned with the grid. This is another effect of time zones - Vega has parsed and interpreted the date as UTC, but its default is to display it to us using the local time-zone of our browser. For me, near Chicago, the days are "starting" six hours early. You can verify this by using the tooltips, which show the date with the time attached.

In this situation, we wish to keep the UTC interpretation throughout the visualization. We can do this by using a UTC scale-specification for the x-axis (and the tooltip):

spec_tempmax_utc <- spec_tempmax_local

spec_tempmax_utc$encoding$x$scale <- list(type = "utc")
spec_tempmax_utc$encoding$tooltip[[1]]$scale <- list(type = "utc")

spec_tempmax_utc

When using UTC dates (and datetimes), be mindful that you may also have to set the scales to UTC, so that the axis and tooltips behave similarly, and properly, for everyone.

Summary

Parsing

When Vega(-Lite) parses values for dates and times, it checks whether the string is formatted using ISO-8601.

Here are some examples of ISO-8601 strings:

2001-01-01T19:34:05Z
2001-01-01T19:34:05+05:00
2001-01-01

Here are some examples of non-ISO-8601 strings:

2001-01-01 19:34:05
Jan-01-2001 19:34:05
2001/01/01

By default, if the format is ISO-8601, Vega will parse the string as UTC. Similarly, by default, if the format is non-ISO-8601, Vega will parse the string as local time. You can override the default by providing a parsing directive in the specification. If you wish to change how dates and times are serialized, you can use the vw_serialize_data() function.

Datetimes

It is likely that you wish to show datetime values using the context of the data, e.g. you want to look at the temperatures in Seattle using in the context of Seattle. From R, assuming that your datetimes are using the data-local timezone, e.g. "America/Los_Angeles", the least-bad way to do this is to serialize your datetime values using data-local time, rather than using UTC; this will happen by default. Although this will let you display the data as you intend, there are a couple of important points to keep in mind:

Weird things may happen at the daylight-saving time transition-points; this problem is compounded if the transition points are different in the data than they are in the user's browser, e.g. the data's context is in the US and the user's browser is in Europe.
Your data is now compromised as your datetime strings no longer unambiguously identify the instants in time. Because Vega does not have access to the full list of time-zones, anyone who extracts the data from the Vega specification will be lacking the time-zone context.

If you wish for Vega to parse your datetimes using UTC, you can serialize your datetimes using the function vw_serialize_data() using iso_dttm = TRUE.

Dates

Date and datetimes are the same type in JavaScript; a date is just a datetime at the start of a day.

Depending on your situation, you may wish to interpret a date using UTC or using the local time-zone of the browser.

If you are comparing dates from different time-zones, e.g. Beijing and London, it may make sense for you, conceptually, to use a single "neutral" time-zone, UTC. In this case you can use the default serialization for dates, the ISO-8601 format, i.e. "2001-01-01"; Vega will parse this as UTC. However, you will have to adjust the scales in your visualization to use UTC.
If your dataset(s) have datetimes and dates from a single time-zone, e.g. a facility in "America/Chicago", it may be useful to display them using the same (local) time-zone context. Here, you can use the vw_serialize_data() function with iso_date = FALSE, such that Vega will parse the dates using the local time-zone.

Scales

In Vega-Lite, displaying temporal-values on a scale is separate from parsing temporal-values. Regardless of the parsing behavior, Vega-Lite's default temporal-scale uses the browser's local time-zone. You can specify a scale in an encoding, for example:

list(
  ...,
  encoding = list(
    x = list(field = "date", type = "temporal", scale = list(type = "utc")),
    ...
  ),
  ...
)

Vega-Lite time unit

For an interactive version of this section on time units, please see this Observable notebook.

Let's say you have daily weather data for, say, Seattle over the course of four years. In this case, we want to load the data into vega using a URL, making this page a little lighter by not carrying all the data around in each of the following charts. Let's have a look at the first few lines:

url_seattle_daily <- "https://vega.github.io/vega-datasets/data/seattle-weather.csv"

url_seattle_daily %>%
  read_lines(n_max = 6) %>%
  print()

As above, this is daily data covering the years 2012-2015 inclusive. Our goal in this section is to show how Vega-Lite time units work by creating charts that:

aggregate the data by month
compare the behavior among the years

To be clear, you could prepare such variables in a pre-processing step, creating new variables in R, using dplyr and lubridate. You can also do this in Vega-Lite, using time unit, a transformation mechanism for temporal values.

Let's start specification for a time-based scatterplot for the daily maximum-temperature, showing every observation in the dataset. For purposes of illustrating how to work with UTC values, we parse the dates as UTC. In the vegaspec below, note the format entry: parse = list(date = "utc:'%Y/%m/%d'"):

spec_daily <-
  list(
    `$schema` = vega_schema(),
    width = 600,
    height = 200,
    data = list(
      url = url_seattle_daily,
      format = list(
        parse = list(date = "utc:'%Y-%m-%d'")
      )
    ),
    mark = "point",
    encoding = list(
      x = list(
        field = "date",
        type = "temporal",
        scale = list(type = "utc")
      ),
      y = list(
        field = "temp_max",
        type = "quantitative",
        scale = list(domain = list(-5, 40))
      ),
      tooltip = list(
        list(
          field = "date", 
          type = "temporal", 
          scale  = list(type = "utc"),
          title = "date",
          format = "%Y-%m-%d"
        ),
        list(field = "temp_max", type = "quantitative")
      )
    )
  ) %>%
  as_vegaspec()

spec_daily

We use point-marks here not because it is optimal for the visualization (it probably isn't), but because it may help make clearer what is happening behind the scenes. Seeing what Vega-Lite actually does can help us to understand its time-unit API.

Let's say we want to look at the maximum temperature for each year-month in the dataset - 48 months in all. If we were to do this in R, we might make a preprocessing step like this:

data_seattle_daily %>%
  mutate(yearmonth = floor_date(date, "month")) %>%
  group_by(yearmonth) %>%
  summarise(temp_max = max(temp_max)) %>%
  glimpse()

We could then plot this data. However, we can make these data transformations within Vega-Lite. We will show this as a series of steps; the first step is to map the values of date to the beginning of the month.

# copy the previous spec
spec_yearmonth <- spec_daily

# modify the x-encoding
spec_yearmonth$encoding$x$scale <- NULL
spec_yearmonth$encoding$x$timeUnit <- "utcyearmonth"

# modify the tooltip-encoding
spec_yearmonth$encoding$tooltip <- 
  append(
    spec_yearmonth$encoding$tooltip, 
    list(
      list(
        field = "date", 
        type = "temporal", 
        timeUnit = "utcyearmonth", 
        title = "yearmonth", 
        format = "%Y-%m-%d"
      )      
    ), 
    after = 1
  )


spec_yearmonth

The main change we made to the specification was to tweak the x-encoding: removing the UTC scale and adding a timeUnit, setting it to "utcyearmonth". The specification of a time unit did three things:

It binned the date variable by truncating the time to include the year and month, hence the "yearmonth" part of the directive. This is equivalent to the dplyr expression mutate(yearmonth = floor_date(date, "month")).
Because we parsed date as UTC, we used the UTC context to determine "when" each month starts. This is why the time-unit directive starts with "utc".
It changed the default formatting for the axes to reflect that we have truncated date to its year-month.

For me, the essence of the concept is that a time unit defines a mapping of an existing temporal-variable to a new temporal-variable; it is a dplyr::mutate() that returns another datetime.

We removed the UTC scale from the x-encoding because this is taken care-of with the specification of "utc" within the time unit. If we were to specify it twice, the UTC-offset would be applied twice.

We modified the tooltip to demonstrate that we have the same underlying data. On the x-axis, we use the "mutated" year-month rather than the date.

The next step is to apply the Vega-Lite equivalents to dplyr's group_by() and summarise() operations.

# copy the previous spec
spec_yearmonth_aggregate <- spec_yearmonth

# add an aggretation directive to the y-encoding
spec_yearmonth_aggregate$encoding$y$aggregate = "max"

# remove the original date from the tooltip, add temperature-aggretation
spec_yearmonth_aggregate$encoding$tooltip[[1]] <- NULL 
spec_yearmonth_aggregate$encoding$tooltip[[2]]$aggregate <- "max"

spec_yearmonth_aggregate

These two operations are undertaken by adding a single specification: the y-encoding directive aggregate = "max". Vega-Lite carries out an implied group_by() on all the encodings are not aggregated. In this case, y (temp_max) is aggregated; we group by our remaining encoding, x, which is now the "year-month".

Also, note that we need to take care with our tooltip definitions. Tooltips are encodings, so the aggregation rule applies to tooltip encodings no different that it applies to other encodings. If something "weird" happens when building a chart that uses implicit aggregation, it can be useful to check the tooltip-definitions for encodings and aggregations that are inconsistent with the rest of the encodings (ask me how I found this out).

We can add an color-encoding for the year, again using a time-unit:

# copy previous spec
spec_yearmonth_aggregate_color <- spec_yearmonth_aggregate

# create encoding for color using "utcyear" time-unit of date
spec_yearmonth_aggregate_color$encoding$color <- 
  list(field = "date", type = "nominal", timeUnit = "utcyear")

# insert a tooltip element to indicate year
spec_yearmonth_aggregate_color$encoding$tooltip <- 
  append(
    spec_yearmonth_aggregate_color$encoding$tooltip,
    list(
      list(
        field = "date", 
        type = "temporal", 
        timeUnit = "utcyear", 
        title = "year", 
        format = "%Y-%m-%d"
      )
    ),
    after = 0
  )


spec_yearmonth_aggregate_color

Again, this is another intermediate step; creating a new internal variable using a "utcyear" time-unit. We see that Vega-Lite does the "right" thing with the color-legend. Using the tooltip, we also see that the values of year are the datetimes corresponding to the first instant in a given year.

In the previous chart, we noted that Vega-Lite will group-by any encoded variable that is not aggregated, then perform the aggregations. In this case, we are grouping by "year" and "year-month", so the end-result of the groupings is the same as the previous example.

Our final chart in this series of examples will be to put each of the years on the same scale; we do this by using a time-unit of "utcmonth" for the x-encoding.

spec_month_aggregate_color <- spec_yearmonth_aggregate_color

spec_month_aggregate_color$encoding$x$timeUnit <- "utcmonth"

spec_month_aggregate_color$encoding$tooltip[[2]]$timeUnit <- "utcmonth"

spec_month_aggregate_color

Looking at the chart itself, we can see that it is has now "arrived" at our destination: we can more-easily compare years because the months are on the same scale. Using the tooltip, you can see that the time-unit "utcmonth" has projected all of the different years into the year 1900. We are showing this projected year on the x-axis, but the axis labels indicate only the month. The specification of the "utcmonth" time-unit did four things:

It binned the date variable by truncating (flooring) the time to the least-significant unit of the time-unit directive: in this case, month.
It projected the binned date using the most-significant units not in the time-unit directive: in this case, year.
Because we parsed date as UTC, we should the UTC context to determine "when" each month starts. This is why the time-unit directive starts with "utc".
It changed the default formatting for the axes to reflect that we have transformed date such that its only significant unit is month.

Summary

A Vega-Lite timeUnit maps a datetime to another datetime.

There are a lot of choices among time units; each choice, e.g. "monthdate", "utcyear", etc., has two dimensions:

to use (or not) the UTC context, indicated by the "utc" prefix
the range of time units to keep, e.g. "monthdate". In this case, time units smaller than date, such as hours, minutes and seconds are truncated; a "flooring" datetime to the date. In this case, time units larger than month are projected into a prototype year, in this case 1900.

Finally, a timeUnit specification will set the default formatting (used for axes) for that variable.

vegawidget/vegawidget documentation built on Jan. 27, 2024, 10:48 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com