knitr::opts_chunk$set(
  fig.align = 'center',
  fig.width = 4, 
  fig.height = 3,
  collapse = TRUE,
  comment = "#>"
)

Introduction

"Profile" data is common in both limnology and oceanography. Profile data is data where some sort of water quality (or more rarely, biological) parameter is measured from the water surface to depth. Often, profile data is collected at regular intervals at the same location, producing a three dimensional (time by depth by value) data set.

Common examples of time by depth data include salinity, temperature and dissolved oxygen profiles, but the idea applies whenever a parameter is collected more or less simultaneously along a vertical line.

The goal of this package is to provide a simple interface to ggplot graphics, with intelligent defaults to generate plots useful for this type of data.
In particular, profile data is usually drawn with a reversed vertical (depth) axis, so that the surface (zero depth) is at the top of the figure, with depth increasing towards the bottom of the figure. Similarly, it is common to show zero depth on the plot, even though data collection starts at some finite depth below the water surface.

The tdggraph package is principally designed to produce graphics from season-long series of periodic profile observations. The three main graphics functions create ggplot2 graphic objects with different strengths for emphasizing different aspects of the data.

Dependencies

We load several "tidyverse" package individually here, to minimize unnecessary dependencies in the package. Users can usually load them in one step with library(tidyverse).

library(dplyr)
library(tidyr)
library(purrr)
library(forcats)
library(ggplot2)

library(tdggraph)

Load Example Data

We also create a factor version of the date to simplify examples.

data(dep_sonde) 

dep_sonde <- dep_sonde %>%
  mutate(txt_date = format(sample_date, format = '%m/%d'),
         txt_date = factor(txt_date),
         txt_date = fct_reorder(txt_date, sample_date))

tdggraph Principles

Integration with tidyverse and ggplot

The functions in tdggraph are designed to integrate well with ggplot and dplyr work flows.

The first argument to each plotting function is a source data frame. They can therefore be integrated into a magrittr pipe, like the ggplot() All graphic functions use data masking, so they will look for data passed as function arguments within that data frame, before looking in the enclosing environment. Variable names in the source data frame can therefore "mask" the same variable name from the enclosing environment. This style of function argument will be familiar to users of ggplot() and the tidyverse.

The output of each plotting function is of class ggplot, so users can add additional graphic components to the plot as usual, adding layers, modifying themes, changing axis labels, altering color schemes, transforming axes, and more.

The default reversal of the Y axis can be "undone" by adding scale_y_continuous() to the function output. this will generate a warning, but the Y axis will be returned to "normal" with higher values towards the top of the figure.

Users can pass additional aesthetic specifications as part of the call to any of these functions. Any additional arguments passed to these functions are passed directly to the underlying ggplot2 geoms, allowing users to override aesthetic defaults, most of which are inherited from ggplot(). Additional arguments are passed to geoms as follows:

Passing a data dependent aesthetic specification using aes() requires some special handling. Data masking does not work within the supplementary call to aes(), so users must pass arguments to aes() that refer to objects defined in the enclosing environment. Usually, this just requires an explicit reference to the source data frame (e.g., df$value, not just value). See the example in the ptdots() "Filled Symbols" demo for an example.

All Data From the Enclosing Environment

If all data is available in the enclosing environment, you can pass NULL as the data frame parameter. This is sometimes helpful if you are using the functions inside another function, where you may not want to assemble a data frame.

Order of Arguments

In the current version of the package, all plotting functions take arguments in order of their role in the plot. The order follows the graphical logic of the plot, not the scientific logic. This is a potential source of errors and frustration, so it's worth explaining.

The order of the arguments for the plot functions is always:
data source;
variable that defines x location;
variable that defines y location;
variable that defines color.

From the scientific perspective, all these plots show depth, time, and a measured environmental variable. Those roles do not change, but the way they are depicted on a plot does.

The formal arguments for ptlines() are as follows:

ptlines(.dt, .val, .y, .grp,.sort = TRUE, ...)

Note that in this type of plot the x coordinate reflects the value of the data you are showing, so it comes first. The grouping variable defines the color, so it comes last.

The first few formal arguments for ptdots() and ptsmooth() are similar to each other:

ptdots(.dt, .x, .y, .val, ...)

ptsmooth(.dt, .x, .y, .val,
  .res_x = 0.5, .res_y = 2, 
  y_grow_grid = TRUE, y_with_zero = TRUE
)

The data source comes first, then the x coordinate (usually a time coordinate), then the y coordinate (usually depth) then the value that defines the color (the measured environmental variable).

ptlines() Demo

the ptlines() function generates a line graph with vertical profiles shown as lines. Different profiles are distinguished by color. This is an excellent format to show subtle differences in profiles between a relatively modest number of samples, as viewers can easily compare profiles against each other. It is not as good if there are many samples, as a large number of lines begins to look like spaghetti.

This function can also be used to plot profile data from multiple sites, as the logic of the plot works with qualitative, not quantitative grouping variables.

Data is preferentially found inside the data frame passed as the first argument. Arguments are in the order of x coordinate -- here a value , y coordinate -- usually the depth -- and a grouping variable -- often a date or a location. The grouping variable should evaluate to a factor or character vector, as it will always be used to divide data into separate profiles.

ptlines(dep_sonde, temp, depth, txt_date, .sort = FALSE)

Integration with tidyverse and ggplot

The function is designed to fit into a typical tidyverse work flow, for example, you can feed data into it via a pipe, and add components to the resulting ggplot object:

ptlines(dep_sonde, do, depth, txt_date, size = 1.5, lty = 2) +
  theme_minimal() +
  scale_color_brewer(type = 'qual', name = '') +
  ylab('Depth (m)') +
  xlab('Dissolved Oxygen (mg/l)')

Any supplementary aesthetics defined in mapping = aes() must refer to names from the enclosing environment. See the example in the ptdots() "Filled Symbols" demo for an example.

Data From the Enclosing Environment

If all data is available in the enclosing environment, you can pass NULL as the data frame parameter. This is helpful if you are using the functions inside another function, where you may not want to assemble a data frame.

td <- dep_sonde$txt_date
d = dep_sonde$depth
ox = dep_sonde$do

ptlines(NULL, ox, d, td, .sort = FALSE)

Data Order Matters

Under the hood this function uses geom_path() and not geom_line(), so the order of the data matters. geom_path() connects points in the order of their occurrence in the source data, while geom_line() internally orders data according to the x coordinate before plotting.

The DEP sonde data is already sorted by date and depth, which is why we were able to ignore this detail up until now. We demonstrate what happens by resorting the data we just plotted. By sorting each component of the data to the same random order, we ensure we are plotting the same data (x,y,z) triplets, only they are presented in a new random order.

You get spaghetti:

rand_order <- order(rnorm(length(td)))
td2 <- td[rand_order]
d2 <- d[rand_order]
ox2 <- ox[rand_order]
ptlines(NULL, ox2, d2, td2, .sort = FALSE)

The .sort Argument

it is a good practice to sort your data yourself before passing it to this function, but if you're in a hurry, there's a shortcut that does the job for you. You control this behavior with the `.sort' parameter, which is TRUE by default. But be forwarned: results may not quite be what you expected or hoped for.

ptlines(NULL, ox2, d2, td2, .sort = TRUE) +
  theme_minimal()

Passing a Grouping Variable

Sometimes this gets tricky. For example, suppose you want to symbolize the data by month, not by date. You have multiple sampling dates within some months. By default, geom_path() will connect all the dots in the order of the source data or the data sorted by depth, so neither .sort = TRUE or .sort = FALSE works quite as expected:

With .sort = TRUE, you get zigzags:

ptlines(dep_sonde, temp, depth, month, .sort = TRUE)

With .sort = FALSE, things look better, because the source data was already sorted, but you get unnecessary connecting lines from the bottom of one profile to the top of the next.

ptlines(dep_sonde, temp, depth, month, .sort = FALSE)

The solution is to pass a grouping variable. Note: this behaves strangely with .sort = TRUE, so you should sort your data before passing it to ptlines().

ptlines(dep_sonde, temp, depth, month,  .sort = FALSE,
        mapping = aes(group = dep_sonde$txt_date))

ptdots() Demo

The dots provide an alternative way of depicting similar data. Here, the graphic emphasizes dates over the course of a season, at the expense of making subtle differences in profiles more difficult to assess. Default colors from ggplot2 are often a poor choice here.

ptdots(dep_sonde, sample_date, depth, do, size = 5) +
  theme_minimal() +
  scale_color_gradient2(name = 'Oxygen, (mg/l)',
                        high = scales::muted("blue"), 
                        low = scales::muted("red"),
                        mid = 'gray95',
                        midpoint = 9)

Data from the Enclosing Environment

As before, we can pass NULL as a data frame parameter if all data is available in the enclosing environment.

sd <- dep_sonde$sample_date
ptdots(NULL, sd, d, ox, size = 5) +
  theme_minimal()

Filled Symbols

Any supplementary aesthetics defined in mapping = aes() must refer to names from the enclosing environment. As a result, using filled point symbols. (R point shapes 21 through 25) takes several additional steps:

  1. Add mapping = aes(fill = ?) to the function call, pointing to data in the enclosing environment, usually by qualifying the name of the data frame explicitly with $.

  2. Pass a shape argument selecting the filled symbol shape

  3. Usually, pass a color argument to define the fixed dot outline color. (color = 'gray50' is often a good starting point). If you don't specify a single color, to override the normal color aesthetic, the outline color will vary, which is probably not what you want.

  4. Usually, you will want to supplement the call to this function with a call to scale_fill_viridis_c(), scale_fill_distiller(), or scale_fill_continuous() or scale_fill_gradient() etc. to define a fill color scale that works better than the ggplot2 default.

ptdots(dep_sonde, sample_date, depth, do, size = 3, 
       shape = 22, color = 'gray25',
       mapping = aes(fill = dep_sonde$do)) +
  theme_minimal() +
  scale_fill_gradient(name = 'Oxygen, (mg/l)',
                        high = "paleturquoise1", 
                        low = scales::muted("darkgreen")) +
  xlab('') +
  ylab('Depth (m)')

ptsmooth() demo

This function produces a two dimensional interpolation plot. It can be thought of as a smoothed out version of ptdots(). It is especially useful for "eyeballing" seasonal patterns from periodic data. As before, default colors are seldom the most informative. Currently, NAs replace any grid cells below the lowermost (highest Y value) data points.

data(dep_sonde)
plt <- ptsmooth(dep_sonde, sample_date, depth, temp, 
         .res_x = 1, .res_y = .25, 
         y_grow_grid = TRUE,
         y_with_zero = TRUE) +
scale_fill_gradient(name = 'Oxygen, (mg/l)',
                        high = scales::muted("blue"), 
                        low = scales::muted("red", l = 40))
plt

It is easy to overlay the locations of the original observations if you want to show viewers where you have actual observations

plt +
  geom_point(mapping = aes(sample_date, depth), data = dep_sonde,
             shape = 21, size = 3, fill = NA)

interpol() Demo

This function is a thin wrapper around base R's approx() function, which implements simple linear interpolation. The major difference is that the return value of interpol() is a data frame, not a list. The function is used "under the hood" to generate graphics interpolated in two directions without clipping.

The basic idea for this function and its use in generating smoothed graphics for profile data were inspired by Dewey Dunnington's fishandwhistle.net blog.

By Depths / Numeric Values

dat <- dep_sonde[dep_sonde$month == 'Jul',]
grid <- seq(0, 14, 0.75)
interp <- interpol(dat$depth, dat$temp, .grid = grid)

ggplot(dat, aes(temp, depth)) +
  geom_point(size = 3, col = 'red') +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3) +
  scale_y_reverse()

By Dates / Times

tmp <- dep_sonde %>% 
  select(site, sample_date, depth, month, temp) %>%
  group_by(sample_date) %>%
  filter(depth == min(depth)) %>%
  ungroup(sample_date)

grid <- seq(min(tmp$sample_date), max(tmp$sample_date), 7)

interp <- interpol(tmp$sample_date, tmp$temp, .grid = grid, 
                   grow_grid = FALSE,
                   na.rm = TRUE)

ggplot(tmp, aes(temp, sample_date)) +
  geom_point(size = 3, col = 'red') +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3) +
  ggtitle('Fore River Site 09, July 2018')

Functional Programming

interpol() includes an optional.name` argument, which can be useful when assembling larger data frames through working with nested tibbles or functional programming. For example, one might often want to interpolate within dates, not between dates.

We calculate a grid to use for each date, then use a nested tibble to interpolate within each date before using reduce() to generate a tidy (long-form) tibble containing the results of the interpolation.

lims <- range(dep_sonde$depth) 
max_y = ceiling(lims[2])
min_y = floor(lims[1])
ygrid = seq(min_y, max_y, by = 0.5)

profs <- dep_sonde %>%
    group_by(sample_date) %>%
    nest() %>%
    mutate(prof = map(data, function(dat) interpol(dat$depth, dat$do, 
                                                   .name = sample_date,
                                                   .grid = ygrid)))
profs <- reduce(profs$prof, bind_rows) %>%
  mutate(txt_date = factor(format(id, format = '%m/%d')),
         txt_date = fct_reorder(txt_date, id)) %>%
  rename(do = dep, depth = ind)
ggplot(profs, mapping = aes(x = do, y = depth)) +
  geom_point(aes(color = txt_date))+
  geom_path(aes(color = txt_date)) +
  geom_point(data = dep_sonde, mapping = aes(x = do, y = depth))

Passing an Empty Dataframe

Should return a data frame with NA in second variable. This is useful for use in functional programming, since it correctly propagates missing values.

t <- unique(dep_sonde$sample_date) %>%
  sort
df <- tibble(x = t, y = NA_integer_)
interp <- interpol(df$x, df$y, .grid = grid, with_zero = FALSE, 
                   grow_grid = FALSE, na.rm = TRUE)
head(interp)

interpol_res() Demo

interpol_res() is an only slightly more consequential extension of base R's approx() function. It's primary claim to fame is that it allows the user to specify a resolution for evenly spaced interpolations. (approx() allows you to specify the number of interpolated points, but not their spacing). Like interpol(), it outputs a data frame, not a list.

By Depths / Numeric Values

data(dep_sonde)
dat <- dep_sonde[dep_sonde$month == 'Jul',]
interp <- interpol_res(dat$depth, dat$temp, .res = 0.5)

ggplot(dat, aes(temp, depth)) +
  geom_point(size = 3, col = 'red') +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3) +
  scale_y_reverse()

By Dates / Times

tmp <- dep_sonde %>% 
  select(site, sample_date, depth, month, temp) %>%
  group_by(sample_date) %>%
  filter(depth == min(depth)) %>%
  ungroup(sample_date)

interp <- interpol_res(tmp$sample_date, tmp$temp, .res = 5, 
                   with_zero = FALSE, na.rm = TRUE)

ggplot(tmp, aes(temp, sample_date)) +
  geom_point(size = 3, col = 'red') +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3) +
  ggtitle('Fore River Site 09, July 2018')

Gotchas

Altering the Vertical Scale on a "Reversed" Y Axis

Since the functions already define a "reversed" y axis, you need to be explicit altering the y axis again. Sometimes this leads to unexpected results.

For example, when you alter the limits of the y axis, you need to provide the scale limits in reversed order (the way they are plotted). This is true both for scale_y_reversed() and for coord_cartesian(). Note that these two commands do not quite generate identical plots, as coord_cartesian() will not prune data.

plt <- ptdots(dep_sonde, sample_date, depth, do, size = 3)

plt + scale_y_reverse(limits = c(10, 0))

plt + coord_cartesian(ylim = c(10,0))

Known Problems

Functions Do Not Work with facet_wrap() or facet_grid()

Because of some initial programming choices, none of the functions currently pass the source data frame to the ggplot geoms internally. This underlies some of the quirks of using the functions that have already been described. However, it also means the functions can not be used with facets, even though it is natural to want to do so.

Create "Long Form" Data

We create a selected data set to show what happens with each of the functions

dep_long <- dep_sonde %>%
  select(-ph, -chl_a_sonde, -turbidity, -turbidity_cens) %>%
  pivot_longer(temp:do, names_to = 'parameter', values_to = 'value')

Example with ptlines()

It is easy to generate a faceted graphic like our ptlines() function output by hand:

tmp <- dep_long %>%
  mutate(dates = factor(sample_date)) %>%
  arrange(sample_date, depth)

plt <- ggplot(tmp, aes(x = value, y = depth, color = dates)) +
  geom_path()
plt + 
  facet_wrap(~parameter, scales = 'free_x') +
  scale_y_reverse()

But we can't if we try to do it directly. Similar problems arise with ptdots() and ptsmooth()

plt <- ptlines(tmp,  value,  depth, dates,
              mapping = aes(lty = dep_long$parameter))
plt
plt + facet_wrap(~parameter)

This is going to be difficult to solve for ptsmooth(), since we are not actually plotting the original dataframe, but a derived one. That will probably require developing an new geom following ideas in "Extending ggplot" https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html. WE are actively seaking solutions for the other two functions.



ccb60/tdggraph documentation built on Dec. 19, 2021, 1:58 p.m.