knitr::opts_chunk$set(fig.align = 'center',
                      fig.width = 5, fig.height = 4,
                      collapse = TRUE, comment = "#>")

Load libraries

library(tidyverse)
library(rlang)

Load Data

sonde_data <- read_csv('sonde_data.csv',
                       col_types = cols(
                         site_name = col_character(),
                         site = col_character(),
                         dt = col_date(format = ""),
                         month = col_character(),
                         year = col_double(),
                         time = col_time(format = ""),
                         hour = col_double(),
                         depth = col_double(),
                         temp = col_double(),
                         salinity = col_double(),
                         ph = col_double(),
                         pctsat = col_double(),
                         do = col_double(),
                         chl_a_sonde = col_double(),
                         turbidity = col_double(),
                         turbidity_cens = col_logical())) %>%
              rename(sample_date = dt)

Basic Graphics Concepts

Dissolved Oxygen Example

One way that profiles can be reported is on a single panel, with separate profiles (from different dates) shown as separate lines on a graph.

If we follow the usual convention of placing the depth value on the Y axis, with the water's surface symbolically up, we need to:
1. Sort the data in order of depth;
2. Use geom_path();
3. specify scale_y_reverse(); and
4. Select a format for the dates.

Otherwise, this is a fairly simple graphic format, and arguably not worth the effort to draft a function to encapsulate.

sonde_data %>%
  filter(site == 'FR09', year == 2018) %>%
  arrange(sample_date, depth) %>% 
  mutate(txt_date = format(sample_date, format = '%m/%d')) %>%
ggplot(aes(do, depth, color = factor(txt_date))) +
  geom_path() +
   scale_y_reverse() +
  scale_color_discrete(name = '') +
  theme_minimal() +
  ylab('Depth (m)') +
  xlab('')

Draft ptlines() Function

We create a simple function to recreate a graph showing depth profiles. The function does nothing that can' be done with a couple of lines of ggplot2 code. It is of little real value, except as a platform to develop programming parameter passing and error conventions for the package via a very simple example.

We include r's dots here in the function definition. It's not yet clear how dots would be used. Most modifications of the graphics can occur in a ggplot2 workflow.

From a design point of view, we need to decide whether to send dots on to ggplot(), aes() or geom_path(). We make the decision that the most likely use would be to pass additional aesthetics of descriptors into the geom.

Another design issue is whether we want to allow a call without a color parameter, and if so whether we need a group parameter -- but at that point the function is getting too complex for an intial draft of a package.

ptlines <- function(.dt, .x, .y, .val, ...) {
  # These are ugly argument checks, since they don't provide error messages.
  stopifnot(is.data.frame(.dt))

  # We want to be able to accept arguments as unquoted names, quoted names,
  # or expressions that evaluate to either. 
  ind <- ensym(.x)
  dep <- ensym(.y)
  col = ensym(.val)

  # Check that these are data names from the data frame
  # These are not strictly necessary, as other errors will be triggered if we
  # don't catch catch this here.  The main value of an explicit error check
  # may be to prevent inadvertant pulling in data from the calling environment
  # rather than the dat parameter.
  # Do we even what to make this check?  Will it block certain use styles where 
  # the user is passing data assembled outside of the dataframe for one or more
  # of the parameters? Should we be checking lengths and types instead?

  # stopifnot(as_string(ind) %in% names(.dt))
  # stopifnot(as_string(dep) %in% names(.dt))
  # stopifnot(as_string(col) %in% names(.dt))

ggplot(data = NULL, aes(eval_tidy(ind, .dt), 
           eval_tidy(dep, .dt), 
           color = eval_tidy(col, .dt))) +
  geom_path(...) +
  labs(x = as_string(ind), y = as_string(dep), color = as_string(col)) +
  scale_y_reverse()
}

Note that this function can fit into a typical tidyverse ggplot2 workflow.

tmp <- sonde_data %>%
  filter(site == 'FR09', year == 2018) %>%
  arrange(sample_date, depth) %>% 
  mutate(txt_date = format(sample_date, format = '%m/%d')) %>%
  mutate(txt_date = factor(txt_date))

ptlines(tmp, do, depth, txt_date) +
  theme_minimal()

We can pass additional aesthetic parameters to modify the lines.

ptlines(tmp, do, depth, txt_date, lty = 3, size = 1) +
  theme_minimal() +
  scale_color_discrete(name = '')

Error messages are fairly informative even without special handling.

ptlines(tmp, do, bicycle, txt_date) +
  theme_minimal() +
  scale_color_discrete(name = '')

But if we pull in a symbol from the calling environment, the error is less informative. This error is triggered a couple of function calls down, in the scale_y_reversed() call.

bicycle <- 'two wheels'
ptlines(tmp, do, bicycle, txt_date) +
  theme_minimal() +
  scale_color_discrete(name = '')

Temperature Example

One might also want a plot that shows observed "profile" values, symbolized by color, against a grid of dates and depths.

Again, a plot like this practically writes itself in ggplot2. The key decisions are largely aesthetic, and no different than for any other point plot. A special purpose function could save a bit of coding by encapsulating intelligent defaults, but it won't (and should not) alter the basic issues faced by a designer. Two issues -- point design and color palette are especially important.

sonde_data %>%
  filter(site == 'FR09', year == 2018) %>%
ggplot(aes(sample_date, depth, fill = temp)) +
  geom_point(size = 4, shape = 21, color = 'grey50') +
  scale_fill_distiller(name = 'Temperature (C)', 
                       type = 'div',
                       palette = 'RdBu') +

  theme_minimal() +
  ylab('Depth (m)') +
  xlab('') +

  scale_y_reverse() +

  theme(legend.position = 'bottom',
        legend.title = element_text(size = 12),
        legend.text =  element_text(size = 9),
        axis.ticks.length.x = unit(0, 'in')) +
  guides(fill = guide_colourbar(title.position="top", barheight = .5))

Draft ptdots() Function

This is largely identical to the previous function, with different symbolization.

From a design point of view, however, there are a few more issues to address. We again pass dots through the the main geom, but this function will almost always need a custom color scale. We may eventually want to add function parameters for the color scales, with an intelligent default.

ptdots<- function(.dt, .x, .y, .val, ...) {
  # These are ugly argument checks, since they don't provide error messages.
  stopifnot(is.data.frame(.dt))

  # We want to be able to accept arguments as unquoted names, quoted names,
  # or expressions that evaluate to either. 
  ind <- ensym(.x)
  dep <- ensym(.y)
  col = ensym(.val)

ggplot(data = NULL,
       aes(x = eval_tidy(ind, .dt), 
           y = eval_tidy(dep, .dt), 
           color = eval_tidy(col, .dt))) +
  geom_point(...) +
  labs(x = as_string(ind), y = as_string(dep), color = as_string(col)) +
  scale_y_reverse()
}
tmp <- sonde_data %>%
  filter(site == 'FR09', year == 2019) %>%
  arrange(sample_date, depth)

ptdots(tmp, sample_date, depth, temp, size = 3)

I like the look of filled point symbols in this setting. We might want to be able to pass color or fill aesthetics in the original function call. A work around for now is to handle this via parameters passed through to geom_point(). Note that this overrides the default aesthetic assignments.

Notice also that ptdots() does not know to look for temp in the source dataframe, so it needs to be specified explicitly. Presumably this could be corrected by capturing the call and evaluating all arguments (not just the three required ones) in the context of the data parameter. We do not tackle that here.

ptdots(tmp, sample_date, depth, do, 
       shape = 21, aes(fill = tmp$temp), col = 'grey50',
       size = 3,) +
  scale_fill_distiller(name = 'Temperature (C)', 
                       type = 'div',
                       palette = 'RdBu')

Towards a Depth-Time Plot

Often, profile-time data is interpolated to generate a smoothed plot. The smoothed plot looks great, and also helps reveal patterns. The disadvantage is that a smoothed plot may mislead the viewer into thinking you have more data than you actually do.

The ggplot2 functions that create two dimensional graphs of three dimensional data are somewhat confusing to work with. They generally require a regularly spaced "grid" of x and y values. Different functions that superficially do similar things (geom_contour(), geom_contour_filled() and geom_tile(), expect different aesthetics.

So, to create a depth-time plot smoothed in both directions (dates and depths), we need to interpolate from the data to produce equally spaced values.

There is a nice write-up of the logic with simple R code on Dewey Dunnington's fishandwhistle.net blog.

Similar points are made in a response to a Stack Overflow question here: https://stackoverflow.com/questions/63792174/get-geom-contour-geom-tile-to-work-with-my-data The SO answer points towards the akima package, which provides several bivariate interpolation functions, which we could use directly instead of relying on rolling our own bivariate interpolation functions.

Interpolation

R includes a linear interpolation function, approx(). It takes vector values for x and y, and a vector of desired locations (x values) for the estimates. It returns a list with x and y components. Dunnington used this function in his blog, and it is a reasonable first order choice.

We use it here as well, at least to start package development.

But linear interpolation is just one of quite a few possible approaches to take. Many potential smoothers exist in the R universe, including sptlines, lowess and gam smoothers. Any of these could be used to generate estimated values where data is absent.

Formal GAM modeling has the advantage of providing a ready way to fit a two dimensional smooth, with independent control of degree of smoothing along two axes. However, GAM smoothers (like many smoothers) will "underfit" extreme values.

Basic Logic

We can demonstrate the logic of interpolation in one dimension by interpolating temperature data along each depth profiles.

Limnological or oceanographic data are sometimes collected at (nominally) discrete depths, with the same depths used for all vertical profiles. This is especially common for data based on discrete samples collected at depth. However, since the advent of automated "CTD" sensors, "sondes" and data loggers, data is just as likely to be collected at regular time intervals, producing irregular depths that will vary from sampling event to sampling event.

Interpolation along depths can make data from subsequent sampling events easier to compare. Although care must be taken to evaluate whether rapid changes in environmental parameters with depth could be obscured, or thin layers with distinct chemical, biological, or optical properties missed. This can be a significant problem with data from stratified waters, including high latitude lakes, meromictic lakes, and estuaries.

tmp <- sonde_data %>% 
  select(site, sample_date, depth, month, year, temp) %>%
  filter(site == 'FR09', year == 2017, month == 'Jul')
interp <- as_tibble(approx(tmp$depth, tmp$temp, seq(0, 13, by = 0.25 )))
interp

Note that here we are predicting TEMPERATURE based on DEPTH, so the default names x and y are reversed compared to how these values are usually plotted, with depth along the Y axis.

ggplot(tmp, aes(temp, depth)) +
  geom_point(size = 3) +
  geom_point(mapping = aes(y,x), data = interp, shape = 3) +
  scale_y_reverse() +
  #ylab('Depth (m)') +
  #xlab('Temperature (C)')
rm(tmp, interp)

Interpolation Function

We need a function that can interpolate from the surface (depth = 0) to maximum depth. This is just a thin wrapper around approx that handles a little housekeeping setting up a grid over which we want predictions, if one is not provided in the call.

The parameter names .y, .x, and grid are mnemonics, to remind you that the function will estimate the dependent variable .y at the value of .grid based on the [.x, .y] pairs. .grid helps remind you of the intent of the function to provide values at equally spaced values. .res provides an alternative

I avoided using ".x" or ".y", or alternatively ".y" and ".date" because this is a more general function. For the same reason, I omit tests for positive depths. Who knows how this will get used?

If we want to generalize to use more sophisticated smoothers, this is the function we would need to modify, presumably by passing an expression for an alternate smoother or assembling a call from a function name and list of parameters.

The differences between this code and the code in Dunnington's blog mostly reflect checking assumptions about arguments and providing a bit more flexibility in how the function could be used.

The calling conventions here are a bit simpler, as we assume all parameters are data vectors. Most inconsistencies will trigger appropriate errors. Herr we use the dots to pass additional control variables through to the approx() function.

interpol <- function(.x, .y, .grid = NA, .res = NA,
                     .id = '', ...) {
  # Check parameters. We allow date or time coordinates on the horizontal axis
  stopifnot(is.numeric(.x) | inherits(.x, 'Date') | inherits(.x, 'POSIXt'))
  stopifnot( is.numeric(.y))
  if ((missing(.grid )) & (missing(.res))) {
    stop("Must provide either a .grid of new positions at which to interpolate ",
         "or a .res parameter specifying desited interpolation .res.")
  }

  #TODO:  Add error checks to see if the .grid parameter makes sense

  # Check for missing values in the X (`.x`) value -- these are not allowed 
  # with `approx()`.  Missing values in the Y values will be handled according
  # to the setting of `na.rm =`.
  if(any(is.na(.x))) {
    nm = names(.x)
    stop('NAs not permitted in the independent variable. ',
         'Remove NAs from ',
         if_else(nchar(nm) > 0, nm, 'the independent variable'),
         ' and try again.' )
  }

  extra_args = rlang::list2(...)

  # We pass ggrid to the `xout` parameter of `approx()`
  # so passing `xout` or `n` to this function is an error.
  if ('xout' %in% names(extra_args) | 'n' %in% names(extra_args)) {
    stop("An `xout` or `n` parameter was found.  Specify location or ",
         ".res of interpolated values with `.grid` or `.res`.")
  }

  # If we were not given a grid, we need to develop one based on the data
  # and the `.res` parameter.
  if (missing(.grid)) {
    # What is maximum depth for the interpolation?
    # from a design point of view, it may be better to allow this parameter to 
    # be passed to this function, e.g., for consistency across dates.
    max.x = max(.x, na.rm = TRUE)

    # Round max depth up to a whole number for interpolation.  This allows
    # the grid to end on a whole number depth, which will usually be right.
    # Several items, including whether to round, the number of decimals to 
    # round to, and the tolerance for testing equality, might all be better
    # if the user were given direct control via function parameters.
    if ( ! abs((max.x - round(max.x)) < 0.001))
      max.x <- ceiling(max.x)

    # Generate the grid
    # TODO: Consider if there is a way to pass the grid back out of the function 
    # for use in other analyses.
      ggrid = seq(0, max.x, by = .res)
    }
    else {
      ggrid = .grid
    }

  vals <- as_tibble(approx(x = .x, y = .y, xout = ggrid, ...)) %>%
    rename(ind = x, dep = y) %>%
    mutate(id = .id) %>%
    relocate(id)
  return(vals)
}

Testing the Function

tmp <- sonde_data %>% 
  select(site, sample_date, depth, month, year, temp) %>%
  filter(site == 'FR09', year == 2018, month == 'Jul')
interp <- interpol(tmp$depth, tmp$temp, .res = 0.5, na.rm = TRUE)

ggplot(tmp, aes(temp, depth)) +
  geom_point(size = 3) +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3) +
  scale_y_reverse() +
  ggtitle('Fore River Site 09, July 2018')

Pass Other Parameters to approx()

Here's a list of available parameters for 'approx' that don't conflict with our usage conventions here:

'method': c("linear","constant"). while "constant" interpolation is probably not useful in this context, who knows?

'yleft': What to return below lowest .x value?

'yright' What to return above highest .x value?

'rule': An alternative to yleft and yright. What rule to apply to define values outside? 1 = NAs, 2 = extrema. This is probably worth experimenting with for some graphics.

'f': a numeric value that indicates where between two adjacent values to place a "constant" interpolation. 0 implies lowest value. 1 the highest, and 0.5 the midpoint.

'ties': what to do if we have tied X values? Default is to use the mean y value associated with all tied x values.

'na.rm' # As usual, and likely to need setting by users.

interp <- interpol(tmp$depth, tmp$temp, .res = 0.5,
                           na.rm = FALSE, method = 'constant', rule = 2)

ggplot(tmp, aes(temp, depth)) +
  geom_point(size = 3) +
  geom_point(mapping = aes(dep, ind), data = interp, shape = 3, color = 'orange') +
  scale_y_reverse() +
  ggtitle('Fore River Site 09, July 2018')

Apply to Multiple Dates

With an interpolation function available, we need to roll that into a function that will correctly interpolate within dates, but not across dates. We again use .x and .y for the horizontal (usually date) and vertical (usually depths) coordinates.

interpol_along_depths <- function(.dt, .x, .y, .val,
                             .id = '', .res = 0.5) {

  # These are ugly argument checks, since they don't provide nice error messages.
  stopifnot(is.data.frame(.dt))
  stopifnot(length(.res) == 1)

  ddate  <- ensym(.x)
  ddepth <- ensym(.y)
  vvalue <- ensym(.val)

  # Create internal dataframe
  # This is wasteful of memory for large data sets but simplifies coding.
  df <- tibble(xx = eval_tidy(ddate, .dt), 
               yy = eval_tidy(ddepth, .dt),
               zz = eval_tidy(vvalue, .dt))

  # Establish the grid for all dates, despite tides, etc.
  # Since we are headed for a single plot, we set limits for 
  # depth based on the deepest observations from any date.
  # `approx()` will give NA outside the range of observations.

  max_dpth <- max(df$yy)
  # Round max depth up to a whole number for interpolation
  max_dpth <- if_else( (abs(max_dpth - round(max_dpth)) < 0.001),
                      max_dpth, 
                      round(max_dpth + 0.5))
  grid = seq(0, max_dpth, .res)

  # now, we work through each date.  This returns a dataframe for each date.
  # note that we use the `name` parameter to the interpol function to remember
  # the date for each data set.
  prof <- df %>%
    group_by(xx) %>%
    nest() %>%
    mutate(prof = map(data, function(dat) interpol(dat$yy, dat$zz,
                                                  .grid = grid,
                                                  .id = xx)))
  # And we use `reduce()` and `bind_rows()` to combine them to one dataframe
  r = reduce(prof$prof, bind_rows)
  return(r)
}
tmp <- sonde_data %>% 
  select(site, sample_date, depth, month, year, temp) %>%
  filter(site == 'FR09', year == 2018)

test <- interpol_along_depths(tmp, sample_date, depth, temp, .res = 0.5)
ggplot(test, aes(id, ind, color = dep)) +
  geom_point(size = 3) +
  scale_y_reverse() +
  scale_color_gradient2(name = 'Temperature (C)', 
                        midpoint = 15, 
                         high = scales::muted("red"), 
                         low = scales::muted("blue")) +
  ylab('Depth (m)') +
  xlab('Date') +

  theme(legend.position = 'bottom',
        legend.title = element_text(size = 12),
        legend.text =  element_text(size = 9),
        axis.ticks.length.x = unit(0, 'in')) +
  guides(color = guide_colourbar(title.position="top", barheight = .5))

Dates

Now that we have a grid in one direction, we can generate a grid the other way, depth by depth, across dates.

We can use essentially the same logic as before, this time along depths.

We assume this function is being passed a data frame with three items: dates (`.x), depths (.y) and values (.val). It has already gone through interpolation by depth, so it has regularly spaced values along the y axis.

It will return a (larger) dataframe that fills in values between dates

Apply to Multiple Dates

all_dates_fill <- function(.dt, .x, .y, .val, .res = 1) {
  #browser()
  # These are ugly argument checks, since they don't provide nice error messages.
  stopifnot(is.data.frame(.dt))
  stopifnot(length(.res) == 1)

  # Read in names of variables 
  ddate = ensym(.x)
  ddepth <- ensym(.y)
  vvalue <- ensym(.val)

  # Create internal dataframe
  # This is wasteful of memory for large data sets but simplifies coding
  df <- tibble(xx = eval_tidy(ddate, .dt), 
               yy = eval_tidy(ddepth, .dt),
               zz = eval_tidy(vvalue, .dt))

  # Now, we work through each depth
 int_grid <<- df %>%
    group_by(yy) %>%
    nest() %>%
    mutate(int_grid = map(data, function(dat)
                           interpol(df$xx,  dat$zz, 
                                    .res = .res,
                                         .id = yy)))
   rr <-  reduce(res$res, bind_rows)

   rr <- rr %>%
     rename(x = ind, y = id, val = dep)
   return(rr)
}

Test

tt <- all_dates_fill(test, id, ind, dep, .res = 5 )
tt

scale_fill_gradient2('pi0', low = "blue", mid = "white", high = "red", midpoint = 0)

to make plot colours directly comparable add consistent limits to each plot:

scale_fill_gradient2('pi0', low = "blue", mid = "white", high = "red", midpoint = 0,

Preliminary Graphic Output

tt %>%
  filter(! is.na(value)) %>%

ggplot(aes(i_date, depth, fill = value)) +
  geom_tile() +
  geom_point(data = test,
             mapping = aes(id, depth, color = value), 
             shape = 21, size = 1, color = 'grey45') +
  scale_y_reverse() +
  scale_fill_gradient2(name = 'Temperature (C)',
    midpoint = 15, 
    high = scales::muted("red"), 
    low = scales::muted("blue"),
    mid = 'gray95'
  ) +
  ylab('') +
  xlab('') +
  theme_cbep(base_size = 12) +
  #coord_cartesian(expand = FALSE) +

  theme(legend.position = 'bottom',
        legend.title = element_text(size = 12),
        legend.text =  element_text(size = 10),
        axis.ticks.length.x = unit(0, 'in')) +
  guides(fill = guide_colourbar(title.position="top", barheight = .5))


ccb60/tdggraph documentation built on Dec. 19, 2021, 1:58 p.m.