knitr::opts_chunk$set( fig.align = 'center', fig.width = 4, fig.height = 3, collapse = TRUE, comment = "#>" )
"Profile" data is common in both limnology and oceanography. Profile data is data where some sort of water quality (or more rarely, biological) parameter is measured from the water surface to depth. Often, profile data is collected at regular intervals at the same location, producing a three dimensional (time by depth by value) data set.
Common examples of time by depth data include salinity, temperature and dissolved oxygen profiles, but the idea applies whenever a parameter is collected more or less simultaneously along a vertical line.
The goal of this package is to provide a simple interface to ggplot graphics,
with intelligent defaults to generate plots useful for this type of data.
In particular, profile data is usually drawn with a reversed vertical (depth)
axis, so that the surface (zero depth) is at the top of the figure, with depth
increasing towards the bottom of the figure. Similarly, it is common to show
zero depth on the plot, even though data collection starts at some finite
depth below the water surface.
The tdggraph package is principally designed to produce graphics from
season-long series of periodic profile observations. The three main graphics
functions create ggplot2 graphic objects with different strengths for
emphasizing different aspects of the data.
We load several "tidyverse" package individually here, to minimize
unnecessary dependencies in the package. Users can usually load them in one
step with library(tidyverse).
library(dplyr) library(tidyr) library(purrr) library(forcats) library(ggplot2) library(tdggraph)
We also create a factor version of the date to simplify examples.
data(dep_sonde) dep_sonde <- dep_sonde %>% mutate(txt_date = format(sample_date, format = '%m/%d'), txt_date = factor(txt_date), txt_date = fct_reorder(txt_date, sample_date))
tdggraph Principlestidyverse and ggplotThe functions in tdggraph are designed to integrate well with ggplot and dplyr
work flows.
The first argument to each plotting function is a source data frame. They can
therefore be integrated into a magrittr pipe, like the ggplot() All
graphic functions use data masking, so they will look for data passed as
function arguments within that data frame, before looking in the enclosing
environment. Variable names in the source data frame can therefore "mask"
the same variable name from the enclosing environment. This style of function
argument will be familiar to users of ggplot() and the tidyverse.
The output of each plotting function is of class ggplot, so users can add
additional graphic components to the plot as usual, adding layers, modifying
themes, changing axis labels, altering color schemes, transforming axes, and
more.
The default reversal of the Y axis can be "undone" by adding
scale_y_continuous() to the function output. this will generate a warning,
but the Y axis will be returned to "normal" with higher values towards the top
of the figure.
Users can pass additional aesthetic specifications as part of the
call to any of these functions. Any additional arguments passed to these
functions are passed directly to the underlying ggplot2 geoms, allowing
users to override aesthetic defaults, most of which are inherited from
ggplot(). Additional arguments are passed to geoms as follows:
ptlines() -> geom_path() ptdots() -> geom_dots() ptsmooth() -> geom_tile() Passing a data dependent aesthetic specification using aes() requires some
special handling. Data masking does not work within the supplementary call to
aes(), so users must pass arguments to aes() that refer to objects defined
in the enclosing environment. Usually, this just requires an explicit reference
to the source data frame (e.g., df$value, not just value). See the example
in the ptdots() "Filled Symbols" demo for an example.
If all data is available in the enclosing environment, you can pass NULL as the data frame parameter. This is sometimes helpful if you are using the functions inside another function, where you may not want to assemble a data frame.
In the current version of the package, all plotting functions take arguments in order of their role in the plot. The order follows the graphical logic of the plot, not the scientific logic. This is a potential source of errors and frustration, so it's worth explaining.
The order of the arguments for the plot functions is always:
data source;
variable that defines x location;
variable that defines y location;
variable that defines color.
From the scientific perspective, all these plots show depth, time, and a measured environmental variable. Those roles do not change, but the way they are depicted on a plot does.
The formal arguments for ptlines() are as follows:
ptlines(.dt, .val, .y, .grp,.sort = TRUE, ...)
Note that in this type of plot the x coordinate reflects the value of the data you are showing, so it comes first. The grouping variable defines the color, so it comes last.
The first few formal arguments for ptdots() and ptsmooth() are similar to
each other:
ptdots(.dt, .x, .y, .val, ...) ptsmooth(.dt, .x, .y, .val, .res_x = 0.5, .res_y = 2, y_grow_grid = TRUE, y_with_zero = TRUE )
The data source comes first, then the x coordinate (usually a time coordinate), then the y coordinate (usually depth) then the value that defines the color (the measured environmental variable).
ptlines() Demothe ptlines() function generates a line graph with vertical profiles shown as
lines. Different profiles are distinguished by color. This is an excellent
format to show subtle differences in profiles between a relatively modest number
of samples, as viewers can easily compare profiles against each other. It is
not as good if there are many samples, as a large number of lines begins to look
like spaghetti.
This function can also be used to plot profile data from multiple sites, as the logic of the plot works with qualitative, not quantitative grouping variables.
Data is preferentially found inside the data frame passed as the first argument. Arguments are in the order of x coordinate -- here a value , y coordinate -- usually the depth -- and a grouping variable -- often a date or a location. The grouping variable should evaluate to a factor or character vector, as it will always be used to divide data into separate profiles.
ptlines(dep_sonde, temp, depth, txt_date, .sort = FALSE)
tidyverse and ggplotThe function is designed to fit into a typical tidyverse work flow, for example, you can feed data into it via a pipe, and add components to the resulting ggplot object:
ptlines(dep_sonde, do, depth, txt_date, size = 1.5, lty = 2) + theme_minimal() + scale_color_brewer(type = 'qual', name = '') + ylab('Depth (m)') + xlab('Dissolved Oxygen (mg/l)')
Any supplementary aesthetics defined in mapping = aes() must refer to names
from the enclosing environment. See the example in the ptdots() "Filled
Symbols" demo for an example.
If all data is available in the enclosing environment, you can pass NULL as the data frame parameter. This is helpful if you are using the functions inside another function, where you may not want to assemble a data frame.
td <- dep_sonde$txt_date d = dep_sonde$depth ox = dep_sonde$do ptlines(NULL, ox, d, td, .sort = FALSE)
Under the hood this function uses geom_path() and not geom_line(), so
the order of the data matters. geom_path() connects points in the order of
their occurrence in the source data, while geom_line() internally orders data
according to the x coordinate before plotting.
The DEP sonde data is already sorted by date and depth, which is why we were able to ignore this detail up until now. We demonstrate what happens by resorting the data we just plotted. By sorting each component of the data to the same random order, we ensure we are plotting the same data (x,y,z) triplets, only they are presented in a new random order.
You get spaghetti:
rand_order <- order(rnorm(length(td))) td2 <- td[rand_order] d2 <- d[rand_order] ox2 <- ox[rand_order] ptlines(NULL, ox2, d2, td2, .sort = FALSE)
.sort Argumentit is a good practice to sort your data yourself before passing it to this function, but if you're in a hurry, there's a shortcut that does the job for you. You control this behavior with the `.sort' parameter, which is TRUE by default. But be forwarned: results may not quite be what you expected or hoped for.
ptlines(NULL, ox2, d2, td2, .sort = TRUE) + theme_minimal()
Sometimes this gets tricky. For example, suppose you want to symbolize
the data by month, not by date. You have multiple sampling dates within some
months. By default, geom_path() will connect all the dots in the order of
the source data or the data sorted by depth, so neither .sort = TRUE or
.sort = FALSE works quite as expected:
With .sort = TRUE, you get zigzags:
ptlines(dep_sonde, temp, depth, month, .sort = TRUE)
With .sort = FALSE, things look better, because the source data was already
sorted, but you get unnecessary connecting lines from the bottom of one profile
to the top of the next.
ptlines(dep_sonde, temp, depth, month, .sort = FALSE)
The solution is to pass a grouping variable. Note: this behaves strangely with
.sort = TRUE, so you should sort your data before passing it to ptlines().
ptlines(dep_sonde, temp, depth, month, .sort = FALSE, mapping = aes(group = dep_sonde$txt_date))
ptdots() DemoThe dots provide an alternative way of depicting similar data. Here, the
graphic emphasizes dates over the course of a season, at the expense
of making subtle differences in profiles more difficult to assess. Default
colors from ggplot2 are often a poor choice here.
ptdots(dep_sonde, sample_date, depth, do, size = 5) + theme_minimal() + scale_color_gradient2(name = 'Oxygen, (mg/l)', high = scales::muted("blue"), low = scales::muted("red"), mid = 'gray95', midpoint = 9)
As before, we can pass NULL as a data frame parameter if all data is available in the enclosing environment.
sd <- dep_sonde$sample_date ptdots(NULL, sd, d, ox, size = 5) + theme_minimal()
Any supplementary aesthetics defined in mapping = aes() must refer to names
from the enclosing environment. As a result, using filled point symbols. (R
point shapes 21 through 25) takes several additional steps:
Add mapping = aes(fill = ?) to the function call, pointing
to data in the enclosing environment, usually by qualifying the name of
the data frame explicitly with $.
Pass a shape argument selecting the filled symbol shape
Usually, pass a color argument to define the fixed dot outline color.
(color = 'gray50' is often a good starting point). If you don't specify
a single color, to override the normal color aesthetic, the outline color
will vary, which is probably not what you want.
Usually, you will want to supplement the call to this function with a
call to scale_fill_viridis_c(), scale_fill_distiller(), or
scale_fill_continuous() or scale_fill_gradient() etc. to define a fill
color scale that works better than the ggplot2 default.
ptdots(dep_sonde, sample_date, depth, do, size = 3, shape = 22, color = 'gray25', mapping = aes(fill = dep_sonde$do)) + theme_minimal() + scale_fill_gradient(name = 'Oxygen, (mg/l)', high = "paleturquoise1", low = scales::muted("darkgreen")) + xlab('') + ylab('Depth (m)')
ptsmooth() demoThis function produces a two dimensional interpolation plot. It can be thought
of as a smoothed out version of ptdots(). It is especially useful for
"eyeballing" seasonal patterns from periodic data. As before, default colors
are seldom the most informative. Currently, NAs replace any grid cells below
the lowermost (highest Y value) data points.
data(dep_sonde) plt <- ptsmooth(dep_sonde, sample_date, depth, temp, .res_x = 1, .res_y = .25, y_grow_grid = TRUE, y_with_zero = TRUE) + scale_fill_gradient(name = 'Oxygen, (mg/l)', high = scales::muted("blue"), low = scales::muted("red", l = 40)) plt
It is easy to overlay the locations of the original observations if you want to show viewers where you have actual observations
plt + geom_point(mapping = aes(sample_date, depth), data = dep_sonde, shape = 21, size = 3, fill = NA)
interpol() DemoThis function is a thin wrapper around base R's approx() function, which
implements simple linear interpolation. The major difference is that the return
value of interpol() is a data frame, not a list. The function is used "under
the hood" to generate graphics interpolated in two directions without clipping.
The basic idea for this function and its use in generating smoothed graphics for profile data were inspired by Dewey Dunnington's fishandwhistle.net blog.
dat <- dep_sonde[dep_sonde$month == 'Jul',] grid <- seq(0, 14, 0.75) interp <- interpol(dat$depth, dat$temp, .grid = grid) ggplot(dat, aes(temp, depth)) + geom_point(size = 3, col = 'red') + geom_point(mapping = aes(dep, ind), data = interp, shape = 3) + scale_y_reverse()
tmp <- dep_sonde %>% select(site, sample_date, depth, month, temp) %>% group_by(sample_date) %>% filter(depth == min(depth)) %>% ungroup(sample_date) grid <- seq(min(tmp$sample_date), max(tmp$sample_date), 7) interp <- interpol(tmp$sample_date, tmp$temp, .grid = grid, grow_grid = FALSE, na.rm = TRUE) ggplot(tmp, aes(temp, sample_date)) + geom_point(size = 3, col = 'red') + geom_point(mapping = aes(dep, ind), data = interp, shape = 3) + ggtitle('Fore River Site 09, July 2018')
interpol() includes an optional.name` argument, which can be useful when
assembling larger data frames through working with nested tibbles or functional
programming. For example, one might often want to interpolate within dates,
not between dates.
We calculate a grid to use for each date, then use a nested tibble to
interpolate within each date before using reduce() to generate a tidy
(long-form) tibble containing the results of the interpolation.
lims <- range(dep_sonde$depth) max_y = ceiling(lims[2]) min_y = floor(lims[1]) ygrid = seq(min_y, max_y, by = 0.5) profs <- dep_sonde %>% group_by(sample_date) %>% nest() %>% mutate(prof = map(data, function(dat) interpol(dat$depth, dat$do, .name = sample_date, .grid = ygrid))) profs <- reduce(profs$prof, bind_rows) %>% mutate(txt_date = factor(format(id, format = '%m/%d')), txt_date = fct_reorder(txt_date, id)) %>% rename(do = dep, depth = ind)
ggplot(profs, mapping = aes(x = do, y = depth)) + geom_point(aes(color = txt_date))+ geom_path(aes(color = txt_date)) + geom_point(data = dep_sonde, mapping = aes(x = do, y = depth))
Should return a data frame with NA in second variable. This is useful for use in functional programming, since it correctly propagates missing values.
t <- unique(dep_sonde$sample_date) %>% sort df <- tibble(x = t, y = NA_integer_) interp <- interpol(df$x, df$y, .grid = grid, with_zero = FALSE, grow_grid = FALSE, na.rm = TRUE) head(interp)
interpol_res() Demointerpol_res() is an only slightly more consequential extension of base R's
approx() function. It's primary claim to fame is that it allows the user to
specify a resolution for evenly spaced interpolations. (approx() allows you to
specify the number of interpolated points, but not their spacing). Like
interpol(), it outputs a data frame, not a list.
data(dep_sonde) dat <- dep_sonde[dep_sonde$month == 'Jul',] interp <- interpol_res(dat$depth, dat$temp, .res = 0.5) ggplot(dat, aes(temp, depth)) + geom_point(size = 3, col = 'red') + geom_point(mapping = aes(dep, ind), data = interp, shape = 3) + scale_y_reverse()
tmp <- dep_sonde %>% select(site, sample_date, depth, month, temp) %>% group_by(sample_date) %>% filter(depth == min(depth)) %>% ungroup(sample_date) interp <- interpol_res(tmp$sample_date, tmp$temp, .res = 5, with_zero = FALSE, na.rm = TRUE) ggplot(tmp, aes(temp, sample_date)) + geom_point(size = 3, col = 'red') + geom_point(mapping = aes(dep, ind), data = interp, shape = 3) + ggtitle('Fore River Site 09, July 2018')
Since the functions already define a "reversed" y axis, you need to be explicit altering the y axis again. Sometimes this leads to unexpected results.
For example, when you alter the limits of the y axis, you need to provide
the scale limits in reversed order (the way they are plotted). This is true
both for scale_y_reversed() and for coord_cartesian(). Note that these two
commands do not quite generate identical plots, as coord_cartesian() will not
prune data.
plt <- ptdots(dep_sonde, sample_date, depth, do, size = 3) plt + scale_y_reverse(limits = c(10, 0)) plt + coord_cartesian(ylim = c(10,0))
facet_wrap() or facet_grid()Because of some initial programming choices, none of the functions currently pass the source data frame to the ggplot geoms internally. This underlies some of the quirks of using the functions that have already been described. However, it also means the functions can not be used with facets, even though it is natural to want to do so.
We create a selected data set to show what happens with each of the functions
dep_long <- dep_sonde %>% select(-ph, -chl_a_sonde, -turbidity, -turbidity_cens) %>% pivot_longer(temp:do, names_to = 'parameter', values_to = 'value')
ptlines()It is easy to generate a faceted graphic like our ptlines() function output
by hand:
tmp <- dep_long %>% mutate(dates = factor(sample_date)) %>% arrange(sample_date, depth) plt <- ggplot(tmp, aes(x = value, y = depth, color = dates)) + geom_path() plt + facet_wrap(~parameter, scales = 'free_x') + scale_y_reverse()
But we can't if we try to do it directly. Similar problems arise with
ptdots() and ptsmooth()
plt <- ptlines(tmp, value, depth, dates, mapping = aes(lty = dep_long$parameter)) plt
plt + facet_wrap(~parameter)
This is going to be difficult to solve for ptsmooth(), since we are not
actually plotting the original dataframe, but a derived one. That will
probably require developing an new geom following ideas in "Extending ggplot"
https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html.
WE are actively seaking solutions for the other two functions.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.