library(dplyr)
library(magrittr)
library(purrr)
library(looplot)
Nested loop plots are a data visualisation for tabular data that contains
or depends on several key parameters.
A basic example is data coming from controlled experiments, in which
measurements are performed under pre-specified conditions. The results of such
an experimental study are then stored along with the pre-specified conditions
that gave rise to them (the key parameters). Often, such data is then
presented in the form of a table to make the performed measurements comparable
across the variety of conditions. However, if there are many different
experimental constraints, the resulting table easily becomes unintelligible.
In such cases, a visual display can aid in the interpretation of the complex
experimental data.
In statistical research, simulation studies are the equivalent of controlled
experiments, and they will be the focus of this vignette.
@NestedLoop suggested the use of nested loop plots to facilitate transparent
reporting of the results from such studies.
In the following, we briefly motivate statistical simulation studies,
explain the basic layout of nested loop plots and demonstrate the use
of the looplot package to create such visualisations.
Simulation studies in statistics or other methodological research fields are controlled experiments used to assess the properties of algorithms, gauge the performance of statistical or machine learning models, and gain insights into complex phenomena which are not readily understood analytically. While simulation studies by nature simplify the "true" underlying mechanisms of interest, they are useful because they allow the complete specification of the "ground truth" against which comparisons can be made, and because the experimental conditions are fully under the control of the experimenter. Such studies typically consist of three major components:
- the data generating mechanism,
- the methods or models to be evaluated,
- the evaluation criteria.
All three together make up what we will call the simulation study design.
The data generating mechanism usually simulates (i.e. creates) data according to pre-defined parameters, e.g. the number of variables (i.e. columns of the data matrix) to be generated, their correlation structure or the number of observations (i.e. rows of the data matrix). These parameters will be called design parameters in the following, and the unique combinations of their values define specific simulation scenarios. Usually, the data generation involves some uncertainty or noise, to mimic the uncertainty that arises when data is obtained through measurements in the real world or sampled from a population. Therefore, each simulation scenario is usually repeated many times to reduce the effect of sampling variability.
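As a minimal sketch of these ideas, the following base R code builds a small set of scenarios and repeats one of them. The parameter names (`n_obs`, `rho`), their values and the function `generate_data` are purely illustrative assumptions, not part of any real study or of the looplot package.

```r
set.seed(1)

# Hypothetical design parameters; every unique combination is one scenario.
scenarios <- expand.grid(n_obs = c(50, 100), rho = c(0, 0.5))

# Hypothetical data generating mechanism: two standard normal variables
# with correlation rho, observed n_obs times.
generate_data <- function(n_obs, rho) {
  x1 <- rnorm(n_obs)
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n_obs)
  data.frame(x1 = x1, x2 = x2)
}

# Repeat the first scenario a few times; each repetition is a new dataset.
n_rep <- 3
datasets <- replicate(
  n_rep,
  generate_data(scenarios$n_obs[1], scenarios$rho[1]),
  simplify = FALSE
)
```

In a real study the number of repetitions would be much larger, and every row of `scenarios` would be simulated, not only the first.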
The methods or models to be evaluated in the study are then applied to the datasets generated by the data generating mechanism and the evaluation criteria are computed, e.g. some measure of deviation from the ground truth. These are subsequently averaged to obtain performance measures for each method for a given simulation scenario.
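The evaluation step can be sketched in the same spirit. The example below is an assumption-laden toy, not taken from the source: two "methods" (sample mean and sample median) estimate a known true mean, the squared error is recorded in every repetition, and the errors are averaged into one performance measure per method.

```r
set.seed(2)

true_mean <- 1   # ground truth fixed by the (hypothetical) scenario
n_obs <- 100
n_rep <- 500

# One repetition: simulate a dataset, apply both methods,
# and record the squared deviation from the ground truth.
one_rep <- function() {
  x <- rnorm(n_obs, mean = true_mean, sd = 2)
  c(mean = (mean(x) - true_mean)^2,
    median = (median(x) - true_mean)^2)
}

# Rows are methods, columns are repetitions.
errors <- replicate(n_rep, one_rep())

# Average over the repetitions: one mean squared error per method.
mse <- rowMeans(errors)
mse
```

Repeating this for every scenario yields one row of summarised results per scenario, which is the kind of table a nested loop plot displays.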
Nested loop plots are a data visualisation for tabular data which is dependent on several design parameters. Their visual appearance is similar to trellis plots. @NestedLoop suggested nested loop plots for displaying the results across the different scenarios of a simulation study. The key idea is to display the design parameters along with the results for each specific scenario they define, and to arrange the display in a meaningful way so that patterns within the results can be clearly distinguished. As suggested in the original publication, design parameters are drawn as step functions below the results. This is especially useful when the experimenter is conducting the simulation study and is looking for interesting ways to summarise the results. The plots also provide a good high-level overview of the study suitable for a publication.
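The idea of drawing a design parameter as a step function below the results can be illustrated with plain ggplot2. This is only a hand-made sketch on made-up numbers; it is not how looplot builds its plots, and the column names (`scenario`, `result`, `param_step`) are assumptions chosen for the example.

```r
library(ggplot2)

# Hypothetical toy results: 3 x 2 scenarios, one made-up result each.
# expand.grid varies the first argument fastest, so n is the inner loop.
scenarios <- expand.grid(n = c(100, 200, 500), param = c(1, 2))
scenarios$scenario <- seq_len(nrow(scenarios))
scenarios$result <- c(5, 3, 2, 9, 6, 4)

# Rescale the outer parameter to a band below the results (y from -3 to -1).
scenarios$param_step <- -3 + 2 * (scenarios$param - min(scenarios$param)) /
  (max(scenarios$param) - min(scenarios$param))

p <- ggplot(scenarios, aes(x = scenario)) +
  geom_line(aes(y = result)) +
  geom_point(aes(y = result)) +
  geom_step(aes(y = param_step), colour = "grey40") +
  labs(x = "Scenario (inner loop: n, outer loop: param)", y = "Result")
```

Reading the step function from left to right shows which value of `param` each block of results belongs to.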
A challenge when using nested loop plots is their complexity and information density. For these reasons, large figure sizes are often necessary to display all of the details accurately. It is not straightforward to design a good-looking nested loop plot with a clear layout and an easily readable display of the data, and manual tweaking of the display is often necessary. This package exists to streamline the creation of nested loop plots and to make them more accessible.
In short, nested loop plots work well for many possible simulation study designs, in particular if:
looplot package
This package is based on ggplot2 and offers several functions which aid in
the process of creating nested loop plots with facetting.
The starting point for the looplot package is a dataframe comprising
the results for a single performance measure from the simulation study,
very similar to how the results could be presented in a table. In fact, one
could think of nested loop plots as a more intuitive, visual way to present
such tabular data.
The dataframe should contain:
We will give an example of such a dataframe in the examples below.
The user then defines the arrangement of the data within the nested loop plot
and how the data should be displayed. There are two ways to work with the
looplot package: one with basic options for users who are less familiar
with the underlying ggplot2 framework, and the other for users who
require modular access to specific parts of the plot to fine-tune the display.
In either case, the resulting plot object is a ggplot2 object and can
therefore be edited using the functions from that package. Furthermore,
many of the options have similar names and behaviour as comparable options
in ggplot2, so some familiarity with the concepts is helpful to make the
most out of this package.
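Editing a ggplot2 object after it has been created looks as follows. The plot below is a stand-in built directly with ggplot2, used here only so that the sketch runs without any simulation results; the same kind of additions should apply to any ggplot2 object.

```r
library(ggplot2)

# Stand-in plot; in practice this would be the plot object under discussion.
p <- ggplot(data.frame(x = 1:3, y = c(2, 1, 3)), aes(x, y)) +
  geom_line()

# Standard ggplot2 labels and themes can be added as usual.
p_edited <- p +
  labs(title = "Edited after creation") +
  theme_bw() +
  theme(legend.position = "bottom")
```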
We will give an overview of many of the available options by demonstrating
several aspects of the looplot package in the following examples.
This vignette makes use of the tidyverse environment of packages and
specifically requires the dplyr, purrr and magrittr packages, besides the
looplot package, to be available. Please refer to
the R environment used to create this vignette for
detailed information.
We generate an artificial example dataframe representing the output of what
could be an actual simulation study, featuring a fully factorial design with 4
design parameters (samplesize, param1, param2, param3). For each combination
of the values of the parameters we generate some artificial data that depends
on these values. Normally, these would correspond to the performance measures
of the methods which are evaluated in the study (method1, method2, method3).
The resulting input dataframe for the nested loop plots is shown below.
It encodes in each line the simulation design parameters for one specific
simulation scenario, as well as the (summarised) results for each of the
methods of interest. Note that the design parameters do not need to be
numeric, but can be ordinal or even categorical.
set.seed(14)
params = list(
  samplesize = c(100, 200, 500),
  param1 = c(1, 2),
  param2 = c(1, 2, 3),
  param3 = c(1, 2, 3, 4)
)
design = expand.grid(params)

# add some "results"
design %<>%
  mutate(method1 = rnorm(n = n(),
                         mean = param1 * (param2 * param3 + 1000 / samplesize),
                         sd = 2),
         method2 = rnorm(n = n(),
                         mean = param1 * (param2 + param3 + 2000 / samplesize),
                         sd = 2),
         method3 = rnorm(n = n(),
                         mean = param1 * (param2 + param3 + 3000 / samplesize),
                         sd = 2))

knitr::kable(head(design, n = 10))
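To illustrate the remark about non-numeric design parameters, the following base R sketch builds a design grid with one numeric, one ordinal and one categorical parameter. The parameter names `noise` and `distribution` are hypothetical, and how such columns are rendered by the plotting functions is not shown here.

```r
# Hypothetical design with mixed parameter types.
params_mixed <- list(
  samplesize = c(100, 200, 500),
  noise = factor(c("low", "medium", "high"),
                 levels = c("low", "medium", "high"), ordered = TRUE),
  distribution = c("normal", "skewed")
)

# One row per scenario; character vectors become factors by default.
design_mixed <- expand.grid(params_mixed)
str(design_mixed)
```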
The key elements of the original nested loop plot as suggested in @NestedLoop are
the display of the results of each individual method and simulation scenario,
as well as the display of the design parameters by virtue of plotting step
functions representing the values of the parameters.
In the looplot package a lot of emphasis is given to the aspect of facetting,
which further divides the plot into small panels and makes the plot visually
less cluttered. Furthermore, facetting makes it easy to zoom into specific
design parameter combinations by drawing only the panels for those
combinations. Each panel then stays the same, so the results can still be
easily compared across different panel configurations.
In addition, this package uses the idea of "smallest plottable units" (spu),
which are defined as all results for simulation scenarios for which only a
single design parameter changes (drawn on the x-axis of the plot),
while all other design parameters remain at a fixed value. By separating such
spus from each other, meaningful subgroups of the results can be visually
identified and represented.
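For the example design used in this vignette, the spus can be counted with dplyr. This sketch assumes that samplesize is the parameter drawn on the x-axis, so one spu corresponds to one combination of the remaining design parameters; the column name `n_scenarios` is chosen for the example.

```r
library(dplyr)

# Same design grid as in the example dataframe, without the results columns.
design <- expand.grid(
  samplesize = c(100, 200, 500),
  param1 = c(1, 2),
  param2 = c(1, 2, 3),
  param3 = c(1, 2, 3, 4)
)

# One spu = one combination of all design parameters except the x-axis one.
spus <- design %>% count(param1, param2, param3, name = "n_scenarios")

nrow(spus)                 # number of spus: 2 * 3 * 4 = 24
unique(spus$n_scenarios)   # scenarios per spu: 3, one per samplesize
```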
When designing a layout for a nested loop plot with facets it is often helpful to think of the context of the simulation study and the intended meaning of the simulation parameters. Do some of them have a natural ordering? Do some of them encode situations with largely different results? Which of the parameters should define the facet grid and which are merely displayed in each panel as step functions?
In our artificial dataset we have the parameter samplesize which has a natural
order and is well suited to define an x-axis along which results can be
arranged.
The design is heavily influenced by param1, which only has two categories
and therefore ideally defines either the columns or the rows of the
facetted plot.
The role of the other parameters is less clear and their influence on the
display of the results can be easily changed in the looplot functions to
interactively get a good understanding of the study results.
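A quick way to support such layout decisions is to count the distinct values of each design parameter. Treating parameters with few values as facet candidates is only a rough heuristic assumed for this sketch, not a rule of the package.

```r
# Same design grid as in the example dataframe, without the results columns.
design <- expand.grid(
  samplesize = c(100, 200, 500),
  param1 = c(1, 2),
  param2 = c(1, 2, 3),
  param3 = c(1, 2, 3, 4)
)

# Number of distinct values per design parameter.
n_levels <- sapply(design, function(col) length(unique(col)))
sort(n_levels)
```

Here param1 has the fewest values (2), which matches its suggested role as a row or column of the facet grid.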
The basic plot interface of the package is a single function with a lot of parameters. This might be very intimidating, so this vignette goes through a lot of them step by step.
The most important settings are the x-axis of the plot, the facet grid layout and the basic parameters for plotting the parameter step functions, which need to be adapted to ensure a proper display. This is hard to automate, so manual tweaking is required to get a nice visualisation. Usually this can be achieved with the following parameters:
- The x, grid_rows, grid_cols and steps arguments define the layout by passing column names of the dataframe as strings.
- The height of the step functions is set via steps_y_height. We also define the top-most value of the step functions via steps_y_base.

Further details of the plot which often need to be taken care of:
- Axis labels, via x_name and y_name.
- Horizontal separation of the spus, via spu_x_shift (spu stands for "smallest plottable unit" as outlined above).
- Annotation of the step values, via steps_values_annotate; the font size of the annotations can be adjusted appropriately for the desired figure size via steps_annotation_size.
- A horizontal reference line, via hline_intercept.
- Extra space on the y-axis, via y_expand_add. Note that this argument does not affect the y-axis data limits but expands the axis. For more details on this functionality, see the documentation of the adjust_ylim function.
- Post-processing with the ggplot2 package by adding custom theme arguments (add_custom_theme).

Many of the plot object manipulations could also be achieved by directly
manipulating the resulting plot object and are incorporated into the
nested_loop_plot function mainly for convenience or for users with little
familiarity with the ggplot2 framework.
p = nested_loop_plot(
  resdf = design %>% rename(`beta[LY]` = param1),
  x = "samplesize", steps = "param3",
  grid_rows = "param2", grid_cols = "beta[LY]",
  steps_y_base = -10, steps_y_height = 5,
  x_name = "Samplesize", y_name = "Error",
  spu_x_shift = 75,
  steps_values_annotate = TRUE, steps_annotation_size = 2.5,
  hline_intercept = 0,
  y_expand_add = c(10, NULL),
  post_processing = list(
    add_custom_theme = list(
      axis.text.x = element_text(angle = -90, vjust = 0.5, size = 8)
    )
  )
)
print(p)
Once presented with such a plot, several patterns clearly emerge from the picture:
- param1 has a strong influence on the data: if it has value 2, then the error generally increases compared to scenarios with value 1.
- With increasing param3 there seems to be a slight trend towards higher error.

While this is only an artificial example, detecting such patterns is one of the goals of a well designed simulation study and could benefit from display with nested loop plots.
The layout of the plots can be easily changed as shown in the following examples. Most of these require manual tweaking of the parameter step functions.
p = nested_loop_plot(
  resdf = design,
  x = "samplesize", steps = c("param2", "param3"),
  grid_rows = "param1",
  steps_y_base = -10, steps_y_height = 3, steps_y_shift = 3,
  x_name = "Samplesize", y_name = "Error",
  spu_x_shift = 75,
  steps_values_annotate = TRUE, steps_annotation_size = 2.5,
  hline_intercept = 0,
  y_expand_add = c(10, NULL),
  post_processing = list(
    add_custom_theme = list(
      axis.text.x = element_text(angle = -90, vjust = 0.5, size = 8)
    )
  )
)
print(p)
p = nested_loop_plot(
  resdf = design,
  x = "samplesize", steps = c("param2", "param3"),
  grid_cols = "param1",
  steps_y_base = -5, steps_y_height = 1, steps_y_shift = 3,
  x_name = "Samplesize", y_name = "Error",
  spu_x_shift = 75,
  steps_values_annotate = TRUE, steps_annotation_size = 2.5,
  hline_intercept = 0,
  y_expand_add = c(10, NULL),
  post_processing = list(
    add_custom_theme = list(
      axis.text.x = element_text(angle = -90, vjust = 0.5, size = 8)
    )
  )
)
print(p)
Nested loop plots in the style of the original publication do not use facetting but only draw the design parameters as step functions. Furthermore, in the original publication of @NestedLoop, the authors use step functions to draw the results as well.
The advantage of not using facetting is that the steps are drawn only once, wasting less screen space on repeated information. A disadvantage is the lack of visual distinction between the blocks of the data, possibly making it harder to spot patterns.
p = nested_loop_plot(
  resdf = design,
  x = "samplesize", steps = c("param1", "param2", "param3"),
  steps_y_base = -5, steps_y_height = 1, steps_y_shift = 3,
  x_name = "Samplesize", y_name = "Error",
  spu_x_shift = 200,
  steps_values_annotate = TRUE, steps_annotation_size = 2.5,
  hline_intercept = 0,
  y_expand_add = c(10, NULL),
  post_processing = list(
    add_custom_theme = list(
      axis.text.x = element_text(angle = -90, vjust = 0.5, size = 5)
    )
  )
)
print(p)
By subsetting the data, one can easily "zoom" into specific panels of the
nested loop plot. We use dplyr::filter to subset the data in the following
example.
design_subset = design %>%
  filter(param1 == 1)

p = nested_loop_plot(
  resdf = design_subset,
  x = "samplesize", steps = c("param2", "param3"),
  grid_rows = "param1",
  steps_y_base = -3, steps_y_height = 1, steps_y_shift = 1,
  x_name = "Samplesize", y_name = "Error",
  spu_x_shift = 75,
  steps_values_annotate = TRUE, steps_annotation_size = 2.5,
  hline_intercept = 0,
  y_expand_add = c(5, NULL),
  post_processing = list(
    add_custom_theme = list(
      axis.text.x = element_text(angle = -90, vjust = 0.5, size = 8)
    )
  )
)
print(p)
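Several conditions can be combined in one dplyr::filter call to zoom in further. The sketch below only performs the subsetting step on the design grid (without the results columns); the chosen values are arbitrary, and the resulting dataframe would be passed on to the plotting function in the same way as design_subset.

```r
library(dplyr)

# Same design grid as in the example dataframe, without the results columns.
design <- expand.grid(
  samplesize = c(100, 200, 500),
  param1 = c(1, 2),
  param2 = c(1, 2, 3),
  param3 = c(1, 2, 3, 4)
)

# Keep only scenarios with param1 equal to 1 and param2 equal to 1 or 3.
design_zoom <- design %>%
  filter(param1 == 1, param2 %in% c(1, 3))

nrow(design_zoom)   # 3 * 1 * 2 * 4 = 24 scenarios remain
```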
Further options and advanced usage of this package are demonstrated in the vignette "Annotated gallery of examples".
sessionInfo()