``` {r}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 5,
  fig.height = 5
)
library(corrmorant)
library(dplyr)
```
corrmorant is a ggplot2 extension for creating scatterplot matrices and correlation plots with a slightly modified ggplot2 syntax. The package offers both a simple standard function that creates scatterplot matrices with reasonable display settings from a minimum number of arguments (`corrmorant()`) and the possibility to build plots layer by layer from any kind of user-specified geoms based on `ggcorrm()`. Moreover, it provides a large number of geoms and stats that automate common tasks in the display of scatterplot matrices.
The `corrmorant()` function is a simple wrapper around the more complex `ggcorrm()` function that can be used to create first, simple plots of correlation matrices. Currently, four different styles are available: `"blue_red"` (the default), `"light"`, `"dark"` and `"binned"`. The first three are illustrated here with the `drosera` dataset accompanying the package:
``` {r, eval=FALSE}
corrmorant(drosera, style = "light")
corrmorant(drosera, style = "dark")
corrmorant(drosera, style = "blue_red")
```

<img src="corrmorant_examples.png" width=700/>

It is clearly visible that none of these styles is very useful for this dataset, as the three sundew species in the `drosera` dataset differ considerably in their leaf morphology. For more appropriate displays of that dataset, see below in the **ggcorrm() examples** chapter.

The `"binned"` style is a useful preset for very large datasets: it groups the data into bins (using a 10 by 10 grid in the standard settings) and plots each bin as a point whose size is scaled by the number of observations, which speeds up plotting considerably. For this reason, this style is useful e.g. for inspecting posterior correlations in MCMC draws.

``` {r}
# simulate large correlated dataset from a multivariate normal distribution
set.seed(111)                                          # set random seed
A <- matrix(runif(6 * 6, -1, 1), nrow = 6)             # random 6*6 matrix; t(A) %*% A is the cov. matrix
large <- MASS::mvrnorm(10000, rep(0, 6), t(A) %*% A)   # sample 10000 replicates
colnames(large) <- paste("Var.", 1:6)                  # set column names

# plot with corrmorant() using the "binned" style setting
corrmorant(large, style = "binned")
```
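The binning idea behind the `"binned"` style can be sketched in base R. This is only an illustration of the general technique, not corrmorant's internal code: `bin_counts()` is a hypothetical helper, and the simulated `x` and `y` are made up for the example.

```r
# Hypothetical helper (not part of corrmorant): count the observations that
# fall into each cell of a regular bins-by-bins grid.
bin_counts <- function(x, y, bins = 10) {
  # cut() assigns every value to one of `bins` equal-width intervals
  x_bin <- cut(x, breaks = bins, labels = FALSE)
  y_bin <- cut(y, breaks = bins, labels = FALSE)
  # tabulate the grid cells and keep only the occupied ones
  counts <- as.data.frame(table(x_bin = x_bin, y_bin = y_bin),
                          stringsAsFactors = FALSE)
  counts[counts$Freq > 0, ]
}

set.seed(42)
x <- rnorm(10000)
y <- 0.7 * x + rnorm(10000, sd = 0.5)

binned <- bin_counts(x, y)
nrow(binned)       # number of occupied cells, at most 10 * 10
sum(binned$Freq)   # every observation is counted exactly once
```

Drawing one point per occupied cell, with its size mapped to `Freq`, needs at most 100 points per panel instead of 10000, which is where the speed-up comes from.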
The `corrmorant()` interface is kept simple on purpose, as this function is mainly intended for fast visual checks of datasets. If you want to modify single elements of the plot, it is almost always preferable to use `ggcorrm()` instead.

The `ggcorrm()` function lets you create plots of correlation matrices layer by layer using a slightly extended regular ggplot2 syntax. If `ggcorrm()` is called on a matrix or data.frame, it automatically rearranges it into the format required for plotting, treating all numeric variables as columns/rows of the correlation matrix and retaining all other data as additional columns of the plotting dataset that can be used for grouping.

You can add new layers and other ggplot2 plot elements (e.g. scales, themes etc.) to a `ggcorrm()` object as if it were a regular `ggplot()` call. However, adding the same layers to all panels of a correlation matrix is rarely useful:
``` {r}
# create subset of drosera dataset for a single species
capensis <- filter(drosera, species == "capensis")

# add scatterplot to all panels
ggcorrm(capensis) +
  geom_point(alpha = 0.5)
```
Usually, you will want to display different graphical elements in the plot diagonal and in the lower and upper triangular panels of the correlation matrix. This can be achieved with the three corrmorant selector functions `lotri()`, `utri()` and `dia()`, which can be used to modify ggplot layers (i.e. the output of calls to geoms, stats or the `layer()` function) and direct them only to the lower or upper triangle or the plot diagonal of a `ggcorrm()` plot, respectively. To use them, you simply have to wrap the ggplot layer in one of the selector functions:
``` {r}
ggcorrm(capensis) +
  lotri(geom_point(alpha = 0.5))
```
Internally, the selector functions modify the data passed to the layer to make sure only a subset for the desired panels is passed on.
In addition to the default geoms and stats available in ggplot2, corrmorant introduces a series of new geoms and stats that simplify the creation of visually appealing correlation matrices. These can be grouped into a) data summaries and variable-specific information on the plot diagonal and b) data display and summaries in the off-diagonal panels. To simplify their usage, all corrmorant stats can be called directly by prefixing their name with `dia_`, `lotri_` or `utri_` instead of calling them via the corresponding selector functions. The prefixed versions differ from calls to the underlying stats by having a set of reasonable standard values and are preferable in most cases.
| Panels       | Name                                       | Description                                                      |
|--------------|--------------------------------------------|------------------------------------------------------------------|
| Diagonal     | `dia_names()`                              | display text labels for variable names on the plot diagonal      |
|              | `dia_density()`                            | display density plots in diagonal panels                         |
|              | `dia_histogram()`                          | display histograms in diagonal panels                            |
|              | `dia_freqpoly()`                           | display frequency polygons in diagonal panels                    |
| Off-diagonal | `lotri_corrtext()`, `utri_corrtext()`      | place text labels indicating correlation strength                |
|              | `lotri_funtext()`, `utri_funtext()`        | create text labels with the output of user-defined functions     |
|              | `lotri_heatmap()`, `utri_heatmap()`        | add correlation heatmap                                          |
|              | `lotri_heatpoint()`, `utri_heatpoint()`    | add symbols whose size and color indicate correlation strength   |
|              | `lotri_heatcircle()`, `utri_heatcircle()`  | add circles whose area scales with correlation strength          |
Together with regular ggplot2 geoms and stats called via `lotri()`, `utri()` and `dia()`, these new functions can be used as building blocks for more complex correlation matrix plots.

For example, `corrmorant(drosera, style = "light")` can be recreated by the following code:
``` {r}
ggcorrm(drosera) +
  lotri(geom_point(alpha = 0.5)) +           # scatterplots in lower triangle
  utri_corrtext() +                          # correlation coef. in upper triangle
  dia_names(size = 3) +                      # variable names in plot diagonal
  dia_density(fill = "grey80", color = 1)    # density plots in plot diagonal
```
Often, there are groups in the data that differ considerably in the relationships between their variables. For example, the `drosera` dataset contains data from three different sundew species. Grouped corrmorant plots can easily be achieved by mapping the `fill` and/or `color` aesthetics to the grouping variable, either for the entire plot or in the calls of the respective geoms/stats.
``` {r}
ggcorrm(drosera, mapping = aes(color = species, fill = species)) +  # plot-level aesthetics
  lotri(geom_point(alpha = 0.5)) +         # scatterplot in lower triangle
  utri_corrtext(squeeze = 0.5) +           # correlation coef. in upper triangle
  dia_names() +                            # variable names in plot diagonal
  dia_density(color = 1, alpha = 0.5) +    # density plot in plot diagonal
  scale_x_log10() +                        # log-transformed x axis
  scale_y_log10()                          # log-transformed y axis
```
corrmorant passes an additional data column with information about the correlation strength to the layers. This column is called `.corr` and can be used as an aesthetic at the plot and layer level. Different correlation metrics can be specified via the `corr_method` argument of `ggcorrm()`, and if desired the calculation can also be performed on sub-groups via the `corr_group` argument. `scale_color_corr()` and `scale_fill_corr()` provide color and fill scales that are appropriate for displaying correlations.
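To make the difference between correlation metrics and grouped correlations concrete, here is a small base R sketch. It calls `stats::cor()` directly rather than going through corrmorant, and it uses the built-in `iris` data as a stand-in for `drosera`, so both the dataset and the variable names are assumptions made for illustration only.

```r
# built-in iris data used as a stand-in dataset (assumption for illustration)
x <- iris$Sepal.Length
y <- iris$Petal.Length

# one coefficient per method accepted by stats::cor()
all_methods <- sapply(c("pearson", "kendall", "spearman"),
                      function(m) cor(x, y, method = m))
all_methods

# Kendall's tau computed separately within each species,
# analogous to computing the correlation on sub-groups
by_species <- sapply(split(iris, iris$Species),
                     function(d) cor(d$Sepal.Length, d$Petal.Length,
                                     method = "kendall"))
by_species
```

The pooled and the within-group coefficients can differ substantially, which is why computing correlations per group matters for datasets that mix several species.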
The following plot displays a subset of the `large` dataset created above, colored by Kendall's tau:
``` {r}
ggcorrm(large[1:50, ],                                # plot first 50 rows of the large dataset
        mapping = aes(color = .corr, fill = .corr),   # plot-level aes by .corr
        corr_method = "kendall") +
  lotri(geom_smooth(method = "lm")) +                 # regression line in lower triangle
  lotri(geom_point(alpha = 0.3)) +                    # scatterplot in lower triangle
  utri_corrtext() +                                   # correlation coef. in upper triangle
  dia_names() +                                       # variable names in plot diagonal
  dia_histogram(color = 1, alpha = 0.5) +             # histograms in plot diagonal
  scale_color_corr(aesthetics = c("colour", "fill"))  # colour/fill scale for correlations
```
Often it is interesting to compute certain statistics for all combinations of variables in the scatterplot matrix. Examples include linear model slopes, elevation and slope from an SMA model, explained variance etc. To achieve this, `lotri_funtext()` and `utri_funtext()` compute pairwise data summaries based on user-specified functions.

The following example shows the use of `utri_funtext()` to compute the number of complete pairs of variables in a dataset with missing data.
``` {r}
# add 10% missing data to the capensis dataset
set.seed(1)
capensis_missing <- capensis %>%
  mutate_if(is.numeric, ~ifelse(runif(length(.x)) < 0.9, .x, NA))

# define function that computes number of complete pairs
complete <- function(x, y) {
  cbind(x, y) %>% na.omit() %>% nrow() %>% paste("n =", .)
}

# add number of complete cases over a regular corrmorant plot
corrmorant(capensis_missing) +
  utri_funtext(fun = complete, vjust = 2.5)
```
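Other pairwise summaries follow the same pattern: a function that takes the two variables and returns a single label. As a sketch of the "linear model slope" case mentioned above, the hypothetical helper `lm_slope()` below (not part of corrmorant) returns the slope of a simple regression as text. It can be tried on its own and, assuming the same `fun` interface as in the example above, could then be passed as `fun = lm_slope`.

```r
# Hypothetical summary function (not part of corrmorant):
# slope of a simple linear regression of y on x, formatted as a label.
lm_slope <- function(x, y) {
  fit <- lm(y ~ x)   # with default options, lm() drops incomplete pairs
  paste("b =", round(unname(coef(fit)[2]), 2))
}

# quick check on data with a known slope of 2
x <- 1:10
y <- 2 * x + 1
lm_slope(x, y)
```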
The `labels` argument of `corrmorant()` and `ggcorrm()` makes it easy to replace the original column names of the numeric columns in a dataset with a set of labels, either by specifying them as a character vector or by supplying a function that returns a character vector of the desired length.

For example, it is easy to convert the variable names in the `capensis` dataset to a more readable format:
``` {r}
# function to remove underscores and capitalize first letter
make_nice_labels <- function(x) {
  x <- gsub("_", " ", x)
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

# create plot
corrmorant(capensis, labels = make_nice_labels)
```
Labels can also be supplied as a character vector:
``` {r}
# build a character vector of labels from the numeric column names and add units
labels <- make_nice_labels(names(capensis)[-c(1:2)]) %>% paste("(mm)")

# create plot
corrmorant(capensis, labels = labels)
```
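To see what such a labelling function returns before using it in a plot, it can be called on a plain character vector. The sketch below repeats the definition of `make_nice_labels()` so that it runs on its own; the names in `example_names` are made up for illustration and are not necessarily the column names of `capensis`.

```r
# same helper as above, repeated so that this chunk is self-contained
make_nice_labels <- function(x) {
  x <- gsub("_", " ", x)                                    # underscores to spaces
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))  # capitalize first letter
}

# made-up variable names, used only for illustration
example_names <- c("petiole_length", "blade_width")

nice <- make_nice_labels(example_names)
nice                  # one label per input name
paste(nice, "(mm)")   # the same labels with a unit appended
```

Whichever form is used, the result must contain exactly one label per numeric column.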
Labels containing expressions can be parsed for proper display. The `parse` argument can be set to `TRUE` to parse labels in `dia_names()`. The `ggcorrm()` plot grid is created via facets. A new labeller function (like the built-in `label_parsed()`) can be passed to the grid via the `facet_arg` argument of `ggcorrm()`:
``` {r}
# using facet_arg and dia_names(parse = TRUE) to parse variable names
ggcorrm(large[1:50, ],
        labels = c("gamma", "beta", "alpha[3]", "phi^2",
                   "B~('cm'^2)", "sqrt(pi^2)"),
        facet_arg = list(labeller = "label_parsed"),
        bg_dia = "grey30") +
  utri_heatcircle(col = 1, size = .3) +
  lotri_corrtext() +
  dia_names(y_pos = 0.5, parse = TRUE, col = "white") +
  scale_fill_corr()
```
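Labels that are meant to be parsed have to be syntactically valid R expressions. A quick way to check this before plotting is to run them through base R's `parse()`. The sketch below is independent of corrmorant; the malformed label `"alpha[3"` is a made-up counterexample.

```r
labels <- c("gamma", "beta", "alpha[3]", "phi^2", "B~('cm'^2)", "sqrt(pi^2)")

# parse() returns one expression per label if all of them are valid
parsed <- parse(text = labels)
length(parsed)

# check labels one by one; an unbalanced bracket makes parse() fail
is_valid <- vapply(
  c("alpha[3]", "alpha[3"),
  function(l) !inherits(try(parse(text = l), silent = TRUE), "try-error"),
  logical(1)
)
is_valid
```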
Under the hood, `ggcorrm()` arranges the facets of the correlation matrix using `ggplot2::facet_grid()`. To be able to arrange a dataset across the facets, `ggcorrm()` first converts the data into a long-table format suitable for plotting.

This is achieved by calling a `tidy_corrm` method, which automatically rearranges the columns in a suitable way and returns them in tidy long form as a `tidy_corrm` object (a special type of tibble):
``` {r}
tidy_corrm(drosera)
```
As seen in this example, a number of standard columns describe the content of the numeric variables in the facets (`var_x:corr_group`), while all categorical variables are carried over with the dataset in their original form and can be used for grouping (in this case only `species`). Additional columns can be created from these columns using the `mutates` argument (see below).

The content of the columns in a `tidy_corrm` object is as follows:
| Column             | Content                                                                                                                                                                                          |
|--------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `var_x`            | Name of the variable on the x-axis in the order of appearance in the raw data (ordered factor).                                                                                                  |
| `var_y`            | Name of the variable on the y-axis in the order of appearance in the raw data (ordered factor).                                                                                                  |
| `x`                | Data of the variable on the x-axis (numeric).                                                                                                                                                    |
| `y`                | Data of the variable on the y-axis (numeric).                                                                                                                                                    |
| `type`             | Type of panel (character, `"upper"`, `"lower"` or `"diag"`).                                                                                                                                     |
| `.corr`            | Correlation between x and y for the respective panel/group, computed with `stats::cor()` using the method specified by `corr_method` and optionally within the groups specified with `corr_group` (numeric). |
| `corr_group`       | Grouping variable for `.corr` (1 for all observations if no groups are specified).                                                                                                               |
| Additional columns | All other columns specified in the dataset and/or created via `mutates`.                                                                                                                         |
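The layout described in this table can be imitated in a few lines of base R. The sketch below is a simplified illustration, not corrmorant's implementation: `to_long()` is a hypothetical helper, it omits the `corr_group` column, it uses the built-in `iris` data, and the rule that decides which panels count as `"upper"` and `"lower"` is an assumption.

```r
# Hypothetical base R sketch of the long format (not corrmorant's own code)
to_long <- function(data) {
  is_num <- vapply(data, is.numeric, logical(1))
  num    <- data[is_num]     # variables that span the correlation matrix
  other  <- data[!is_num]    # carried over unchanged, e.g. for grouping
  vars   <- names(num)
  pairs  <- expand.grid(i = seq_along(vars), j = seq_along(vars))

  pieces <- lapply(seq_len(nrow(pairs)), function(k) {
    i <- pairs$i[k]
    j <- pairs$j[k]
    out <- data.frame(
      var_x = factor(vars[i], levels = vars, ordered = TRUE),
      var_y = factor(vars[j], levels = vars, ordered = TRUE),
      x     = num[[i]],
      y     = num[[j]],
      # assumed convention: panels right of the diagonal are "upper"
      type  = if (i == j) "diag" else if (i > j) "upper" else "lower",
      .corr = cor(num[[i]], num[[j]], use = "pairwise.complete.obs"),
      stringsAsFactors = FALSE
    )
    cbind(out, other)
  })
  do.call(rbind, pieces)   # one block of rows per panel
}

long <- to_long(iris)
dim(long)          # rows = number of panels * number of observations
table(long$type)
```

Every observation appears once per panel, so a dataset with `p` numeric variables and `n` rows yields `p * p * n` rows in long form.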
It is also possible to call `ggcorrm()` directly on a `tidy_corrm` object. In this case, it is plotted directly without reshaping the data, and arguments that `ggcorrm()` passes to `tidy_corrm()` are ignored.
`tidy_corrm()` provides a set of options;