library(tidyverse) library(knitr) library(kableExtra) knitr::opts_chunk$set(message = FALSE, warning = FALSE) # data wide_df <- tribble(~restaurant, ~food_rating, ~service_rating, ~ambience_rating, ~overall_rating, "R1", 4.3, 3.4, 4.3, 4.9, "R2", 4.3, 5.0, 4.5, 4.4, "R3", 3.2, 4.4, 5.0, 3.0, "R4", 2.3, 4.6, 4.4, 3.8, "R5", 3.9, 4.8, 4.2, 3.3) long_df <- pivot_longer(wide_df, contains("rating"), names_to = "rating_type", names_pattern = "(.+)_rating", values_to = "rating") # abund_df <- tribble(~Site, ~`Soil dry mass`, ~`Moss`, ~Alopcune, ~Arctlute, ~Pardpull, ~Trocterr, ~Zoraspin, # 1, 2.3321, 3.0445, 10, 0, 45, 57, 4, # 2, 3.0493, 1.0986, 2, 0, 37, 65, 9, # 3, 2.5572, 2.3979, 20, 0, 45, 66, 1) # table_styling <- function(df, ...) { kable(df, ..., booktabs = TRUE, linesep = "") %>% kable_classic() } # # data("spider", package = "mvabund") data(SnowGR, package = "mosaicData")
The layered grammar of graphics [@Wickham2010-kt], implemented as the ggplot2
package [@Wickham2016] in the statistical language R [@rstats], is a powerful and popular tool to create versatile statistical graphics. However, this graphical system requires input data to be organised in a manner that a data column is mapped to an aesthetic element (e.g., x-coordinate, y-coordinate, color, size), which creates friction in constructing plots with an aesthetic element that spans multiple columns in the original data, as it requires the user to reorganise the data.
<!-- FKCH: Should we use the words long and wide format data frames here, or is that wrong? I only ask because I (maybe others?) think about ggplot requiring long format data, but we want to wide formatting?
XL: Regarding Francis' question, I don't know if the long format or wide format data are well defined in the literature. If it is, we definitely can mention it. If not, do we need to define it? --commented by XL
FKCH: @Emi you know the literature best. Can you make a decision on this? Thanks!
ET: I don't think we should mention long/wide -- there's more nuance than that. -->
The ggmatplot
, built upon ggplot2
, is an R-package that allows quick plotting across the columns of matrices or data with the result returned as a ggplot
object. The package is inspired by the function matplot()
in the core R graphics
system -- as such, ggmatplot
may be considered as a ggplot
version of matplot
, with the benefits of customising the plots as any other ggplot
objects via ggplot2
functions, as well as offering several other plotting types that are not immediately available from matplot
directly e.g., comparative violin plots. The ggmatplot
package is available on the Comprehensive R Archive Network (CRAN) with the latest developmental source code available at \url{https://github.com/xuan-liang/ggmatplot}.
Input data to construct plots with ggplot2
require data to be organised in a manner that maps data columns to aesthetic elements. This generally works well where data is tidied in a long rectangular form, often referred to as "tidy data" [@Wickham2014-gy], where each row represents an observational unit, each column represents a variable, and each cell represents a value. However, in some cases what constitutes a variable (or observational unit), and hence a column (or row), in tidy data can be dependent upon interpretation or downstream interest (e.g. Tables \ref{tab:tab1} and \ref{tab:tab2} can both be considered as tidy data), but a clear violation of tidy data principles is when the column names contain data values, e.g. Table \ref{tab:tab3} contains months of the year across a number of column names.
table_styling(wide_df, col.names = c("Restaurant", "Food", "Service", "Ambience", "Overall"), caption = 'Restaurant rating data in "tidy" form. The first column shows the restaurant ID, and the next four columns show the average ratings (out of 5) for food, service, ambience and overall, respectively.') %>% add_header_above(c(" ", "Average rating" = 4))
table_styling(long_df, col.names = c("Restauant", "Rating type", "Average rating"), caption = 'Another form for the restaurant rating data in Table \\ref{tab:tab1}. In Wickham (2014), this format is called the "molten" data.')
head(SnowGR[, 1:11]) %>% mutate(Jan = as.character(Jan)) %>% replace_na(list(Jan = ".")) %>% table_styling(caption = 'The first 6 rows and 11 columns of the snowfall data for Grand Rapids, Michigan in the R pacakge \\texttt{mosaicData} (Prium, Kaplan \\& Horton, 2021).', align = "r")
The organisation of the data is largely dependent on the subsequent analysis, and there is no one correct way to do this. Some forms of multivariate data, e.g. Table \ref{tab:tab3}, are prevalent in many scientific fields because it aligns with the input data for a particular modelling software, and/or the format is more convenient for input or view of the data in spreadsheet format (say). Unfortunately, this format is not consistent with the required format for ggplot2
, and consequently plotting with ggplot2
interrupts the workflow of a user that is trying to quickly visualise these types of data (as part of their exploratory data analysis, for example). The ggmatplot
R-package seeks to provide a solution to this common friction.
In this section, we demonstrate the use of the ggmatplot
package and contrast the specification with ggplot2
after data wrangling using dplyr
and tidyr
[@Wickham2019]. We will use the example data in Tables \ref{tab:tab1} and \ref{tab:tab3}, which are stored in the objects wide_df
and SnowGR
, respectively.
The code below constructs a line plot (superimposed with a point) of the various types (food, service and ambience) of ratings, contained in columns 2 to 4 of wide_df
, against the overall rating in column 5 of wide_df
as shown in Figure 1.
library(ggmatplot) ggmatplot(x = wide_df[, 5], y = wide_df[, 2:4], plot_type = "both", xlab = "Overall rating", ylab = "Rating", legend_title = "Type")
In contrast to the above, using ggplot2
alone to obtain the same result as Figure 1 requires the data be wrangled to a long form first before plotting. This is exemplified in the code below, from which we identify a small, but noticeable, friction to the workflow for the practitioner that is looking to promptly explore their data.
library(ggplot2) library(tidyr) # or library(tidyverse) wide_df %>% select(contains("rating")) %>% pivot_longer(-overall_rating, names_to = "rating_type", values_to = "rating") %>% mutate(rating_type = fct_inorder(rating_type)) %>% ggplot(aes(x = overall_rating, y = rating, color = rating_type)) + geom_point(aes(shape = rating_type)) + geom_line(aes(group = rating_type, linetype = rating_type)) + labs(x = "Overall rating", y = "Rating", color = "Type", linetype = "Type", shape = "Type")
The example code draws the boxplot of each column of amount of snowfall across months in the SnowGR
data, as presented in Figure 2. As the resulting object is a ggplot
object, the user can leverage the ggplot
functions to modify the output (e.g., addition of a title).
library(ggmatplot) ggmatplot(x = SnowGR[, 3:14], plot_type = "boxplot", xlab = "Month", ylab = "Snowfall") + ggtitle("Grand Rapids, Michigan, 1893-2011")
The equivalent code to produce Figure 2 without using ggmatplot
is given below. Again, we observe a slight but non-negligible friction in putting the data in the right format prior to plotting. The original wide data format as seen in Table \ref{tab:tab3} is common in the environmental sciences, among other disciplines, and thus an analyst who has to repeat these tasks can benefit from a quick approach as ggmatplot
offers.
library(ggplot2) library(tidyr) library(forcats) # or library(tidyverse) SnowGR %>% pivot_longer(Jul:Jun, names_to = "Month", values_to = "Snowfall") %>% mutate(Month = fct_inorder(Month)) %>% ggplot(aes(Month, Snowfall)) + geom_boxplot(alpha = 0.5) + ggtitle("Grand Rapids, Michigan, 1893-2011")
The ggmatplot
R-package provides a solution to a common friction encountered when wanting to quickly plot multivariate data, where the primary interest is mapping the column names as an aesthetic element. While an important start, we also acknowledge that solution provided is a recipe-driven approach, where the user can only produce plot types as many there are included in the plot_type
option. Future developments of the package could benefit from using a grammar approach, like in @Wilkinson2005-oz and @Wickham2010-kt, where plot types can be extensible. The latest developmental source code can be found at \url{https://github.com/xuan-liang/ggmatplot}. Further examples can be found at \url{https://xuan-liang.github.io/ggmatplot/}.
FKCH was supported by an Australian Research Council Discovery Fellowship DE200100435.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.