knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
(This publication is still a work in progress)
"What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather ... the revelation of the complex."
- Edward R. Tufte
The ggstatsplot
package is an opinionated collection of plots made with
ggplot2
and is designed for exploratory data analysis or for producing
publication-ready statistical graphics. All plots share an underlying principle
of displaying information-rich plots with all necessary statistical details
included in the plots themselves. Although the plots produced by ggstatsplot
are still ggplot
objects and can thus be further modified using ggplot2
commands, there is a limit to how many such modifications can be made. That is,
it is less flexible than ggplot2
, but that's a feature and not a bug. The
original intent behind this package is to offload struggles associated with
constructing the plot and focus more on the interpretation of that data
displayed in the plot.
Graphical perception involves visual decoding of the encoded information in
graphs. ggstatsplot
incorporates the paradigm proposed in Cleveland (1985,
Chapter 4) to facilitate making visual judgments about quantitative information
effortless and almost instantaneous. Based on experiments, Cleveland proposes
that there are ten elementary graphical-perception tasks that we perform to
visually decode quantitative information in graphs (organized from most to least
accurate; Cleveland, 1985, p.254)-
So the key principle of Cleveland's paradigm for data display is-
"We should encode data on a graph so that the visual decoding involves [graphical-perception] tasks as high in the ordering as possible."
For example, decoding the data point values in ggbetweenstats
requires
position judgments along a common scale (Figure-1):
# for reproducibility set.seed(123) # plot ggstatsplot::ggbetweenstats( data = dplyr::filter( .data = ggstatsplot::movies_long, genre %in% c("Action", "Action Comedy", "Action Drama", "Comedy") ), x = genre, y = rating, title = "Figure-1: IMDB rating by film genre", xlab = "Genre", ylab = "IMDB rating (average)", pairwise.comparisons = TRUE, p.adjust.method = "bonferroni", ggtheme = hrbrthemes::theme_ipsum_tw(), ggstatsplot.layer = FALSE, outlier.tagging = TRUE, outlier.label = title, messages = FALSE )
There are few instances where ggstatsplot
diverges from recommendations made
in Cleveland's paradigm:
ggstatsplot
uses pie charts (see Figure-2)
which rely on angle judgments, which are less accurate (as compared to bar
graphs, e.g., which require position judgments). This shortcoming is assuaged
to some degree by using plenty of labels that describe percentages for all
slices. This makes angle judgment unnecessary and pre-vacates any concerns about
inaccurate judgments about percentages. Additionally, it also provides
alternative function to ggpiestats
for working with categorical variables:
ggbarstats
.# for reproducibility set.seed(123) # plot ggstatsplot::ggpiestats( data = ggstatsplot::movies_long, main = genre, condition = mpaa, title = "Figure-2: Distribution of MPAA ratings by film genre", legend.title = "layout", caption = substitute(paste( italic("MPAA"), ": Motion Picture Association of America" )), palette = "Paired", messages = FALSE )
grouped_
variants of the function (see
Figure-3). Note that the range for Y-axes are no longer the same across
juxtaposed subplots and so visually comparing the data becomes difficult. On the
other hand, in the superposed plot, all data have the same range and coloring
different parts makes the visual discrimination of different components of the
data, and their comparison, easier. But the goal of grouped_
variants of
functions is to not only show different aspects of the data but also to run
statistical tests and showing detailed results for all aspects of the data in a
superposed plot is difficult. Therefore, this is a compromise ggstatsplot
is
comfortable with, at least to produce plots for quick exploration of different
aspects of the data. # for reproducibility set.seed(123) # plot ggstatsplot::combine_plots( # plot 1: superposition ggplot2::ggplot( data = dplyr::filter(ggstatsplot::movies_long, genre == "Comedy" | genre == "Drama"), mapping = ggplot2::aes( x = length, y = rating, color = genre ) ) + ggplot2::geom_jitter(size = 3, alpha = 0.5) + ggplot2::geom_smooth(method = "lm") + ggplot2::labs(title = "superposition (recommended in Cleveland's paradigm)") + ggstatsplot::theme_ggstatsplot(), # plot 2: juxtaposition ggstatsplot::grouped_ggscatterstats( data = dplyr::filter(ggstatsplot::movies_long, genre == "Comedy" | genre == "Drama"), x = length, y = rating, grouping.var = genre, marginal = FALSE, messages = FALSE, title.prefix = "Genre", title.text = "juxtaposition (`ggstatsplot` implementation in `grouped_` functions)", title.size = 12 ), # combine for comparison title.text = "Two ways to compare different aspects of data", nrow = 2, labels = c("(a)", "(b)") )
The grouped_
plots follow the Shrink Principle (Tufte, 2001, p.166-7) for
high-information graphics, which dictates that the data density and the size of
the data matrix can be maximized to exploit maximum resolution of the available
data-display technology. Given the large maximum resolution afforded by most
computer monitors today, saving grouped_
plots with appropriate resolution
ensures no loss in legibility with reduced graphics area.
Graphical excellence consists of communicating complex ideas with clarity and in a way that the viewer understands the greatest number of ideas in a short amount of time all the while not quoting the data out of context. The package follows the principles for graphical integrity (as outlined in Tufte, 2001):
The physical representation of numbers is proportional to the numerical
quantities they represent (e.g., Figure-1 and Figure-2 show how means (in
ggbetweenstats
) or percentages (ggpiestats
) are proportional to the
vertical distance or the area, respectively).
All important events in the data have clear, detailed, and thorough labeling
(e.g., Figure-1 plot shows how ggbetweenstats
labels means, sample size
information, outliers, and pairwise comparisons; same can be appreciated for
ggpiestats
in Figure-2 and gghistostats
in Figure-5). Note that data
labels in the data region are designed in a way that they don't interfere
with our ability to assess the overall pattern of the data (Cleveland, 1985;
p.44-45). This is achieved by using ggrepel
package to place labels in a
way that reduces their visual prominence.
None of the plots have design variation (e.g., abrupt change in scales) over the surface of a same graphic because this can lead to a false impression about variation in data.
The number of information-carrying dimensions never exceed the number of dimensions in the data (e.g., using area to show one-dimensional data).
All plots are designed to have no chartjunk (like moirĂ© vibrations, fake perspective, dark grid lines, etc.) (Tufte, 2001, Chapter 5).
There are some instances where ggstatsplot
graphs don't follow principles of
clean graphics, as formulated in the Tufte theory of data graphics (Tufte, 2001,
Chapter 4). The theory has four key principles:
In particular, default plots in ggstatsplot
can sometimes violate one of the
principles from 2-4. According to these principles, every bit of ink should have
reason for its inclusion in the graphic and should convey some new information
to the viewer. If not, such ink should be removed. One instance of this is
bilateral symmetry of data measures. For example, in Figure-1, we can see that
both the box and violin plots are mirrored, which consumes twice the space in
the graphic without adding any new information. But this redundancy is tolerated
for the sake of beauty that such symmetrical shapes can bring to the graphic.
Even Tufte admits that efficiency is but one consideration in the design of
statistical graphics (Tufte, 2001, p. 137). Additionally, these principles were
formulated in an era in which computer graphics had yet to revolutionize the
ease with which graphics could be produced and thus some of the concerns about
minimizing data-ink for easier production of graphics are not as relevant as
they were.
As an extension of ggplot2
, ggstatsplot
has the same expectations about the
structure of the data. More specifically,
The data
should be an object of class data.frame
(a tibble
dataframe
will also work).
The data should be organized following the principles of tidy data, which specify how statistical structure of a data frame (variables and observations) should be mapped to physical structure (columns and rows). More specifically, tidy data means all variables have their own columns and each row corresponds to a unique observation (Wickham, 2014).
All ggstatsplot
functions remove NA
s from variables of interest (similar
to ggplot2
; Wickham, 2016, p.207) in the data and display total sample
size (n) in the subtitle to inform the user/reader about the number of
observations included for both the statistical analysis and the
visualization. But, when sample sizes differ across tests in the same
function, ggstatsplot
makes an effort to inform the user of this aspect.
For example, ggcorrmat
features several correlation test pairs and,
depending on variables in a given pair, the sample sizes may vary
(Figure-4).
# creating a new dataset without any NAs in variables of interest msleep_no_na <- dplyr::filter( .data = ggplot2::msleep, !is.na(sleep_rem), !is.na(awake), !is.na(brainwt), !is.na(bodywt) ) # variable names vector var_names <- c( "REM sleep", "time awake", "brain weight", "body weight" ) # combining two plots ggstatsplot::combine_plots( # plot *without* any NAs ggstatsplot::ggcorrmat( data = msleep_no_na, corr.method = "kendall", sig.level = 0.001, p.adjust.method = "holm", cor.vars = c(sleep_rem, awake:bodywt), cor.vars.names = var_names, matrix.type = "upper", colors = c("#B2182B", "white", "#4D4D4D"), title = "Correlalogram for mammals sleep dataset", subtitle = "sleep units: hours; weight units: kilograms", messages = FALSE ), # plot *with* NAs ggstatsplot::ggcorrmat( data = ggplot2::msleep, corr.method = "kendall", sig.level = 0.001, p.adjust.method = "holm", cor.vars = c(sleep_rem, awake:bodywt), cor.vars.names = var_names, matrix.type = "upper", colors = c("#B2182B", "white", "#4D4D4D"), title = "Correlalogram for mammals sleep dataset", subtitle = "sleep units: hours; weight units: kilograms", messages = FALSE ), labels = c("(a)", "(b)"), nrow = 1 )
Functions | Description | Parametric | Non-parametric | Robust | Bayes Factor
------- | ------------------ | ---- | ----- | ----| -----
ggbetweenstats
| Between group/condition comparisons | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggwithinstats
| Within group/condition comparisons | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
gghistostats
, ggdotplotstats
| Distribution of a numeric variable | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggcorrmat
| Correlation matrix | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\times$
ggscatterstats
| Correlation between two variables | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggpiestats
, ggbarstats
| Association between categorical variables | $\checkmark$ | NA
| NA
| $\checkmark$
ggpiestats
, ggbarstats
| Equal proportions for categorical variable levels | $\checkmark$ | NA
| NA
| $\times$
ggcoefstats
| Regression model coefficients | $\checkmark$ | $\times$| $\checkmark$ | $\times$
Functions | Type | Test | Effect size | 95% CI available?
----------- | ----------- | ------------------ | ------------------ | -----
ggbetweenstats
| Parametric | Student's and Welch's t-test | Cohen's d, Hedge's g | $\checkmark$
ggbetweenstats
| Parametric | Fisher's and Welch's one-way ANOVA | $$\eta^2, \eta^2_p, \omega^2, \omega^2_p$$ | $\checkmark$
ggbetweenstats
| Non-parametric | Mann-Whitney U-test | r | $\checkmark$
ggbetweenstats
| Non-parametric | Kruskal-Wallis Rank Sum Test | $$\eta^2_H$$ | $\checkmark$
ggbetweenstats
| Robust | Yuen's test for trimmed means | $$\xi$$ | $\checkmark$
ggbetweenstats
| Robust | Heteroscedastic one-way ANOVA for trimmed means | $$\xi$$ | $\checkmark$
ggwithinstats
| Parametric | Student's t-test | Cohen's d, Hedge's g | $\checkmark$
ggwithinstats
| Parametric | Fisher's one-way repeated measures ANOVA | $$\eta^2_p, \omega^2$$ | $\checkmark$
ggwithinstats
| Non-parametric | Wilcoxon signed-rank test | r | $\checkmark$
ggwithinstats
| Non-parametric | Friedman test | $$W_{Kendall}$$ | $\checkmark$
ggwithinstats
| Robust | Yuen's test on trimmed means for dependent samples | $$\xi$$ | $\checkmark$
ggwithinstats
| Robust | Heteroscedastic one-way repeated measures ANOVA for trimmed means | $\times$ | $\times$
ggpiestats
| Parametric | $$\text{Pearson's}~ \chi^2 ~\text{test}$$ | Cramer's V | $\checkmark$
ggpiestats
| Parametric | McNemar's test | Cohen's g | $\checkmark$
ggpiestats
| Parametric | One-sample proportion test | Cramer's V | $\checkmark$
ggscatterstats
/ggcorrmat
| Parametric | Pearson's r | r | $\checkmark$
ggscatterstats
/ggcorrmat
| Non-parametric | $$\text{Spearman's}~ \rho$$ | $$\rho$$ | $\checkmark$
ggscatterstats
/ggcorrmat
| Robust | Percentage bend correlation | r | $\checkmark$
gghistostats
/ggdotplotstats
| Parametric | One-sample t-test | Cohen's d, Hedge's g | $\checkmark$
gghistostats
| Non-parametric | One-sample Wilcoxon signed rank test | r | $\checkmark$
gghistostats
/ggdotplotstats
| Robust | One-sample percentile bootstrap | robust estimator | $\checkmark$
gghistostats
/ggdotplotstats
| Parametric | Regression models | $$\beta$$ | $\checkmark$
For the ggbetweenstats
function, the following post-hoc tests are available
for (adjusted) pairwise multiple comparisons:
Type | Design | Equal variance assumed? | Pairwise comparison test | p-value adjustment?
----------- | ----------- | --------- | ----------------------- | -----
Parametric | between-subjects | No | Games-Howell test | $\checkmark$
Parametric | between-subjects | Yes | Student's t-test | $\checkmark$
Parametric | within-subjects | NA
| Student's t-test | $\checkmark$
Non-parametric | between-subjects | No | Dwass-Steel-Crichtlow-Fligner test | $\checkmark$
Non-parametric | within-subjects | No | Durbin-Conover test | $\checkmark$
Robust | between-subjects | No | Yuen's trimmed means test | $\checkmark$
Robust | within-subjects | NA
| Yuen's trimmed means test | $\checkmark$
Bayes Factor | between-subjects | No | $\times$ | $\times$
Bayes Factor | between-subjects | Yes | $\times$ | $\times$
Bayes Factor | within-subjects | NA
| $\times$ | $\times$
Note-
NA
: not applicableOne of the important functions of a plot is to show the variation in the data, which comes in two forms:
ggstatsplot
, the actual variation in
measurements is shown by plotting a combination of (jittered) raw data
points with a boxplot laid on top (Figure-1) or a histogram (Figure-5). None
of the plots, where empirical distribution of the data is concerned, show
the sample standard deviation because they are poor at conveying information
about limits of the sample and presence of outliers (Cleveland, 1985,
p.220).# for reproducibility set.seed(123) # plot ggstatsplot::gghistostats( data = morley, x = Speed, test.value = 792, test.value.line = TRUE, xlab = "Speed of light (km/sec, with 299000 subtracted)", title = "Figure-5: Distribution of Speed of light", caption = "Note: Data collected across 5 experiments (20 measurements each)", messages = FALSE )
ggstatsplot
plots instead use 95% confidence intervals (e.g.,
Figure-6). This is because the interval formed by error bars correspond to
a 68% confidence interval, which is not a particularly interesting interval
(Cleveland, 1985, p.222-225).# for reproducibility set.seed(123) # creating model object mod <- lme4::lmer( formula = total.fruits ~ nutrient + rack + (nutrient | popu / gen), data = lme4::Arabidopsis ) # plot ggstatsplot::ggcoefstats( x = mod, p.kr = FALSE )
The default setting in ggstatsplot
is to produce plots with statistical
details included. Most often than not, the results are displayed as a subtitle
in the plot. Great care has been taken into which details are included in
statistical reporting and why.
With the exception of p-values, most statistics are rounded to two decimal places.
Default statistical tests:
Dealing with null results:
Avoiding the "p-value error":
The p-value indexes the probability that the researchers have falsely
rejected a true null hypothesis (Type I error, i.e.) and can rarely be
exactly 0. And yet over 97,000 manuscripts on Google Scholar report the
p-value to be p = 0.000
(Lilienfeld et al., 2015), putatively due to
relying on default computer outputs. All p-values displayed in
ggstatsplot
plots avoid this mistake. Anything less than p < 0.001
is
displayed as such (e.g, Figure-1). The package deems it unimportant how
infinitesimally small the p-values are and, instead, puts emphasis on the
effect size magnitudes and their 95% CIs.
Attempt has been made to make the application program interface (API) consistent enough that no struggle is expected while thinking about specifying function calls-
data
argument
must always be specified.ggstatsplot
functions consistently expect tidy/long form
data. x = "var1"
) and unquoted (x = var1
)
arguments.There are three main documents one can rely on to learn how to use
ggstatsplot
:
Presentation: The quickest (and the most fun) way to get an overview of the philosophy behind this package and the offered functionality is to go through the following slides: https://indrajeetpatil.github.io/ggstatsplot_slides/slides/ggstatsplot_presentation.html#1
Manual:
The CRAN
reference manual provides detailed documentation about arguments
for each function and examples:
https://cran.r-project.org/web/packages/ggstatsplot/ggstatsplot.pdf
README:
The GitHub README
document provides a quick summary of all available
functionality without going too much into details:
https://github.com/IndrajeetPatil/ggstatsplot/blob/master/README.md
Vignettes:
Vignettes contain probably the most detailed exposition. Every single
function in ggstatsplot
has an associated vignette which describes in
depth how to use the function and modify the defaults to customize the plot
to your liking. All these vignettes can be accessed from the package
website: https://indrajeetpatil.github.io/ggstatsplot/articles/
If you find any bugs or have any suggestions/remarks, please file an issue on
GitHub
repository for this package:
https://github.com/IndrajeetPatil/ggstatsplot/issues
Summarizing session information for reproducibility.
options(width = 200) devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.