knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

(This publication is still a work in progress)

"What is to be sought in designs for the display of information is the clear portrayal of complexity. Not the complication of the simple; rather ... the revelation of the complex."
- Edward R. Tufte

Introduction

The ggstatsplot package is an opinionated collection of plots made with ggplot2, designed for exploratory data analysis and for producing publication-ready statistical graphics. All plots share one underlying principle: they are information-rich and include all necessary statistical details in the plot itself. Although the plots produced by ggstatsplot are still ggplot objects and can thus be further modified using ggplot2 commands, there is a limit to how many such modifications can be made. That is, the package is less flexible than ggplot2, but that's a feature and not a bug. The original intent behind this package is to offload the struggles associated with constructing the plot so that the user can focus on interpreting the data displayed in it.
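
As a rough illustration of the point that the output remains a ggplot object, the sketch below builds a default plot and then modifies it with ordinary ggplot2 functions. It assumes that the ggstatsplot and ggplot2 packages are installed; the `iris` dataset and the caption text are placeholders chosen only for this example.

```r
# assumes the ggstatsplot and ggplot2 packages are installed
set.seed(123)

# a default plot; `iris` is used purely for illustration
p <- ggstatsplot::ggbetweenstats(
  data = iris,
  x = Species,
  y = Sepal.Length
)

# the result is still a ggplot object, so ggplot2 functions can be added to it
p_modified <- p +
  ggplot2::labs(caption = "Data: iris (illustrative)") +
  ggplot2::theme(legend.position = "none")

# check the class of the modified plot
inherits(p_modified, "ggplot")
```

Modifications of this kind (labels, themes) tend to be safe; changes that depend on the internal layer structure of the plot are where the flexibility limit mentioned above is likely to show up.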

Graphical perception

Graphical perception involves visual decoding of the information encoded in graphs. ggstatsplot incorporates the paradigm proposed in Cleveland (1985, Chapter 4) so that visual judgments about quantitative information are effortless and almost instantaneous. Based on experiments, Cleveland proposes that there are ten elementary graphical-perception tasks that we perform to visually decode quantitative information in graphs, ordered from most to least accurate (Cleveland, 1985, p. 254).

The key principle of Cleveland's paradigm for data display is:

"We should encode data on a graph so that the visual decoding involves [graphical-perception] tasks as high in the ordering as possible."

For example, decoding the data point values in ggbetweenstats requires position judgments along a common scale (Figure-1):

# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggbetweenstats(
  data = dplyr::filter(
    .data = ggstatsplot::movies_long,
    genre %in% c("Action", "Action Comedy", "Action Drama", "Comedy")
  ),
  x = genre,
  y = rating,
  title = "Figure-1: IMDB rating by film genre",
  xlab = "Genre",
  ylab = "IMDB rating (average)",
  pairwise.comparisons = TRUE,
  p.adjust.method = "bonferroni",
  ggtheme = hrbrthemes::theme_ipsum_tw(),
  ggstatsplot.layer = FALSE,
  outlier.tagging = TRUE,
  outlier.label = title,
  messages = FALSE
)

There are a few instances where ggstatsplot diverges from the recommendations made in Cleveland's paradigm:

# for reproducibility
set.seed(123)

# plot
ggstatsplot::ggpiestats(
  data = ggstatsplot::movies_long,
  main = genre,
  condition = mpaa,
  title = "Figure-2: Distribution of MPAA ratings by film genre",
  legend.title = "layout",
  caption = substitute(paste(
    italic("MPAA"), ": Motion Picture Association of America"
  )),
  palette = "Paired",
  messages = FALSE
)

# for reproducibility
set.seed(123)

# plot
ggstatsplot::combine_plots(
  # plot 1: superposition
  ggplot2::ggplot(
    data = dplyr::filter(ggstatsplot::movies_long, genre == "Comedy" |
      genre == "Drama"),
    mapping = ggplot2::aes(
      x = length,
      y = rating,
      color = genre
    )
  ) +
    ggplot2::geom_jitter(size = 3, alpha = 0.5) +
    ggplot2::geom_smooth(method = "lm") +
    ggplot2::labs(title = "superposition (recommended in Cleveland's paradigm)") +
    ggstatsplot::theme_ggstatsplot(),
  # plot 2: juxtaposition
  ggstatsplot::grouped_ggscatterstats(
    data = dplyr::filter(ggstatsplot::movies_long, genre == "Comedy" |
      genre == "Drama"),
    x = length,
    y = rating,
    grouping.var = genre,
    marginal = FALSE,
    messages = FALSE,
    title.prefix = "Genre",
    title.text = "juxtaposition (`ggstatsplot` implementation in `grouped_` functions)",
    title.size = 12
  ),
  # combine for comparison
  title.text = "Two ways to compare different aspects of data",
  nrow = 2,
  labels = c("(a)", "(b)")
)

The grouped_ plots follow the Shrink Principle (Tufte, 2001, p. 166-7) for high-information graphics, which holds that data density and the size of the data matrix can be maximized to exploit the maximum resolution of the available data-display technology. Given the high resolution afforded by most computer monitors today, saving grouped_ plots at an appropriate resolution ensures no loss in legibility despite the reduced graphics area.
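
One way to act on this when writing a multi-panel plot to disk is to raise the output resolution. The sketch below is only an illustration: it uses a faceted ggplot2 plot as a stand-in for a grouped_ plot, and the file name, dimensions, and `dpi` value are assumptions rather than package recommendations.

```r
# assumes ggplot2 is installed; a faceted plot stands in for a grouped_ plot
p <- ggplot2::ggplot(mtcars, ggplot2::aes(x = wt, y = mpg)) +
  ggplot2::geom_point() +
  ggplot2::facet_wrap(~cyl)

# illustrative output location
out_file <- file.path(tempdir(), "grouped_plot.png")

# a higher dpi keeps small panels legible when the figure is shrunk
ggplot2::ggsave(
  filename = out_file,
  plot = p,
  width = 8,
  height = 4,
  units = "in",
  dpi = 300
)
```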

Graphical integrity (and clean design)

Graphical excellence consists of communicating complex ideas with clarity, in a way that lets the viewer grasp the greatest number of ideas in the shortest time, without quoting the data out of context. The package follows the principles for graphical integrity (as outlined in Tufte, 2001):

There are some instances where ggstatsplot graphs don't follow principles of clean graphics, as formulated in the Tufte theory of data graphics (Tufte, 2001, Chapter 4). The theory has four key principles:

  1. Above all else show the data.
  2. Maximize the data-ink ratio.
  3. Erase non-data-ink.
  4. Erase redundant data-ink, within reason.

In particular, default plots in ggstatsplot can sometimes violate one of principles 2-4. According to these principles, every bit of ink should have a reason for its inclusion in the graphic and should convey some new information to the viewer. If not, such ink should be removed. One instance of this is the bilateral symmetry of data measures. For example, in Figure-1, both the box and violin plots are mirrored, which consumes twice the space in the graphic without adding any new information. This redundancy is tolerated for the sake of the beauty that such symmetrical shapes can bring to the graphic. Even Tufte admits that efficiency is but one consideration in the design of statistical graphics (Tufte, 2001, p. 137). Additionally, these principles were formulated in an era in which computer graphics had yet to revolutionize the ease with which graphics could be produced, so some of the concerns about minimizing data-ink for easier production of graphics are not as relevant as they once were.

Statistical analysis

As an extension of ggplot2, ggstatsplot has the same expectations about the structure of the data. More specifically,

# creating a new dataset without any NAs in variables of interest
msleep_no_na <-
  dplyr::filter(
    .data = ggplot2::msleep,
    !is.na(sleep_rem),
    !is.na(awake),
    !is.na(brainwt),
    !is.na(bodywt)
  )

# variable names vector
var_names <- c(
  "REM sleep",
  "time awake",
  "brain weight",
  "body weight"
)

# combining two plots
ggstatsplot::combine_plots(
  # plot *without* any NAs
  ggstatsplot::ggcorrmat(
    data = msleep_no_na,
    corr.method = "kendall",
    sig.level = 0.001,
    p.adjust.method = "holm",
    cor.vars = c(sleep_rem, awake:bodywt),
    cor.vars.names = var_names,
    matrix.type = "upper",
    colors = c("#B2182B", "white", "#4D4D4D"),
    title = "Correlogram for mammals sleep dataset",
    subtitle = "sleep units: hours; weight units: kilograms",
    messages = FALSE
  ),
  # plot *with* NAs
  ggstatsplot::ggcorrmat(
    data = ggplot2::msleep,
    corr.method = "kendall",
    sig.level = 0.001,
    p.adjust.method = "holm",
    cor.vars = c(sleep_rem, awake:bodywt),
    cor.vars.names = var_names,
    matrix.type = "upper",
    colors = c("#B2182B", "white", "#4D4D4D"),
    title = "Correlogram for mammals sleep dataset",
    subtitle = "sleep units: hours; weight units: kilograms",
    messages = FALSE
  ),
  labels = c("(a)", "(b)"),
  nrow = 1
)

Types of statistics supported

Functions | Description | Parametric | Non-parametric | Robust | Bayes Factor
------- | ------------------ | ---- | ----- | ----| -----
ggbetweenstats | Between group/condition comparisons | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggwithinstats | Within group/condition comparisons | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
gghistostats, ggdotplotstats | Distribution of a numeric variable | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggcorrmat | Correlation matrix | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\times$
ggscatterstats | Correlation between two variables | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$
ggpiestats, ggbarstats | Association between categorical variables | $\checkmark$ | NA | NA | $\checkmark$
ggpiestats, ggbarstats | Equal proportions for categorical variable levels | $\checkmark$ | NA | NA | $\times$
ggcoefstats | Regression model coefficients | $\checkmark$ | $\times$ | $\checkmark$ | $\times$

Types of statistical tests supported

Functions | Type | Test | Effect size | 95% CI available?
----------- | ----------- | ------------------ | ------------------ | -----
ggbetweenstats | Parametric | Student's and Welch's t-test | Cohen's d, Hedges' g | $\checkmark$
ggbetweenstats | Parametric | Fisher's and Welch's one-way ANOVA | $\eta^2, \eta^2_p, \omega^2, \omega^2_p$ | $\checkmark$
ggbetweenstats | Non-parametric | Mann-Whitney U-test | r | $\checkmark$
ggbetweenstats | Non-parametric | Kruskal-Wallis Rank Sum Test | $\eta^2_H$ | $\checkmark$
ggbetweenstats | Robust | Yuen's test for trimmed means | $\xi$ | $\checkmark$
ggbetweenstats | Robust | Heteroscedastic one-way ANOVA for trimmed means | $\xi$ | $\checkmark$
ggwithinstats | Parametric | Student's t-test | Cohen's d, Hedges' g | $\checkmark$
ggwithinstats | Parametric | Fisher's one-way repeated measures ANOVA | $\eta^2_p, \omega^2$ | $\checkmark$
ggwithinstats | Non-parametric | Wilcoxon signed-rank test | r | $\checkmark$
ggwithinstats | Non-parametric | Friedman test | $W_{Kendall}$ | $\checkmark$
ggwithinstats | Robust | Yuen's test on trimmed means for dependent samples | $\xi$ | $\checkmark$
ggwithinstats | Robust | Heteroscedastic one-way repeated measures ANOVA for trimmed means | $\times$ | $\times$
ggpiestats | Parametric | Pearson's $\chi^2$ test | Cramer's V | $\checkmark$
ggpiestats | Parametric | McNemar's test | Cohen's g | $\checkmark$
ggpiestats | Parametric | One-sample proportion test | Cramer's V | $\checkmark$
ggscatterstats/ggcorrmat | Parametric | Pearson's r | r | $\checkmark$
ggscatterstats/ggcorrmat | Non-parametric | Spearman's $\rho$ | $\rho$ | $\checkmark$
ggscatterstats/ggcorrmat | Robust | Percentage bend correlation | r | $\checkmark$
gghistostats/ggdotplotstats | Parametric | One-sample t-test | Cohen's d, Hedges' g | $\checkmark$
gghistostats | Non-parametric | One-sample Wilcoxon signed rank test | r | $\checkmark$
gghistostats/ggdotplotstats | Robust | One-sample percentile bootstrap | robust estimator | $\checkmark$
ggcoefstats | Parametric | Regression models | $\beta$ | $\checkmark$
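
To make the first pair of effect sizes in the table concrete, here is a minimal base R sketch of Cohen's d and Hedges' g for two independent groups. The helper names `cohens_d` and `hedges_g` are hypothetical and are not ggstatsplot functions; the small-sample correction shown is the common approximation, and the values the package itself reports may be computed differently.

```r
# hypothetical helper: Cohen's d for two independent groups (pooled SD)
cohens_d <- function(x, y) {
  n1 <- length(x)
  n2 <- length(y)
  s_pooled <- sqrt(((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2))
  (mean(x) - mean(y)) / s_pooled
}

# hypothetical helper: Hedges' g, i.e. d shrunk by the approximate
# small-sample correction factor 1 - 3 / (4 * (n1 + n2) - 9)
hedges_g <- function(x, y) {
  n <- length(x) + length(y)
  cohens_d(x, y) * (1 - 3 / (4 * n - 9))
}

# illustration with the built-in `ToothGrowth` data (two supplement types)
oj <- ToothGrowth$len[ToothGrowth$supp == "OJ"]
vc <- ToothGrowth$len[ToothGrowth$supp == "VC"]

d <- cohens_d(oj, vc)
g <- hedges_g(oj, vc)
```

Because the correction factor is below 1, g is always slightly smaller in magnitude than d, and the difference shrinks as the sample size grows.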

For the ggbetweenstats function, the following post-hoc tests are available for (adjusted) pairwise multiple comparisons:

Type | Design | Equal variance assumed? | Pairwise comparison test | p-value adjustment?
----------- | ----------- | --------- | ----------------------- | -----
Parametric | between-subjects | No | Games-Howell test | $\checkmark$
Parametric | between-subjects | Yes | Student's t-test | $\checkmark$
Parametric | within-subjects | NA | Student's t-test | $\checkmark$
Non-parametric | between-subjects | No | Dwass-Steel-Critchlow-Fligner test | $\checkmark$
Non-parametric | within-subjects | No | Durbin-Conover test | $\checkmark$
Robust | between-subjects | No | Yuen's trimmed means test | $\checkmark$
Robust | within-subjects | NA | Yuen's trimmed means test | $\checkmark$
Bayes Factor | between-subjects | No | $\times$ | $\times$
Bayes Factor | between-subjects | Yes | $\times$ | $\times$
Bayes Factor | within-subjects | NA | $\times$ | $\times$
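
For readers unfamiliar with adjusted pairwise comparisons, the following base R sketch shows the general idea using the parametric, equal-variance case (pairwise Student's t-tests). It relies only on the stats package and the built-in `PlantGrowth` data, and is meant as an illustration of p-value adjustment rather than a reproduction of what ggstatsplot runs internally.

```r
# pairwise Student's t-tests (pooled SD) with Bonferroni-adjusted p-values
res <- stats::pairwise.t.test(
  x = PlantGrowth$weight,
  g = PlantGrowth$group,
  p.adjust.method = "bonferroni"
)

# lower-triangular matrix of adjusted p-values
res$p.value

# the same kind of adjustment applied directly to raw p-values
p_raw <- c(0.01, 0.02, 0.03)
stats::p.adjust(p_raw, method = "bonferroni") # 0.03 0.06 0.09
stats::p.adjust(p_raw, method = "holm")       # 0.03 0.04 0.04
```

Bonferroni multiplies every p-value by the number of comparisons, whereas Holm applies a step-down version of the same idea and is never more conservative.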

Note-

Statistical variation

One of the important functions of a plot is to show the variation in the data, which comes in two forms:

# for reproducibility
set.seed(123)

# plot
ggstatsplot::gghistostats(
  data = morley,
  x = Speed,
  test.value = 792,
  test.value.line = TRUE,
  xlab = "Speed of light (km/sec, with 299000 subtracted)",
  title = "Figure-5: Distribution of Speed of light",
  caption = "Note: Data collected across 5 experiments (20 measurements each)",
  messages = FALSE
)
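
The inferential side of this plot can be approximated in base R. The sketch below assumes that the parametric test behind the plot is the one-sample t-test listed for gghistostats among the supported tests, and it uses the same reference value (792); the numbers shown in the plot subtitle may be formatted or computed differently.

```r
# one-sample t-test of the Michelson speed-of-light measurements
# against the reference value used in Figure-5
res <- stats::t.test(x = morley$Speed, mu = 792)

res$statistic # t statistic
res$parameter # degrees of freedom (n - 1)
res$conf.int  # 95% confidence interval for the mean
```

The confidence interval summarizes the uncertainty in the estimated mean, which is distinct from the spread of the individual measurements shown by the histogram bars.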

# for reproducibility
set.seed(123)

# creating model object
mod <- lme4::lmer(
  formula = total.fruits ~ nutrient + rack + (nutrient | popu / gen),
  data = lme4::Arabidopsis
)

# plot
ggstatsplot::ggcoefstats(
  x = mod,
  p.kr = FALSE
)

Reporting results

The default setting in ggstatsplot is to produce plots with statistical details included. More often than not, the results are displayed as a subtitle in the plot. Great care has been taken in deciding which details are included in statistical reporting and why.

  1. APA guidelines (APA, 2009) are followed (for the most part) by default:
     - Percentages are displayed with no decimal places (Figure-2).
     - Correlations, t-tests, and chi-squared tests are reported with the degrees of freedom in parentheses and the significance level (Figure-2, Figure-3b, Figure-5).
     - ANOVAs are reported with two degrees of freedom and the significance level (Figure-1).
     - Regression results are presented with the unstandardized or standardized estimate (beta), whichever was specified by the user, along with the statistic (depending on the model, this can be a t or z statistic) and the corresponding significance level (Figure-6).
     - With the exception of p-values, most statistics are rounded to two decimal places.

  2. Default statistical tests:

  3. Dealing with null results:

  4. Avoiding the "p-value error":
     The p-value is the probability of observing data at least as extreme as those obtained if the null hypothesis were true, and it can rarely be exactly 0. And yet over 97,000 manuscripts on Google Scholar report the p-value to be p = 0.000 (Lilienfeld et al., 2015), putatively due to relying on default computer outputs. All p-values displayed in ggstatsplot plots avoid this mistake: any p-value below 0.001 is displayed as p < 0.001 (e.g., Figure-1). The package deems it unimportant how infinitesimally small the p-values are and instead puts emphasis on the effect size magnitudes and their 95% CIs.
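
The display rule in the last item can be sketched in a few lines of base R. The helper `format_p` below is hypothetical and is not the function ggstatsplot uses internally; it only illustrates reporting very small p-values as an inequality instead of a misleading 0.000.

```r
# hypothetical helper, not part of ggstatsplot
format_p <- function(p, digits = 3) {
  # smallest value that can be shown with the given number of decimals
  threshold <- 10^(-digits)
  ifelse(
    p < threshold,
    paste0("p < ", formatC(threshold, format = "f", digits = digits)),
    paste0("p = ", formatC(p, format = "f", digits = digits))
  )
}

format_p(c(0.2, 0.0312, 0.00000004))
# "p = 0.200" "p = 0.031" "p < 0.001"
```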

Overall consistency in API

An attempt has been made to keep the application programming interface (API) consistent enough that specifying function calls takes little effort.

Conclusion

Acknowledgments

References

Appendix

Appendix A: Documentation

There are three main documents one can rely on to learn how to use ggstatsplot:

Appendix B: Suggestions

If you find any bugs or have any suggestions or remarks, please file an issue on the GitHub repository for this package: https://github.com/IndrajeetPatil/ggstatsplot/issues

Appendix C: Session information

Summarizing session information for reproducibility.

options(width = 200)
devtools::session_info()


IndrajeetPatil/ggstatsplot documentation built on June 17, 2019, 1:34 p.m.