#| label = "setup",
#| include = FALSE

source("../setup.R")
#| label = "suggested_pkgs",
#| include = FALSE

pkgs <- "PMCMRplus"

successfully_loaded <- purrr::map_lgl(pkgs, requireNamespace, quietly = TRUE)
can_evaluate <- all(successfully_loaded)

if (can_evaluate) {
  purrr::walk(pkgs, library, character.only = TRUE)
} else {
  knitr::opts_chunk$set(eval = FALSE)
}

You can cite this package/vignette as:

#| label = "citation",
#| echo = FALSE,
#| comment = ""
citation("ggstatsplot")

This vignette is still a work in progress.

Graphic design principles

Graphical perception

Graphical perception involves visual decoding of the information encoded in graphs. {ggstatsplot} incorporates the paradigm proposed in ([@cleveland1985], Chapter 4) so that visual judgments about quantitative information become effortless and almost instantaneous. Based on experiments, Cleveland proposes that there are ten elementary graphical-perception tasks that we perform to visually decode quantitative information in graphs (organized from most to least accurate; [@cleveland1985], p. 254).

So the key principle of Cleveland's paradigm for data display is:

"We should encode data on a graph so that the visual decoding involves [graphical-perception] tasks as high in the ordering as possible."

For example, decoding the data point values in ggbetweenstats requires position judgments along a common scale:

#| label = "fig1",
#| fig.height = 9,
#| fig.width = 10,
#| fig.cap = "Note that assessing differences in mean values between groups has been made easier
#| with the help of \\textit{position} of data points along a common scale (the Y-axis) and
#| labels."

ggbetweenstats(
  data = dplyr::filter(
    movies_long,
    genre %in% c("Action", "Action Comedy", "Action Drama", "Comedy")
  ),
  x = genre,
  y = rating,
  title = "IMDB rating by film genre",
  xlab = "Genre",
  ylab = "IMDB rating (average)"
)

There are a few instances where {ggstatsplot} diverges from recommendations made in Cleveland's paradigm:

#| label = "fig2",
#| fig.height = 4,
#| fig.width = 10,
#| fig.cap = "Pie charts don't follow Cleveland's paradigm to data display because they rely on
#| less accurate angle judgments. `{ggstatsplot}` sidesteps this issue by always labelling
#| percentages for pie slices, which makes angle judgments unnecessary."

ggpiestats(
  data = movies_long,
  x = genre,
  y = mpaa,
  title = "Distribution of MPAA ratings by film genre",
  legend.title = "Genre"
)
#| label = "fig3",
#| fig.height = 12,
#| fig.width = 10,
#| fig.cap = "Comparing different aspects of data is much more accurate in (\\textit{a}) a
#| \\textit{superposed} plot, which is recommended in Cleveland's paradigm, than in (\\textit{b})
#| a \\textit{juxtaposed} plot, which is how it is implemented in `{ggstatsplot}` package. This is
#| because displaying detailed results from statistical tests would be difficult in a superposed
#| plot."
library(ggplot2)


## creating a smaller data frame
df <- dplyr::filter(movies_long, genre %in% c("Comedy", "Drama"))

combine_plots(
  plotlist = list(
    # superposition
    ggplot(data = df, mapping = aes(x = length, y = rating, color = genre)) +
      geom_jitter(size = 3, alpha = 0.5) +
      geom_smooth(method = "lm") +
      labs(title = "superposition (recommended in Cleveland's paradigm)") +
      theme_ggstatsplot(),
    # juxtaposition
    grouped_ggscatterstats(
      data = df,
      x = length,
      y = rating,
      grouping.var = genre,
      marginal = FALSE,
      annotation.args = list(title = "juxtaposition (`{ggstatsplot}` implementation in `grouped_` functions)")
    )
  ),
  ## combine for comparison
  annotation.args = list(title = "Two ways to compare different aspects of data"),
  plotgrid.args = list(nrow = 2)
)

The grouped_ plots follow the Shrink Principle ([@tufte2001], p. 166-7) for high-information graphics, which dictates that the data density and the size of the data matrix can be maximized to exploit the maximum resolution of the available data-display technology. Given the high resolution afforded by most computer monitors today, saving grouped_ plots with appropriate resolution ensures no loss in legibility even when each panel occupies a reduced graphics area.
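As a rough sketch of what saving with appropriate resolution can look like, the snippet below writes a multi-panel plot to a PNG file with `ggplot2::ggsave()`. The faceted `mtcars` plot is only a stand-in for a grouped_ plot, and the file name, dimensions, and `dpi` value are illustrative assumptions rather than package recommendations.

```r
library(ggplot2)

## stand-in for a multi-panel `grouped_` plot (hypothetical example data)
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  facet_wrap(~cyl)

## write to a temporary file; 12 x 4 inches at 300 dpi gives 3600 x 1200 pixels
out_file <- file.path(tempdir(), "grouped_plot.png")
ggsave(
  filename = out_file,
  plot = p,
  width = 12,
  height = 4,
  units = "in",
  dpi = 300
)
```

Raising `dpi` (or the physical dimensions) increases the pixel count, so individual panels can stay legible even when each one occupies a small share of the figure.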

Graphical excellence

Graphical excellence consists of communicating complex ideas with clarity, so that the viewer grasps the greatest number of ideas in the shortest time, all the while not quoting the data out of context. The package follows the principles for graphical integrity [@tufte2001]:

p. 44-45). This is achieved by using the {ggrepel} package to place labels in a way that reduces their visual prominence.

There are some instances where {ggstatsplot} graphs don't follow principles of clean graphics, as formulated in the Tufte theory of data graphics ([@tufte2001], Chapter 4). The theory has four key principles:

  1. Above all else show the data.

  2. Maximize the data-ink ratio.

  3. Erase non-data-ink.

  4. Erase redundant data-ink, within reason.

In particular, default plots in {ggstatsplot} can sometimes violate one of principles 2-4. According to these principles, every bit of ink should have a reason for its inclusion in the graphic and should convey some new information to the viewer. If not, such ink should be removed. One instance of this is bilateral symmetry of data measures. For example, in the figure below, we can see that both the box and violin plots are mirrored, which consumes twice the space in the graphic without adding any new information. But this redundancy is tolerated for the sake of the beauty that such symmetrical shapes can bring to the graphic. Even Tufte admits that efficiency is but one consideration in the design of statistical graphics ([@tufte2001], p. 137). Additionally, these principles were formulated in an era in which computer graphics had yet to revolutionize the ease with which graphics could be produced, and thus some of the concerns about minimizing data-ink for easier production of graphics are not as relevant as they were.

Statistical variation

One of the important functions of a plot is to show the variation in the data, which comes in two forms:

#| label = "fig5",
#| fig.height = 6,
#| fig.width = 8,
#| fig.cap = "Distribution of a variable shown using `gghistostats`."

gghistostats(
  data = morley,
  x = Speed,
  test.value = 792,
  xlab = "Speed of light (km/sec, with 299000 subtracted)",
  title = "Distribution of measured Speed of light",
  caption = "Note: Data collected across 5 experiments (20 measurements each)"
)
#| label = "fig6",
#| fig.height = 5,
#| fig.width = 5,
#| fig.cap = "Sample-to-sample variation in regression estimates is displayed using confidence
#| intervals in `ggcoefstats()`."

model <- lme4::lmer(
  formula = total.fruits ~ nutrient + rack + (nutrient | gen),
  data = lme4::Arabidopsis
)

ggcoefstats(model)
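The two figures above show two different kinds of variation: the spread of a measured variable, and the sample-to-sample variation in an estimate. The base-R sketch below contrasts them using the same `morley` data. It is only an illustration of the distinction, not how `{ggstatsplot}` computes anything internally.

```r
## variation in the measurements themselves
speed <- morley$Speed
measurement_sd <- sd(speed)
measurement_range <- range(speed)

## sample-to-sample variation in an estimate:
## standard error and 95% confidence interval of the mean
n <- length(speed)
standard_error <- measurement_sd / sqrt(n)
ci_mean <- t.test(speed, conf.level = 0.95)$conf.int

measurement_sd # spread of individual measurements
standard_error # smaller: uncertainty about the mean
ci_mean
```

The standard error shrinks as the sample grows, whereas the spread of the individual measurements does not, which is why the two are displayed with different graphical elements.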

Statistical analysis

Data requirements

As an extension of {ggplot2}, {ggstatsplot} has the same expectations about the structure of the data. More specifically,

#| label = "fig4",
#| fig.height = 5,
#| fig.width = 10,
#| fig.cap = "`{ggstatsplot}` functions remove `NA`s from variables of interest and display total
#| sample size \\textit{n}, but they can give more nuanced information about sample sizes when
#| \\textit{n} differs across tests. For example, `ggcorrmat` will display (\\textit{a}) only one
#| total sample size once when no `NA`s present, but (\\textit{b}) will instead show minimum,
#| median, and maximum sample sizes across all correlation tests when `NA`s are present across
#| correlation variables."

## creating a new dataset without any NAs in variables of interest
msleep_no_na <-
  dplyr::filter(
    ggplot2::msleep,
    !is.na(sleep_rem), !is.na(awake), !is.na(brainwt), !is.na(bodywt)
  )

## variable names vector
var_names <- c("REM sleep", "time awake", "brain weight", "body weight")

## combining two plots using helper function in `{ggstatsplot}`
combine_plots(
  plotlist = purrr::pmap(
    .l = list(data = list(msleep_no_na, ggplot2::msleep)),
    .f = ggcorrmat,
    cor.vars = c(sleep_rem, awake:bodywt),
    cor.vars.names = var_names,
    colors = c("#B2182B", "white", "#4D4D4D"),
    title = "Correlogram for mammals sleep dataset",
    subtitle = "sleep units: hours; weight units: kilograms"
  ),
  plotgrid.args = list(nrow = 1)
)
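A single sample size cannot always be reported because each pairwise correlation uses only the rows where both variables are observed. The base-R sketch below counts those pairwise-complete observations. It uses the built-in `airquality` data as a stand-in (an assumption made so that the example has no package dependencies) and is not the code that `ggcorrmat` runs internally.

```r
## four variables, two of which (Ozone, Solar.R) contain NAs
vars <- airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]

## 1 where a value is observed, 0 where it is missing
observed <- (!is.na(vars)) * 1L

## entry [i, j] counts the rows where both variable i and variable j are observed
pairwise_n <- crossprod(observed)

## sample sizes for the six distinct pairs of variables
n_pairs <- pairwise_n[upper.tri(pairwise_n)]
c(min = min(n_pairs), median = median(n_pairs), max = max(n_pairs))
```

When no values are missing, all six counts coincide and one total sample size is enough; once they differ, a minimum, median, and maximum summarize them compactly.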

Statistical reporting

But why would combining statistical analysis with data visualization be helpful? We list a few reasons below:

The default setting in {ggstatsplot} is to produce plots with statistical details included. More often than not, these results are displayed as a subtitle in the plot. Great care has been taken in deciding which details are included in statistical reporting and why.

Template for reporting statistical details

APA guidelines [@apa2009] are followed by default while reporting statistical details:
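To give a feel for what such a report contains, the base-R sketch below assembles an APA-like string from a one-sample t-test on the `morley` data, using the same test value as the histogram above. The layout is a simplified assumption for illustration only; the expressions that `{ggstatsplot}` actually produces contain more details (for example, effect sizes) and are formatted differently.

```r
## one-sample t-test against the test value used in the histogram example
res <- t.test(morley$Speed, mu = 792)

## report very small p-values as an inequality instead of a long decimal
p_text <- if (res$p.value < 0.001) {
  "p < 0.001"
} else {
  sprintf("p = %.3f", res$p.value)
}

## assemble: statistic (df), p-value, confidence interval, sample size
report <- sprintf(
  "t(%d) = %.2f, %s, 95%% CI [%.1f, %.1f], n = %d",
  as.integer(res$parameter),
  unname(res$statistic),
  p_text,
  res$conf.int[1],
  res$conf.int[2],
  length(morley$Speed)
)
report
```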

Dealing with null results:

All functions therefore return Bayes Factors in favor of the null hypothesis by default. If the null hypothesis can't be rejected with the null hypothesis significance testing (NHST) approach, the Bayesian approach can help index evidence in favor of the null hypothesis (i.e., $BF_{01}$). By default, natural logarithms are shown because Bayes Factor values can sometimes be pretty large. Having values on a logarithmic scale also makes it easy to compare evidence in favor of the alternative ($BF_{10}$) versus the null ($BF_{01}$) hypothesis (since $log_{e}(BF_{10}) = - log_{e}(BF_{01})$).
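The sign relationship on the logarithmic scale follows from the two Bayes Factors being reciprocals of each other, $BF_{01} = 1 / BF_{10}$. A quick numeric check in base R, using a made-up value for $BF_{10}$:

```r
## hypothetical Bayes Factor in favor of the alternative hypothesis
bf10 <- 20
bf01 <- 1 / bf10

## on the natural-log scale the two differ only in sign
log(bf10)
log(bf01)
isTRUE(all.equal(log(bf10), -log(bf01)))
```

A positive value of $log_{e}(BF_{10})$ therefore indicates evidence for the alternative hypothesis, and the same magnitude with a negative sign is the corresponding $log_{e}(BF_{01})$.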

Suggestions

If you find any bugs or have any suggestions/remarks, please file an issue on GitHub: https://github.com/IndrajeetPatil/ggstatsplot/issues



IndrajeetPatil/ggstatplot documentation built on April 26, 2024, 10:27 a.m.