knitr::opts_chunk$set(
  collapse = TRUE,
  out.width = "100%",
  dpi = 450,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

Introduction: The why and the what

Why the Bayesian Framework?

The Bayesian statistical framework is quickly gaining popularity among scientists. A number of reasons have been outlined for preferring this approach:

In general, the frequentist approach has been associated with an exclusive focus on null hypothesis significance testing (NHST). Additionally, the misuse of p-values - a key pillar of NHST - has been shown to critically contribute to the reproducibility crisis in the social and psychological sciences [@chambers2014instead; @szucs2016empirical]. There is an emerging consensus that adoption of the Bayesian approach is one way of overcoming these issues [@benjamin2018redefine; @etz2016bayesian].

What is the Bayesian Framework?

Adopting the Bayesian framework is more of a shift in paradigm than a change in methodology. Indeed, all the common statistical procedures (t-tests, correlations, ANOVAs, regressions, etc.) can be performed in the Bayesian framework. The key difference is that in the frequentist framework (the "classical" approach to statistics), the effects are fixed (but unknown) and the data are random. In other words, it assumes that the unknown parameter has a unique value that we are trying to estimate using our sample data. On the other hand, in the Bayesian framework, instead of estimating the "true effect", the probability of different effects given the observed data is computed, resulting in a distribution of possible values for the parameters, called the posterior distribution.

The uncertainty in Bayesian inference can be summarized, for instance, by the median of the distribution, as well as a range of values of the posterior distribution that includes the 95\% most probable values (the 95\% credible interval). Cum grano salis, these are considered the counterparts to the point-estimate and confidence interval in the frequentist framework. To illustrate the difference of interpretation, the Bayesian framework allows us to say "given the observed data, the effect has a 95\% probability of falling within this range", while the frequentist (less intuitive) alternative would be "when repeatedly computing confidence intervals from data of this sort, there is a 95\% probability that the effect falls within a given range". In essence, the Bayesian sampling algorithms (such as MCMC sampling) return a probability distribution (the posterior) of an effect that is compatible with the observed data. Thus, an effect can be described by characterizing its posterior distribution in relation to its centrality (point-estimates), uncertainty, as well as its existence and significance.

In other words, putting the maths behind it aside for a moment, we can say that the frequentist approach tries to estimate the real effect. For instance, the "real" value of the correlation between x and y. Hence, the frequentist models return a point-estimate (i.e., a single value and not a distribution) of the "real" correlation (e.g., $r = 0.42$) estimated under a number of assumptions.

The Bayesian framework assumes no such thing. The data are what they are. Based on the observed data (and a prior belief about the result), the Bayesian sampling algorithm (MCMC sampling is one example) returns a probability distribution (called "the posterior") of the effect that is compatible with the observed data. For the correlation between x and y, it will return a distribution that says, for example, "the most probable effect is 0.42, but this data is also compatible with correlations of 0.12 and 0.74 with certain probabilities". To characterize statistical significance of our effects, we do not need p-values, or any other such indices. We simply describe the posterior distribution of the effect.
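To make this concrete, here is a minimal, purely illustrative sketch in base R (the numbers are invented for the sake of the example): we simulate a "posterior" of a correlation and summarize it by a point-estimate and a range of probable values.

# Purely illustrative: 4000 hypothetical posterior draws of a correlation
posterior <- rnorm(4000, mean = 0.42, sd = 0.05)

median(posterior) # a point-estimate: the central plausible value
quantile(posterior, probs = c(0.025, 0.975)) # a crude 95% range of probable values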

How to do Bayesian analysis

Once you've installed the necessary packages, we can load rstanarm (to fit Bayesian regression models), bayestestR (to compute useful indices), and insight (to access the parameters).
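In case they are not installed yet, a one-time installation from CRAN is enough:

install.packages(c("rstanarm", "bayestestR", "insight"))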

library(rstanarm)
library(bayestestR)
library(insight)

Simple linear (regression) model

We will begin by conducting a simple linear regression to test the relationship between Petal.Length (our predictor, or independent, variable) and Sepal.Length (our response, or dependent, variable) from the iris dataset, which is included by default in R.

Let's start by fitting a frequentist version of the model, for reference:

model <- lm(Sepal.Length ~ Petal.Length, data = iris)
summary(model)

We can also zoom in on the parameters of interest to us:

insight::get_parameters(model)

In this model, the linear relationship between Petal.Length and Sepal.Length is positive and significant ($\beta = 0.41, t(148) = 21.6, p < .001$). This means that for each one-unit increase in Petal.Length (the predictor), you can expect Sepal.Length (the response) to increase by 0.41. This effect can be visualized by plotting the predictor values on the x axis and the response values as y using the ggplot2 package:

library(ggplot2) # Load the package

# The ggplot function takes the data as argument, and then the variables
# related to aesthetic features such as the x and y axes.
ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() + # This adds the points
  geom_smooth(method = "lm") # This adds a regression line

Now let's fit a Bayesian version of the model by using the stan_glm function in the {rstanarm} package:

model <- stan_glm(Sepal.Length ~ Petal.Length, data = iris)

You can see the sampling algorithm being run.

Extracting the posterior

Once it is done, let us extract the parameters (i.e., coefficients) of the model.

posteriors <- insight::get_parameters(model)

head(posteriors) # Show the first 6 rows

As we can see, the parameters take the form of a lengthy dataframe with two columns, corresponding to the intercept and the effect of Petal.Length. These columns contain the posterior distributions of these two parameters. In simple terms, the posterior distribution is a set of different plausible values for each parameter. Contrast this with the result we saw from the frequentist linear regression model using lm, where the results had a single value for each effect of the model, and not a distribution of values. This is one of the most important differences between these two frameworks.

Posterior draws

Let's look at the length of the posteriors.

nrow(posteriors) # Size (number of rows)

Why is the size 4000, and not more or less?

First of all, these observations (the rows) are usually referred to as posterior draws. The underlying idea is that the Bayesian sampling algorithm (e.g., Markov Chain Monte Carlo - MCMC) will draw from the hidden true posterior distribution. Thus, it is through these posterior draws that we can estimate the underlying true posterior distribution. Therefore, the more draws you have, the better your estimate of the posterior distribution. However, more draws also mean longer computation time.

If we look at the documentation (?sampling) for rstanarm's "sampling" algorithm used by default in the model above, we can see several parameters that influence the number of posterior draws. By default, there are 4 chains (you can think of them as distinct sampling runs), each creating 2000 iter (iterations, or draws). However, only half of these iterations are kept, as the first half is used for warm-up (the convergence of the algorithm). Thus, the total number of posterior draws equals 4 chains * (2000 iterations - 1000 warm-up) = 4000.

We can change that, for instance:

model <- stan_glm(
  formula = Sepal.Length ~ Petal.Length,
  data = iris,
  chains = 2,
  iter = 1000,
  warmup = 250
)

nrow(insight::get_parameters(model)) # Size (number of rows)

In this case, as would be expected, we have 2 chains * (1000 iterations - 250 warm-up) = 1500 posterior draws. But let's keep our first model with the default setup (as it has more draws).

Visualizing the posterior distribution

Now that we've understood where these values come from, let's look at them. We will start by visualizing the posterior distribution of our parameter of interest, the effect of Petal.Length.

ggplot(posteriors, aes(x = Petal.Length)) +
  geom_density(fill = "orange")

This distribution represents the probability (the y axis) of different effects (the x axis). The central values are more probable than the extreme values. As you can see, this distribution ranges from about 0.35 to 0.50, with the bulk of it being at around 0.41. And that's it, we have just described our first posterior distribution.

And this is the heart of Bayesian analysis. We don't need p-values, t-values, or degrees of freedom. Everything we need is contained within this posterior distribution.

Our description above is consistent with the values obtained from the frequentist regression (which resulted in a $\beta$ of 0.41). This is reassuring! Indeed, in most cases, Bayesian analysis does not drastically differ from the frequentist results or their interpretation. Rather, it makes the results more interpretable and intuitive, and easier to understand and describe. We can now go ahead and precisely characterize this posterior distribution.

Describing the Posterior

Unfortunately, it is often not practical to report the whole posterior distribution as a graph. We need to find a concise way to summarize it. We recommend describing the posterior distribution with 3 elements:

  1. A point-estimate, which is a one-value summary (similar to the $\beta$ in frequentist regressions).
  2. A credible interval representing the associated uncertainty.
  3. Some indices of significance, giving information about the relative importance of this effect.

Point-estimate

What single value can best represent my posterior distribution?

Centrality indices, such as the mean, the median, or the mode are usually used as point-estimates. But what's the difference between them?

Let's answer this by first inspecting the mean:

mean(posteriors$Petal.Length)

This is close to the frequentist $\beta$. But, as we know, the mean is quite sensitive to outliers or extreme values. Maybe the median could be more robust?

median(posteriors$Petal.Length)

Well, this is very close to the mean (and identical when rounding the values). Maybe we could take the mode, that is, the peak of the posterior distribution? In the Bayesian framework, this value is called the Maximum A Posteriori (MAP). Let's see:

map_estimate(posteriors$Petal.Length)

They are all very close!

Let's visualize these values on the posterior distribution:

ggplot(posteriors, aes(x = Petal.Length)) +
  geom_density(fill = "orange") +
  # The mean in blue
  geom_vline(xintercept = mean(posteriors$Petal.Length), color = "blue", size = 1) +
  # The median in red
  geom_vline(xintercept = median(posteriors$Petal.Length), color = "red", size = 1) +
  # The MAP in purple
  geom_vline(xintercept = map_estimate(posteriors$Petal.Length), color = "purple", size = 1)

Well, all these values give very similar results. Thus, we will choose the median, as this value has a direct meaning from a probabilistic perspective: there is a 50\% chance that the true effect is higher and a 50\% chance that it is lower (as the median divides the distribution into two equal parts).
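As a quick, purely illustrative check, about half of the posterior draws should indeed fall above the median:

# Proportion of posterior draws above the median: ~0.5 by definition
mean(posteriors$Petal.Length > median(posteriors$Petal.Length))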

Uncertainty

Now that we have a point-estimate, we have to describe the uncertainty. We could compute the range:

range(posteriors$Petal.Length)

But does it make sense to include all these extreme values? Probably not. Thus, we will compute a credible interval. Long story short, it's kind of similar to a frequentist confidence interval, but easier to interpret and easier to compute — and it makes more sense.

We will compute this credible interval based on the Highest Density Interval (HDI). It will give us the range containing the 89% most probable effect values. Note that we will use 89% CIs instead of 95% CIs (as in the frequentist framework), as the 89% level gives more stable results [@kruschke2014doing] and reminds us about the arbitrariness of such conventions [@mcelreath2018statistical].

hdi(posteriors$Petal.Length, ci = 0.89)

Nice, so we can conclude that the effect has an 89% chance of falling within the [0.38, 0.44] range. We have just computed the two most important pieces of information for describing our effects.

Effect significance

However, in many scientific fields it is not sufficient to simply describe the effects. Scientists also want to know whether the effect is significant in practical or statistical terms, or in other words, whether the effect is important. For instance, is the effect different from 0? How can we assess the significance of an effect?

Well, in this particular case, the answer is quite clear: all possible effect values (i.e., the whole posterior distribution) are positive and above 0.35, which is already substantial evidence that the effect is not zero.

But still, we want some objective decision criterion to say whether or not the effect is 'significant'. One approach, similar to the frequentist framework, would be to check whether the credible interval contains 0. If it does not, that would mean that our effect is 'significant'.
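As a rough sketch of this check (using the hdi() output from above; CI_low and CI_high are the column names returned by bayestestR):

# Does the 89% credible interval exclude 0?
ci <- hdi(posteriors$Petal.Length, ci = 0.89)
ci$CI_low > 0 | ci$CI_high < 0 # TRUE means 0 lies outside the interval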

But this index is not very fine-grained, no? Can we do better? Yes!

A linear model with a categorical predictor

Imagine for a moment you are interested in how the weight of chickens varies depending on two different feed types. For this example, we will start by selecting from the chickwts dataset (available in base R) two feed types of interest for us (we do have peculiar interests): meat meals and sunflowers.

Data preparation and model fitting

library(dplyr)

# We keep only rows for which feed is meatmeal or sunflower
data <- filter(chickwts, feed %in% c("meatmeal", "sunflower"))

Let's run another Bayesian regression to predict weight from the type of feed.

model <- stan_glm(weight ~ feed, data = data)

Posterior description

posteriors <- insight::get_parameters(model)

ggplot(posteriors, aes(x = feedsunflower)) +
  geom_density(fill = "red")

This represents the posterior distribution of the difference between meatmeal and sunflowers. It seems that the difference is positive (since the values are concentrated on the right side of 0). Eating sunflowers makes you more fat (at least, if you're a chicken). But, by how much?

Let us compute the median and the CI:

median(posteriors$feedsunflower)
hdi(posteriors$feedsunflower)

It makes you fatter by around 51 grams (the median). However, the uncertainty is quite high: there is an 89% chance that the difference between the two feed types lies between 14 and 91 grams. Is this effect different from 0?

ROPE Percentage

Testing whether this distribution is different from 0 doesn't make sense: 0 is a single value, and the probability that a continuous distribution takes exactly that value is zero, so it is trivially 'different' from it.

However, one way to assess significance could be to define an area around 0 which we will consider as practically equivalent to zero (i.e., absence of, or a negligible, effect). This is called the Region of Practical Equivalence (ROPE), and is one way of testing the significance of parameters.

How can we define this region? Let's say that, following the advice of a colleague (let's call him Prof. Sanders), we define the ROPE as the [-20, 20] range. All effects within this range are considered as null (negligible). We can now compute the proportion of the 89% most probable values (the 89% CI) which are not null, i.e., which are outside this range.

rope(posteriors$feedsunflower, range = c(-20, 20), ci = 0.89)

5% of the 89% CI can be considered as null. Is that a lot? Based on our guidelines, yes, it is too much. Based on this particular definition of ROPE, we conclude that this effect is not significant (the probability of being negligible is too high).

That said, to be honest, I have some doubts about this Prof. Sanders. I don't really trust his definition of ROPE. Is there a more objective way of defining it?

Yes! One common practice, for instance, is to use one tenth (1/10 = 0.1) of the standard deviation (SD) of the response variable, which can be considered a "negligible" effect size [@cohen1988statistical].

rope_value <- 0.1 * sd(data$weight)
rope_range <- c(-rope_value, rope_value)
rope_range

Let's redefine our ROPE as the region within the [-6.2, 6.2] range. Note that this can be directly obtained by the rope_range function :)

rope_value <- rope_range(model)
rope_value

Let's recompute the percentage in ROPE:

rope(posteriors$feedsunflower, range = rope_range, ci = 0.89)

With this reasonable definition of the ROPE, we observe that the 89\% CI of the posterior distribution of the effect does not overlap with the ROPE. Thus, we can conclude that the effect is significant (in the sense of important enough to be noted).

Probability of Direction (pd)

Maybe we are not interested in whether the effect is non-negligible. Maybe we just want to know if this effect is positive or negative. In this case, we can simply compute the proportion of the posterior that is positive, no matter the "size" of the effect.

n_positive <- posteriors %>%
  filter(feedsunflower > 0) %>% # select only positive values
  nrow() # Get the number of remaining rows

n_positive / nrow(posteriors) * 100

We can conclude that the effect is positive with a probability of 98%. We call this index the Probability of Direction (pd). It can, in fact, be computed more easily with the following:

p_direction(posteriors$feedsunflower)

Interestingly, it so happens that this index is usually highly correlated with the frequentist p-value. We can even roughly infer the corresponding p-value with a simple transformation:

pd <- 97.82
onesided_p <- 1 - pd / 100
twosided_p <- onesided_p * 2
twosided_p

If we ran our model in the frequentist framework, we should observe an effect with a two-sided p-value of approximately 0.044. Is that true?

Comparison to frequentist

summary(lm(weight ~ feed, data = data))

The frequentist model tells us that the difference is positive and significant ($\beta = 52, p = 0.04$).

Although we arrived at a similar conclusion, the Bayesian framework allowed us to develop a more profound and intuitive understanding of our effect, and of the uncertainty of its estimation.

All with one function

And yet, I agree, it was a bit tedious to extract and compute all the indices. But what if I told you that we can do all of this, and more, with only one function? Behold, describe_posterior!

This function computes all of the aforementioned indices, and can be run directly on the model:

describe_posterior(model, test = c("p_direction", "rope", "bayesfactor"))

There we have it! The median, the CI, the pd and the ROPE percentage.

Correlations

Frequentist version

Once again, let us begin with a frequentist correlation between two continuous variables, the width and the length of the sepals of some flowers. The data is available in R as the iris dataset.

We will compute a Pearson's correlation test, store the results in an object called result, and then display it:

result <- cor.test(iris$Sepal.Width, iris$Sepal.Length)
result

As you can see in the output, the test actually compared two hypotheses:

  - the null hypothesis (h0): no correlation,
  - the alternative hypothesis (h1): a non-null correlation.

Based on the p-value, the null hypothesis cannot be rejected: the correlation between the two variables is negative but non-significant ($r = -.12, p > .05$).

Bayesian correlation

To compute a Bayesian correlation test, we will need the BayesFactor package (you can install it by running install.packages("BayesFactor")). We can then load this package, compute the correlation using the correlationBF() function, and store the result.

library(BayesFactor)
result <- correlationBF(iris$Sepal.Width, iris$Sepal.Length)

Now, let us run our describe_posterior() function on that:

describe_posterior(result)

We see again many things here, but the important index for now is the median of the posterior distribution, -.11. This is (again) quite close to the frequentist correlation. We could, as previously, describe the credible interval, the pd, or the ROPE percentage, but we will focus here on another index provided by the Bayesian framework, the Bayes Factor (BF).

Bayes Factor (BF)

We said previously that a correlation test actually compares two hypotheses, a null one (absence of effect) with an alternative one (presence of an effect). The Bayes factor (BF) allows the same comparison and determines under which of these two models the observed data are more probable: a model with the effect of interest, or a null model without the effect of interest. In the context of our correlation example, the null hypothesis is that there is no correlation between the two variables ($h0: \rho = 0$, where $\rho$ stands for the Bayesian correlation coefficient), while the alternative hypothesis is that there is a correlation different from 0 - positive or negative ($h1: \rho \neq 0$).

We can use bayesfactor() to specifically compute the Bayes factor comparing those models:

bayesfactor(result)

We got a BF of 0.51. What does it mean?

Bayes factors are continuous measures of relative evidence, with a Bayes factor greater than 1 giving evidence in favour of one of the models (often referred to as the numerator), and a Bayes factor smaller than 1 giving evidence in favour of the other model (the denominator).

That's one of the reasons why the Bayesian framework is sometimes considered superior to the frequentist framework. Remember from your stats lessons that the p-value can only be used to reject h0, not to accept it. With the Bayes factor, you can measure evidence against - and in favour of - the null. In other words, in the frequentist framework, if the p-value is not significant, we can conclude that evidence for the effect is absent, but not that there is evidence for the absence of the effect. In the Bayesian framework, we can do the latter. This is important since sometimes our hypotheses are about the absence of an effect.

BFs representing evidence for the alternative against the null can be reversed using $BF_{01}=1/BF_{10}$ (the 01 and 10 correspond to h0 against h1 and h1 against h0, respectively) to provide evidence of the null against the alternative. This improves human readability^[If the effect is really strong, the BF values can be extremely high. So don't be surprised if you see BF values that have been log-transformed to make them more human readable.] in cases where the BF of the alternative against the null is smaller than 1 (i.e., in support of the null).

In our case, $BF_{01} = 1/0.51 \approx 2$, which indicates that the data are about 2 times more probable under the null than under the alternative hypothesis. Although this favours the null, it is considered only anecdotal evidence in its favour.
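In code, this reversal is just a division (using the rounded BF reported above):

bf_10 <- 0.51 # evidence for the alternative over the null (from above)
bf_01 <- 1 / bf_10 # evidence for the null over the alternative
bf_01 # roughly 2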

We can thus conclude that there is anecdotal evidence in favour of an absence of correlation between the two variables ($r_{median} = -0.11$, BF = 0.51), which is a much more informative statement than what we can do with frequentist statistics.

And that's not all!

Visualise the Bayes factor

In general, pie charts are an absolute no-go in data visualisation, as our brain's perceptive system heavily distorts the information presented in such a way^[An exception would be when the pie slices are well-labeled, so that our brain's perception system does not have to do the decoding work.]. Nevertheless, there is one exception: pizza charts. They are an intuitive way of interpreting the strength of evidence provided by BFs as an amount of surprise. Such "pizza plots" can be directly created with the see visualisation companion package for easystats (you can install it by running install.packages("see")):

library(see)

plot(bayesfactor(result)) +
  scale_fill_pizza()

So, after seeing this pizza, how much would you be surprised by the outcome of a blinded poke?

t-tests

Versicolor vs. virginica

Bayesian t-tests can be performed in a very similar way to correlations. We are particularly interested in two levels of the Species factor, versicolor and virginica. We will start by filtering out of iris the non-relevant observations corresponding to the setosa species, and we will then visualise the observations and the distribution of the Sepal.Width variable.

library(dplyr)
library(ggplot2)
library(see) # provides geom_violindot(), scale_fill_material(), and theme_modern()

# Select only two relevant species
data <- iris %>%
  filter(Species != "setosa") %>%
  droplevels()

# Visualise distributions and observations
data %>%
  ggplot(aes(x = Species, y = Sepal.Width, fill = Species)) +
  geom_violindot(fill_dots = "black", size_dots = 1) +
  scale_fill_material() +
  theme_modern()

It seems (visually) that virginica flowers have, on average, a slightly higher sepal width. Let's assess this difference statistically by using the ttestBF() function in the BayesFactor package.

Compute the Bayesian t-test

result <- BayesFactor::ttestBF(formula = Sepal.Width ~ Species, data = data)
describe_posterior(result)

From the indices, we can say that the difference in Sepal.Width between virginica and versicolor has a probability of 100% of being negative [from the pd and the sign of the median] (median = -0.19, 89% CI [-0.29, -0.092]). The data provide strong evidence against the null hypothesis (BF = 18).

Keep that in mind as we will see another way of investigating this question.

Logistic Model

A hypothesis for which one uses a t-test can also be tested using a binomial model (e.g., a logistic model). Indeed, the hypothesis "there is an important difference in this variable between the two groups" can be reformulated as "this variable is able to discriminate between (or classify) the two groups". Such models, however, are much more powerful than a t-test.

In the case of the difference in Sepal.Width between virginica and versicolor, the question becomes: how well can we classify the two species using only Sepal.Width?

Fit the model

library(rstanarm)

model <- stan_glm(
  Species ~ Sepal.Width,
  data = data,
  family = "binomial",
  refresh = 0
)

Visualise the model

Using the modelbased package.

library(modelbased)

vizdata <- estimate_relation(model)

ggplot(vizdata, aes(x = Sepal.Width, y = Predicted)) +
  geom_ribbon(aes(ymin = CI_low, ymax = CI_high), alpha = 0.5) +
  geom_line() +
  ylab("Probability of being virginica") +
  theme_modern()

Performance and Parameters

Once again, we can extract all indices of interest for the posterior distribution using our old pal describe_posterior().

describe_posterior(model, test = c("pd", "ROPE", "BF"))

library(performance)

model_performance(model)

Visualise the indices

TO DO.

library(see)

plot(rope(result))

Diagnostic Indices

About diagnostic indices such as Rhat and ESS.
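One way to inspect these diagnostics for the logistic model fitted above (a sketch: diagnostic_posterior() from bayestestR summarises them, and summary() on an rstanarm model also reports Rhat and the effective sample size):

# Rhat values close to 1 and a large ESS indicate that the chains converged
# and that the posterior draws are reliable.
diagnostic_posterior(model)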

Acknowledgments

bayestestR is part of the collaborative easystats ecosystem. Thus, we thank the members of easystats as well as the users.

References


