knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = FALSE )
We first explain the output of the comment in the PR:
Every bullet point corresponds to one expression you benchmarked in your
touchstone script in touchstone/script.R
with benchmark_run()
. The
contents of the former are in this (simplified) example:
library(touchstone) branches_install() benchmark_run( expr_before_benchmark = c("library(styler)", "cache_deactivate()"), without_cache = 'styler::style_pkg("touchstone/sources/here")', n = 30 ) benchmark_run( expr_before_benchmark = c("library(styler)", "cache_activate()"), cache_applying = 'styler::style_pkg("touchstone/sources/here")', n = 30 ) benchmark_run( expr_before_benchmark = c("library(styler)", "cache_deactivate()"), cache_recording = c( "gert::git_reset_hard(repo = 'touchstone/sources/here')", 'styler::style_pkg("touchstone/sources/here")' ), n = 30 ) benchmarks_analyze()
You can see that the expression cache_applying
took on average 0.1 seconds
on both branches.
And if we merge this pull request, cache_recording
and without_cache
will
be approximately 0.01 quicker than before the merge.
Imagine we only ran all expressions once per branch. Then, the differences could
be by chance. To avoid that, we run it n
times and compute a confidence
interval, which tells us how certain we can be about our estimated differences.
A 95% confidence interval tells us that if we were to repeat the benchmarking
experiment 100 times, our estimated speed change would be in the interval 95 out
of 100 times. It comes from a simple ANOVA model where we regress the elapsed
time on the branch from which the result comes. Like this:
fit <- stats::aov(elapsed ~ factor(branches), data = timings) confint(fit, ...)
We could in addition also control for block
, but as we only have two
observations per block, this required a lot of parameter estimates and would take
away statistical power from our parameter of interest. Our estimates are also
unbiased without it.
Since changes in percent are more relevant than absolute changes, the confidence
interval is reported relative to the target branch of the pull request. The
measured difference is statistically significantly different from zero if the
confidence interval does not overlap with 0. In the screenshot above, this is
not the case for any benchmarked expression, as they all range from some
negative to some positive number. If you increase n
, you can estimate the
speed implications of a pull request more precisely, meaning you'll get a more
narrow confidence interval which more likely does not cover zero. You'll then
easily reach statistical significance (and the CI run will take longer), but do
you really care if your code gets 0.000001% slower? Probably not. That's why you
also need to look at the range of the confidence interval.
We sample both branches in n
blocks. This gives us a more efficient estimate of
the speed difference than the completely at random, because completely at random
can result in or close to one of the following scenarios assuming there is an
upwards trend in available compute resources over the whole period:
We run the PR branch n
times before we run the target branch, the PR branch
is at disadvantage.
If we always switch branches, the first branch is at disadvantage.
The opposite effect occurs when compute power steadily decreases. So we sample randomly within blocks:
touchstone:::branches_upsample(c("main", "feature"))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.