knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
This vignette quickly presents the rationale of the package then jump to a case study. If you are reading it within R/RStudio, note that pataqu has a dedicated website that gathers the doc with examples, vignettes, news, etc.
http://vbonhomme.github.io/pataqu
We chiefly had in mind archaeological data which dating of remains are often temporally bounded between two certain events, named terminus post/ante quem, further abbreviated tpq and taq^[hence the package name which is also a delicate pun, in French]. The tpq is the earliest date, the taq the latest.
The real event may have happened anytime between these two boundaries with a flat, uniform, density of probability along the interval.
We can also think of radiocarbon dating, which comes with a prediction and a confidence interval (say $\mu ± \sigma$). But here the dating differ and the real event has a density of probability of gaussian nature with parameters $\mu$ and $\sigma$.
More generally, we will show how to deal with arbitrary uncertainties on x, whenever you have an idea on how it varies.
Whatever the source of the uncertainty on x, the graphics, tests and overall stories we can obtain from data may well be affected.
How robust is your lovely story regarding x uncertainties?
pataqu aims at visualizing and testing this using permutations. This is not the only way to do it but in our view it has the merit to only use information contained in the data itself.
The idea is:
We will load the package and also use dplyr
and ggplot2
from the tidyverse. More generally, if you library(tidyverse)
you should not need much more and typically you will not need to dplyr::
or ggplot2::
like what you have in function examples. We treated well the examples so that they are meaningful.
library(dplyr) library(ggplot2) library(pataqu)
animals
datasetThis dataset is made of real (and still unpublished) data where researchers measured a value
of interest on archaeological remains, in several sites
(with us
or stratigraphical units in them), dated with tpq
and taq
. The remains belonged to four taxa
.
head(animals) # We only show the first lines but you can View(animals)
How would we treat that? We could decide to display the full interval and see what happens:
# pure cosmetics for lighter graphs theme_set(theme_minimal()) ggplot(animals) + aes(xmin=tpq, xmax=taq, y=value, col=taxa) + geom_errorbarh(size=0.2, alpha=0.5)
What a mess! We could try to add a mid point and add a smoother on top of them.
animals %>% mutate(x_new=(tpq+taq)/2) %>% ggplot() + aes(x=x_new, y=value, col=taxa) + geom_point(size=0.1) + geom_smooth(method="loess", formula="y~x", se=FALSE) -> gg_mid gg_mid
We can now see some trends but we lost the uncertainty on x in the meantime. The graph before may have also been the one below, with the same likelihood.
We take animals and simply draw other x_new
values, not on the midpoint but somewhere between the tpq and taq of each observation We use set.seed
for the sake of replicability only.
set.seed(2329) # lets draw new x values animals2 <- animals %>% dplyr::rowwise() %>% dplyr::mutate(x_new=stats::runif(n=1, min=tpq, max=taq)) %>% dplyr::ungroup() # and redo gg_mid graph with this new tibble gg_mid %+% animals2
Same same but different.
shake
: generates a new datasetThe code used above is exactly what shake_uniform
is made of behind the curtain. You can check by yourself and type pataqu::shake_uniform
(no brackets) in the console. Have a look to the different shakers with ?shake
.
animals shake_uniform(animals, min=tpq, max=taq)
pataqu generalizes this idea with decoration around this pattern.
quake
: generates many datasetsYou can do several shakes at once with the function quake
which is the real entry point of analyses with pataqu.
Here we will simulate only 5 new datasets for the sake of speed but you can generate thousands of them^[On my 2013 Mac Book Pro it takes 15 sec. to quake animals and have half a million observations, so this is reasonnably fast] with randomized datations.
quake
uses one of the shakers to generate entire datasets, possibly thousands of them. Here, we have tpq
and taq
dating so we will use shake_uniform
and we have to specify the name of the columns^[Arguments names (here min
and max
can be omitted as long as they come in the order expected by shake_uniform
but I think it is a bad idea and not so painful to type].
many_animals <- quake(animals, k=5, shaker=shake_uniform, min=tpq, max=taq)
Let's have a look to the results:
many_animals
We know have r nrow(many_animals)
observations. Another way to look at them would be to take the first observations in each permutation and for each species:
many_animals %>% group_by(taxa, k) %>% slice_head(n=1)
Observe that the new column x_new
varies for each permutation but is still strictly bounded by tpq
and taq
.
Sometimes you may not want as much randomness. Perhaps your new x_new
should only be allowed to vary within stratigraphical units. That's the job of shake_*_within
which expects a by
argument, so here again, we pass quake
with it.
animals %>% slice(1:10) %>% # we pick only 10 obs (only for speed here) quake(2, shake_uniform_within, tpq, taq, within=us)
Note that lines c(1, 3)
in one hand, and c(6, 7)
in the other share for the first iteration (and all others), the same x_new
because they come from the same us
.
spaghetti0
visualizing trend after quake
Let's say you are happy with the straightforward geom_smooth
from ggplot2. spaghetti0
helps you to do this directly after quake
.
spaghetti0(many_animals, y=value, by=taxa)
The name of the function should be clear now but why the trailing '0' to it? You're right, there is a spaghetti
full stop but it comes after an intermediate modelling step we will see now.
You have many new x values now but how about having them fixed on x values of interest? Here, we can think of slicing the temporal range into centuries, or 20 groups of same size and then get estimates, for each permutations of the value of interest.
Such discretization of a continous data may seen paradoxical when we are precisely randomizing x. It is neither mandatory (if you're happy with spaghetti0
) nor always desirable. Some of its merits and the motivation behind is detailed in ?fitting
(and all ?fit_*
function since they share the same page).
That being said, to do such discretizations you have two different approaches:
define bins, grab data in them and then summarize them, eg using median
, mean
, etc.
fit regression models on full range and then use them to predict y values using a common sequence of x values.
The first approach is the purpose of bin
: the second is the purpose of fit_*
functions.
bin
: bins values and summarizeFrom now on, we will shift to animals_q
which is a built it dataset of animals
with 20 permutations and with dates ranging between 100 BC and 100 AD (ie -100 to 100). See ?animals_q
.
bin
discretizes x values using variations on cut
yet we use the more convenient cut_number
, cut_width
and cut_interval
that comes with ggplot2
Let's bin y values (still names value
) every 50 years and get the median per taxa:
animals_binned <- bin(animals_q, y=value, by=taxa, fun=median, x_bin=seq(-100, 100, 50)) animals_binned
If you need to define bins based on equal number of observations per group, on equal range or width, see ?cutter
.
Note that x_bin
is now a factor. The natural graphical display is a good old boxplot:
ggplot(animals_binned) + aes( x_bin, y_bin, fill=taxa) + geom_boxplot(outlier.size = 0.2) + theme(axis.text.x = element_text(angle=45, h=1))
We will see how to test that in the next section.
fit_*
: fits a regression model and predictYou can also choose to define a regression model for each permutation and use it to predict fixed values. In other words, to gain more control over what spaghetti0
offered you before, almost for free.
Here come the fit_*
family:
animals_gam <- fit_gam(animals_q, y=value, by=taxa, x_pred=seq(-100, 100, 10)) animals_gam
Just as for bin
before, see ?cutter
should you want to define bins based on the number of observations in them, on width, etc.
Here finally comes spaghetti
that displays such pre-digested data. In ggplot2 terms, spaghetti
uses geom_line
where spaghetti0
uses geom_smooth
.
I know you want more graphs and less blabla (because you know you can find it in ?spaghetti
anyway):
spaghetti(animals_gam, by=taxa)
In the same family, you also have fit_loess
too and the good old fit_lm
members. To all of them you can refine the formula to stick to what you want to model. We will try something like : $value ~ x^3 + x$ with no intercept.
animals_origin <- animals_q %>% fit_lm(y=value, formula_rhs = I(x_new^3) + x_new - 1, by=taxa, x_pred=seq(-100, 100, 10), .keep_mod = TRUE)
The argument formula_rhs
stands for "right hand side" and use the classical formula
mini-grammar.
First, let's plot it:
spaghetti(animals_origin, by=taxa)
It's likely pure nonsense here but we do it anyway. Do not try this at lab without statistical safety belts.
Perhaps, you have noticed the .keep_mod
argument above and the additional columns in animals_origins
^[yes, we expect you to copy-paste every single command ;-)]:
animals_origin
When passing fit_*
functions with .keep_mod=TRUE
we do two things: keep the original models, add a single meaningful index on the quality of the model (see ?fitting
, adj.r2
for gam
and lm
; rse
for loess
).
You need a little bit data handling but you can do nice things with them:
# first remove duplicates lms <- animals_origin %>% group_by(k, taxa) %>% slice(1) %>% ungroup() # very low r2 everywhere: pure nonsense confirmed ggplot(lms) + aes(taxa, adj_r2, col=taxa) + geom_violin() + geom_jitter() # you could also access all components of $mod columns, use broom on it, etc. etc.
test_*
: test for differencesWhether you used the bin
or fit
pathway you have now slices of x (eg a factor
) and y values^[not sure testing after a predict is a very orthodox idea though]. We will call bacl animals_binned
from above.
animals_binned
One can now test globally when patterns between taxa differ:
test_globally(animals_binned, x=x_bin, y=y_bin, by=taxa)
And test for pairwise differences:
test_pairwise(animals_binned, x=x_bin, y=y_bin, by=taxa)
You may have warnings
related to ties. See ?wilcox.test
This vignette shows most of what pataqu can help you at, yet most of the time presenting and discussing a spaghetti0
plot would be, in our humble opinions, both enough and parcimonious.
With that it mind, only two functions would be needed to get your results:
quake(animals, k=10, shake_uniform, tpq, taq) %>% spaghetti0(y=value, by=taxa)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.