knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This vignette quickly presents the rationale of the package then jump to a case study. If you are reading it within R/RStudio, note that pataqu has a dedicated website that gathers the doc with examples, vignettes, news, etc.

http://vbonhomme.github.io/pataqu

Rationale

Sources of uncertainties

We chiefly had in mind archaeological data which dating of remains are often temporally bounded between two certain events, named terminus post/ante quem, further abbreviated tpq and taq^[hence the package name which is also a delicate pun, in French]. The tpq is the earliest date, the taq the latest.

The real event may have happened anytime between these two boundaries with a flat, uniform, density of probability along the interval.

We can also think of radiocarbon dating, which comes with a prediction and a confidence interval (say $\mu ± \sigma$). But here the dating differ and the real event has a density of probability of gaussian nature with parameters $\mu$ and $\sigma$.

More generally, we will show how to deal with arbitrary uncertainties on x, whenever you have an idea on how it varies.

Consequences and how to inspect them

Whatever the source of the uncertainty on x, the graphics, tests and overall stories we can obtain from data may well be affected.

How robust is your lovely story regarding x uncertainties?

pataqu aims at visualizing and testing this using permutations. This is not the only way to do it but in our view it has the merit to only use information contained in the data itself.

The idea is:

  1. Simulate plausible new x values
  2. Display and/or test what you need
  3. Repeat many times and see what happens

Case study: taq and ptq

We will load the package and also use dplyr and ggplot2 from the tidyverse. More generally, if you library(tidyverse) you should not need much more and typically you will not need to dplyr:: or ggplot2:: like what you have in function examples. We treated well the examples so that they are meaningful.

library(dplyr)
library(ggplot2)
library(pataqu)

animals dataset

This dataset is made of real (and still unpublished) data where researchers measured a value of interest on archaeological remains, in several sites (with us or stratigraphical units in them), dated with tpq and taq. The remains belonged to four taxa.

head(animals)
# We only show the first lines but you can View(animals)

How would we treat that? We could decide to display the full interval and see what happens:

# pure cosmetics for lighter graphs
theme_set(theme_minimal())

ggplot(animals) +
  aes(xmin=tpq, xmax=taq, y=value, col=taxa) +
  geom_errorbarh(size=0.2, alpha=0.5)

What a mess! We could try to add a mid point and add a smoother on top of them.

animals %>% 
  mutate(x_new=(tpq+taq)/2) %>% 
  ggplot() +
  aes(x=x_new, y=value, col=taxa) +
  geom_point(size=0.1) +
  geom_smooth(method="loess", formula="y~x", se=FALSE) -> gg_mid
gg_mid

We can now see some trends but we lost the uncertainty on x in the meantime. The graph before may have also been the one below, with the same likelihood.

We take animals and simply draw other x_new values, not on the midpoint but somewhere between the tpq and taq of each observation We use set.seed for the sake of replicability only.

set.seed(2329)
# lets draw new x values
animals2 <- animals %>%
  dplyr::rowwise() %>%
  dplyr::mutate(x_new=stats::runif(n=1, min=tpq, max=taq)) %>%
  dplyr::ungroup()
# and redo gg_mid graph with this new tibble
gg_mid %+% animals2

Same same but different.

shake: generates a new dataset

The code used above is exactly what shake_uniform is made of behind the curtain. You can check by yourself and type pataqu::shake_uniform (no brackets) in the console. Have a look to the different shakers with ?shake.

animals
shake_uniform(animals, min=tpq, max=taq)

pataqu generalizes this idea with decoration around this pattern.

quake: generates many datasets

You can do several shakes at once with the function quake which is the real entry point of analyses with pataqu.

Here we will simulate only 5 new datasets for the sake of speed but you can generate thousands of them^[On my 2013 Mac Book Pro it takes 15 sec. to quake animals and have half a million observations, so this is reasonnably fast] with randomized datations.

quake uses one of the shakers to generate entire datasets, possibly thousands of them. Here, we have tpq and taq dating so we will use shake_uniform and we have to specify the name of the columns^[Arguments names (here min and max can be omitted as long as they come in the order expected by shake_uniform but I think it is a bad idea and not so painful to type].

many_animals <- quake(animals, k=5, shaker=shake_uniform, min=tpq, max=taq)

Let's have a look to the results:

many_animals

We know have r nrow(many_animals) observations. Another way to look at them would be to take the first observations in each permutation and for each species:

many_animals %>% 
  group_by(taxa, k) %>% 
  slice_head(n=1)

Observe that the new column x_new varies for each permutation but is still strictly bounded by tpq and taq.

Sometimes you may not want as much randomness. Perhaps your new x_new should only be allowed to vary within stratigraphical units. That's the job of shake_*_within which expects a by argument, so here again, we pass quake with it.

animals %>% 
  slice(1:10) %>% # we pick only 10 obs (only for speed here)
  quake(2, shake_uniform_within, tpq, taq, within=us)

Note that lines c(1, 3) in one hand, and c(6, 7) in the other share for the first iteration (and all others), the same x_new because they come from the same us.

spaghetti0 visualizing trend after quake

Let's say you are happy with the straightforward geom_smooth from ggplot2. spaghetti0 helps you to do this directly after quake.

spaghetti0(many_animals, y=value, by=taxa)

The name of the function should be clear now but why the trailing '0' to it? You're right, there is a spaghetti full stop but it comes after an intermediate modelling step we will see now.

To bin or not to bin?

You have many new x values now but how about having them fixed on x values of interest? Here, we can think of slicing the temporal range into centuries, or 20 groups of same size and then get estimates, for each permutations of the value of interest.

Such discretization of a continous data may seen paradoxical when we are precisely randomizing x. It is neither mandatory (if you're happy with spaghetti0) nor always desirable. Some of its merits and the motivation behind is detailed in ?fitting (and all ?fit_* function since they share the same page).

That being said, to do such discretizations you have two different approaches:

  1. define bins, grab data in them and then summarize them, eg using median, mean, etc.

  2. fit regression models on full range and then use them to predict y values using a common sequence of x values.

The first approach is the purpose of bin: the second is the purpose of fit_* functions.

bin: bins values and summarize

From now on, we will shift to animals_q which is a built it dataset of animals with 20 permutations and with dates ranging between 100 BC and 100 AD (ie -100 to 100). See ?animals_q.

bin discretizes x values using variations on cut yet we use the more convenient cut_number, cut_width and cut_interval that comes with ggplot2

Let's bin y values (still names value) every 50 years and get the median per taxa:

animals_binned <- bin(animals_q, y=value, by=taxa, fun=median, x_bin=seq(-100, 100, 50))
animals_binned

If you need to define bins based on equal number of observations per group, on equal range or width, see ?cutter.

Note that x_bin is now a factor. The natural graphical display is a good old boxplot:

ggplot(animals_binned) +
  aes( x_bin, y_bin, fill=taxa) +
  geom_boxplot(outlier.size = 0.2) +
  theme(axis.text.x = element_text(angle=45, h=1))

We will see how to test that in the next section.

fit_*: fits a regression model and predict

You can also choose to define a regression model for each permutation and use it to predict fixed values. In other words, to gain more control over what spaghetti0 offered you before, almost for free.

Here come the fit_* family:

animals_gam <- fit_gam(animals_q, y=value, by=taxa, x_pred=seq(-100, 100, 10))
animals_gam

Just as for bin before, see ?cutter should you want to define bins based on the number of observations in them, on width, etc.

Here finally comes spaghetti that displays such pre-digested data. In ggplot2 terms, spaghetti uses geom_line where spaghetti0 uses geom_smooth.

I know you want more graphs and less blabla (because you know you can find it in ?spaghetti anyway):

spaghetti(animals_gam, by=taxa)

In the same family, you also have fit_loess too and the good old fit_lm members. To all of them you can refine the formula to stick to what you want to model. We will try something like : $value ~ x^3 + x$ with no intercept.

animals_origin <- animals_q %>% 
  fit_lm(y=value, 
         formula_rhs = I(x_new^3) + x_new - 1,
         by=taxa,
         x_pred=seq(-100, 100, 10), 
         .keep_mod = TRUE)

The argument formula_rhs stands for "right hand side" and use the classical formula mini-grammar.

First, let's plot it:

spaghetti(animals_origin, by=taxa)

It's likely pure nonsense here but we do it anyway. Do not try this at lab without statistical safety belts.

Perhaps, you have noticed the .keep_mod argument above and the additional columns in animals_origins^[yes, we expect you to copy-paste every single command ;-)]:

animals_origin

When passing fit_* functions with .keep_mod=TRUE we do two things: keep the original models, add a single meaningful index on the quality of the model (see ?fitting, adj.r2 for gam and lm; rse for loess).

You need a little bit data handling but you can do nice things with them:

# first remove duplicates
lms <- animals_origin %>% group_by(k, taxa) %>% slice(1) %>% ungroup()

# very low r2 everywhere: pure nonsense confirmed
ggplot(lms) + 
  aes(taxa, adj_r2, col=taxa) + 
  geom_violin() + geom_jitter()

# you could also access all components of $mod columns, use broom on it, etc. etc.

test_*: test for differences

Whether you used the bin or fit pathway you have now slices of x (eg a factor) and y values^[not sure testing after a predict is a very orthodox idea though]. We will call bacl animals_binned from above.

animals_binned

One can now test globally when patterns between taxa differ:

test_globally(animals_binned, x=x_bin, y=y_bin, by=taxa)

And test for pairwise differences:

test_pairwise(animals_binned, x=x_bin, y=y_bin, by=taxa)

You may have warningsrelated to ties. See ?wilcox.test

Finally

This vignette shows most of what pataqu can help you at, yet most of the time presenting and discussing a spaghetti0 plot would be, in our humble opinions, both enough and parcimonious.

With that it mind, only two functions would be needed to get your results:

quake(animals, k=10, shake_uniform, tpq, taq) %>% 
  spaghetti0(y=value, by=taxa)


vbonhomme/pataqu documentation built on April 24, 2022, 10:03 p.m.