The Datasaurus data package

This package wraps the awesome Datasaurus Dozen dataset, which contains 13 sets of x-y data. Five summary statistics are (almost) the same for every sub-dataset: the mean of x, the mean of y, the standard deviation of x, the standard deviation of y, and the Pearson correlation between x and y. However, scatter plots reveal that each sub-dataset looks very different. The dataset is intended to teach students that it is important to plot their own datasets, rather than relying only on statistics.

The Datasaurus was created by Alberto Cairo in this great blog post.

Datasaurus shows us why visualisation is important, not just summary statistics.

He has since been made even more famous by the paper "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing" by Justin Matejka and George Fitzmaurice.

In the paper, Justin and George simulate a variety of datasets that have the same summary statistics as the Datasaurus but very different distributions.

This package looks to make these datasets available for use as an advanced Anscombe's Quartet, available in R as anscombe.
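
For comparison, the same five statistics can be computed for Anscombe's Quartet using base R alone. This is only a minimal sketch and not part of the package; it assumes the usual column layout of the built-in anscombe data frame (x1 to x4 and y1 to y4), and the names stats and set1 to set4 are purely illustrative.

```r
# Base R only: `anscombe` ships with the built-in datasets package.
# Assumption: columns are named x1..x4 and y1..y4.
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
})

# One column per quartet member, one row per statistic.
colnames(stats) <- paste0("set", 1:4)
print(round(stats, 2))
```

Each column of the resulting matrix should be nearly identical, which is the same effect the Datasaurus Dozen demonstrates with 13 sub-datasets instead of four.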


To see that the statistics are (almost) the same for each sub-dataset, you can use dplyr.

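  library(datasauRus)
  library(dplyr)

  # Compute the five summary statistics for each sub-dataset
  datasaurus_dozen %>%
    group_by(dataset) %>%
    summarize(
      mean_x    = mean(x),
      mean_y    = mean(y),
      std_dev_x = sd(x),
      std_dev_y = sd(y),
      corr_x_y  = cor(x, y)
    )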

To see that each sub-dataset looks very different, you can draw scatter plots.

  library(datasauRus)
  library(ggplot2)
  ggplot(datasaurus_dozen, aes(x=x, y=y, colour=dataset))+
    geom_point()+
    theme(legend.position = "none")+
    facet_wrap(~dataset, ncol=3)

datasauRus documentation built on Sept. 21, 2018, 6:15 p.m.