knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)


Overview

set.seed(87617)

The synthetic package provides tooling that greatly simplifies the creation of synthetic datasets for testing purposes. Its main features are synthetic dataset generation and standardized serialization benchmarking.

By using a standardized method of serialization benchmarking, benchmark results become more reliable and easier to compare across various solutions, as can be seen further down in this introduction.

Synthetic datasets

Most R users will probably be familiar with the iris dataset as it's widely used in package examples and tutorials:

library(dplyr)

iris %>%
  as_tibble()

But what if you need a million-row dataset for your purposes? The synthetic package makes that straightforward. Simply define a dataset template using synthetic_table():

library(synthetic)

# define a synthetic table
synt_table <- synthetic_table(iris)

With the template, you can generate any number of rows:

synt_table %>%
  generate(1e6) # a million rows

You can also select specific columns:

synt_table %>%
  generate(1e6, "Species")  # single column

Creating your own template

If you want to generate a dataset with specific characteristics for each of its columns, you can use column templates to specify each column directly:

# define a custom template
synt_table <- synthetic_table(
  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),
  Integer = template_integer(max_value = 100L),
  Real    = template_numerical_uniform(0.01, 100, max_distinct_values = 20)
  # Factor = template_string_random(5, 8)
)

synt_table %>%
  generate(10)

Benchmarking serialization

Benchmarks performed with synthetic use a standardized setup, which makes results easier to reproduce and to compare across solutions.

But most importantly, with synthetic, complex benchmarks are reduced to a few simple statements, increasing your productivity and reproducibility!

Walkthrough: setting up a benchmark

A lot of claims are made about the performance of serializers and databases, but the truth is that all solutions have their own strengths and weaknesses.


Define the template of a test dataset:
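The exact call used to build the generator is not shown in this introduction; as a minimal sketch, and assuming a table template built with synthetic_table() (as in the previous section) can be passed to bench_generators() below, it might look like this:

# minimal sketch -- assumes a synthetic_table() template can act as the generator
# handed to bench_generators() below; the exact generator API may differ
generator <- synthetic_table(
  Logical = template_logical(true_false_na_ratio = c(85, 10, 5)),
  Integer = template_integer(max_value = 100L),
  Real    = template_numerical_uniform(0.01, 100)
)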

Do some benchmarking on the fst format:

library(dplyr)

synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst()) %>%
  bench_rows(1e7) %>%
  collect()

Congratulations, that's your first structured benchmark :-)

Now, let's add a second streamer and allow for two different sizes of datasets:

synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst(), streamer_parquet()) %>%  # two streamers
  bench_rows(1e7, 5e7) %>%
  collect()

As you can see, although benchmarking two solutions at different sizes is more complex than the single-solution benchmark, with synthetic it's just a matter of expanding some of the arguments.

Let's add two more streamers and add compression settings to the mix:

synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_rds(), streamer_fst(), streamer_parquet(), streamer_feather()) %>%
  bench_rows(1e7, 5e7) %>%
  bench_compression(50, 80) %>%
  collect()

