The synthetic package provides tooling to greatly simplify the creation of synthetic datasets for testing purposes. Its features include benchmarking of serialization solutions such as fst, arrow, fread / fwrite, sqlite, etc. By using a standardized method of serialization benchmarking, benchmark results become more reliable and easier to compare across solutions, as can be seen further down in this introduction.
Most R users will probably be familiar with the iris dataset, as it's widely used in package examples and tutorials:
library(dplyr)
iris %>%
  as_tibble()
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.1         3.5          1.4         0.2 setosa
#>  2          4.9         3            1.4         0.2 setosa
#>  3          4.7         3.2          1.3         0.2 setosa
#>  4          4.6         3.1          1.5         0.2 setosa
#>  5          5           3.6          1.4         0.2 setosa
#>  6          5.4         3.9          1.7         0.4 setosa
#>  7          4.6         3.4          1.4         0.3 setosa
#>  8          5           3.4          1.5         0.2 setosa
#>  9          4.4         2.9          1.4         0.2 setosa
#> 10          4.9         3.1          1.5         0.1 setosa
#> # ... with 140 more rows
But what if you need a dataset of a million rows? The synthetic package makes that straightforward. Simply define a dataset template using synthetic_table():
library(synthetic)
# define a synthetic table
synt_table <- synthetic_table(iris)
and generate a custom number of rows:
synt_table %>%
  generate(1e6) # a million rows
#> # A tibble: 1,000,000 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>
#>  1          5.7         3.8          1.7         0.3 setosa
#>  2          5.8         2.7          5.1         1.9 virginica
#>  3          6.1         2.8          4           1.3 versicolor
#>  4          5.1         3.5          1.4         0.2 setosa
#>  5          7.2         3.6          6.1         2.5 virginica
#>  6          6           2.2          5           1.5 virginica
#>  7          6.8         3            5.5         2.1 virginica
#>  8          6.2         3.4          5.4         2.3 virginica
#>  9          7.1         3            5.9         2.1 virginica
#> 10          6.4         2.8          5.6         2.2 virginica
#> # ... with 999,990 more rows
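Since the generated data is returned as a regular tibble, it can be piped straight into other tools. As a minimal sketch, you could serialize a generated dataset to disk with write_fst() from the fst package (the file name here is just an example):
library(fst)
# generate a million rows and write them to disk in the fst format
synt_table %>%
  generate(1e6) %>%
  write_fst("synthetic_iris.fst")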
You can also select specific columns:
synt_table %>%
  generate(1e6, "Species") # single column
#> # A tibble: 1,000,000 x 1
#>    Species
#>    <fct>
#>  1 versicolor
#>  2 virginica
#>  3 setosa
#>  4 versicolor
#>  5 versicolor
#>  6 virginica
#>  7 versicolor
#>  8 versicolor
#>  9 setosa
#> 10 virginica
#> # ... with 999,990 more rows
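Selecting more than one column presumably works the same way; assuming generate() accepts a character vector of column names (an assumption, not verified here), the call would look like this:
synt_table %>%
  generate(1e6, c("Species", "Sepal.Width")) # two columns; assumes a character vector is accepted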
Benchmarks performed with synthetic follow this standardized setup. But most importantly, with the use of synthetic, complex benchmarks are reduced to a few simple statements, increasing your productivity and reproducibility!
A lot of claims are made about the performance of serializers and databases, but the truth is that all solutions have their own strengths and weaknesses.
Define the template of a test dataset:
library(synthetic)
library(fst)
library(arrow)
# generator for 'fst benchmark' dataset
generator <- table_generator(
  "fst benchmark",
  function(nr_of_rows) {
    data.frame(
      Logical = sample_logical(nr_of_rows, true_false_na_ratio = c(85, 10, 5)),
      Integer = sample_integer(nr_of_rows, max_value = 100L),
      Real    = sample_integer(nr_of_rows, 1, 10000, max_distinct_values = 20) / 100,
      Factor  = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
    )
  }
)
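To get a feel for the samplers used in this template, you can also call them directly with a small number of rows; the exact values drawn will differ from run to run:
# a few draws from the column samplers used above
sample_logical(5, true_false_na_ratio = c(85, 10, 5))
sample_integer(5, max_value = 100L)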
Do some benchmarking on the fst format:
library(dplyr)
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst()) %>%
  bench_rows(1e7) %>%
  collect()
Congratulations, that’s your first structured benchmark :-)
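If you want to inspect the measurements afterwards, you can assign the collected results to a variable; collect() presumably returns the accumulated timings as a data frame (the exact columns may depend on the package version):
# keep the benchmark results for later inspection
results <- synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst()) %>%
  bench_rows(1e7) %>%
  collect()
results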
Now, let's add a second streamer and allow for two different sizes of datasets:
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_fst(), streamer_parquet()) %>% # two streamers
  bench_rows(1e7, 5e7) %>%
  collect()
As you can see, although benchmarking two solutions at different sizes is more complex than the single-solution benchmark, with synthetic it's just a matter of expanding some of the arguments.
Let's add two more streamers and bring compression settings into the mix:
synthetic_bench() %>%
  bench_generators(generator) %>%
  bench_streamers(streamer_rds(), streamer_fst(), streamer_parquet(), streamer_feather()) %>%
  bench_rows(1e7, 5e7) %>%
  bench_compression(50, 80) %>%
  collect()
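Finally, plotting the collected timings can make comparisons easier to read. The sketch below assumes the pipeline above was assigned to a variable results via collect(), and that the result contains columns named streamer and time; both column names are hypothetical and may differ in the actual package:
library(ggplot2)
# assumes the benchmark pipeline above was assigned: results <- ... %>% collect()
# 'streamer' and 'time' are hypothetical column names; adjust to the real output
ggplot(results, aes(x = streamer, y = time)) +
  geom_boxplot()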