fastplyr aims to provide a tidyverse frontend using a collapse backend. This means that, from a user’s point of view, the functions behave like their tidyverse equivalents, so existing code requires little to no change to convert. fastplyr is designed to handle operations that involve larger numbers of groups and generally larger data.
You can install the development version of fastplyr from GitHub with:
# install.packages("pak")
pak::pak("NicChr/fastplyr")
Load packages
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr 1.1.4 ✔ readr 2.1.5
#> ✔ forcats 1.0.0 ✔ stringr 1.5.1
#> ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
#> ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
#> ✔ purrr 1.0.4
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(fastplyr)
#>
#> Attaching package: 'fastplyr'
#>
#> The following object is masked from 'package:dplyr':
#>
#> desc
#>
#> The following objects are masked from 'package:tidyr':
#>
#> crossing, nesting
library(nycflights13)
library(bench)
While the syntax and user interface of fastplyr align very closely with dplyr most of the time, there are a few key differences, summarised in the table below (a short sketch illustrating two of them follows the table).

| | dplyr | fastplyr |
|---|---|---|
| `.by` | Groups are always sorted by order of first appearance when using `.by` | Groups are always sorted by default, even when using `.by`. One can use the other sorting through `f_group_by(.order = FALSE)` |
| Many groups | Generally slow for data with many groups | Designed to be fast for data with many groups |
| Handling of dots (`...`) | dplyr almost always executes `...` expressions in a way that later expressions depend on previous ones | Some functions like `f_summarise` and `f_expand` execute the expressions in `...` independently |
| Duplicate rows | No dedicated function for this; solutions using `group_by() \|> filter(n() > 1)` are generally slow for larger data | Dedicated function `f_duplicates` can do this very fast and with fine control |
| Unique group IDs | Achieved through `mutate(cur_group_id())` | Dedicated fast function `add_group_id()` |
| Row slicing | `slice()` supports data-masked expressions supplied to `...` | Data-masked expressions are not supported in the `f_slice_` functions; use `f_filter()` for this behaviour |
| Memory usage | High memory usage | Lower usage compared to dplyr |
| Joins | Accepts different types of joins, e.g. rolling and equality joins | Accepts only equality joins of the form `x == y` |
| rowwise | `rowwise_df` accepted and everything subsetted implicitly using `[[` | `rowwise_df` not accepted; one must use `f_rowwise_df`, which creates a `grouped_df` with a row ID column. Implicit `[[` subsetting does not occur |
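To make two of these differences concrete, here is a minimal sketch (not evaluated here, so outputs are omitted; the column choices are purely illustrative).

# dplyr evaluates `...` sequentially, so later expressions can
# reference earlier results:
flights |>
  summarise(n = n(), flights_per_day = n / 365)

# f_summarise evaluates its `...` expressions independently, so a later
# expression cannot rely on `n` like above; compute from the columns directly:
flights |>
  f_summarise(n = n(), mean_dep_delay = mean(dep_delay))

# Row slicing: slice() accepts data-masked expressions...
flights |> slice(which(dep_delay > 60))
# ...whereas the f_slice_ functions take plain row indices.
# Use f_filter() for the data-masked behaviour:
flights |> f_filter(dep_delay > 60)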
All tidyverse alternative functions are prefixed with ‘f_’. For example, dplyr::distinct becomes fastplyr::f_distinct.
flights |>
f_distinct(origin, dest)
#> ! The following functions will be optimised package-wide:
#> `sum`
#> `prod`
#> `mean`
#> `median`
#> `min`
#> `max`
#> `sd`
#> `var`
#> `dplyr::n`
#> `dplyr::first`
#> `dplyr::last`
#> `dplyr::n_distinct`
#> `dplyr::row_number`
#> `dplyr::lag`
#> `dplyr::lead`
#> `dplyr::cur_group`
#> `dplyr::cur_group_id`
#> `dplyr::cur_group_rows`
#>
#> Optimised expressions are independent from each other and typical data-masking
#> rules may not apply
#>
#> Run `options(fastplyr.optimise = FALSE)` to disable optimisations globally
#>
#> Run `options(fastplyr.inform = FALSE)` to disable this message
#> This message is displayed once per session.
#> # A tibble: 224 × 2
#> origin dest
#> <chr> <chr>
#> 1 EWR IAH
#> 2 LGA IAH
#> 3 JFK MIA
#> 4 JFK BQN
#> 5 LGA ATL
#> # ℹ 219 more rows
f_distinct has an additional .order argument; ordering the distinct rows this way is much faster than sorting them afterwards.
mark(
fastplyr_distinct_sort = flights |>
f_distinct(across(where(is.numeric)), .order = TRUE),
dplyr_distinct_sort = flights |>
distinct(across(where(is.numeric))) |>
arrange_all()
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_distinct_sort 24ms 26.7ms 31.9 35.6MB 21.9
#> 2 dplyr_distinct_sort 57.2ms 58.4ms 15.1 73.7MB 18.9
f_group_by operates very similarly to group_by, with an additional feature that allows you to specify whether the group data should be ordered or not. This ultimately controls whether the groups end up sorted in expressions like count and summarise, and likewise in f_count and f_summarise.
# Like dplyr
flights |>
f_group_by(month) |>
f_count()
#> # A tibble: 12 × 2
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
# With .order = FALSE, group data is in order of first appearance, not sorted
flights |>
f_group_by(month, .order = FALSE) |>
f_count()
#> # A tibble: 12 × 2
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 10 28889
#> 3 11 27268
#> 4 12 28135
#> 5 2 24951
#> # ℹ 7 more rows
Just a reminder that all fastplyr functions are interchangeable with dplyr ones, both ways.
### With dplyr::count
flights |>
f_group_by(month) |>
count()
#> # A tibble: 12 × 2
#> # Groups: month [12]
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
### With dplyr::group_by
flights |>
group_by(month) |>
f_count()
#> # A tibble: 12 × 2
#> month n
#> <int> <int>
#> 1 1 27004
#> 2 2 24951
#> 3 3 28834
#> 4 4 28330
#> 5 5 28796
#> # ℹ 7 more rows
f_summarise behaves like dplyr’s summarise except for two things: the result is always ungrouped (as if .groups = "drop" were supplied), and expressions supplied through ... are executed independently of one another.
grouped_flights <- flights |>
group_by(across(where(is.character)))
grouped_flights |>
f_summarise(
n = n(), mean_dep_delay = mean(dep_delay)
)
#> # A tibble: 52,807 × 6
#> carrier tailnum origin dest n mean_dep_delay
#> <chr> <chr> <chr> <chr> <int> <dbl>
#> 1 9E N146PQ JFK ATL 8 9.62
#> 2 9E N153PQ JFK ATL 5 -0.4
#> 3 9E N161PQ JFK ATL 3 -2
#> 4 9E N162PQ EWR DTW 1 160
#> 5 9E N162PQ JFK ATL 1 -6
#> # ℹ 52,802 more rows
And a benchmark
mark(
fastplyr_summarise = grouped_flights |>
f_summarise(
n = n(), mean_dep_delay = mean(dep_delay)
),
dplyr_summarise = grouped_flights |>
summarise(
n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE),
.groups = "drop"
)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_summarise 2.59ms 3.05ms 277. 3.58MB 15.9
#> 2 dplyr_summarise 597.01ms 597.01ms 1.68 9.57MB 15.1
Joins work much the same way as in dplyr.
left <- flights |>
f_select(origin, dest, time_hour)
hours <- sample(unique(left$time_hour), 5000)
right <- as.data.frame(unclass(as.POSIXlt(hours)))
right$time_hour <- hours
# Left join
left |>
f_left_join(right)
#> # A tibble: 336,776 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 336,771 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
# Inner join
left |>
f_inner_join(right)
#> # A tibble: 244,029 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 244,024 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
# Anti join
left |>
f_anti_join(right)
#> # A tibble: 92,747 × 3
#> origin dest time_hour
#> <chr> <chr> <dttm>
#> 1 LGA ATL 2013-01-01 14:00:00
#> 2 LGA ATL 2013-01-01 14:00:00
#> 3 EWR ORD 2013-01-01 14:00:00
#> 4 EWR SEA 2013-01-01 14:00:00
#> 5 EWR ORD 2013-01-01 14:00:00
#> # ℹ 92,742 more rows
# Semi join
left |>
f_semi_join(right)
#> # A tibble: 244,029 × 3
#> origin dest time_hour
#> <chr> <chr> <dttm>
#> 1 EWR IAH 2013-01-01 05:00:00
#> 2 LGA IAH 2013-01-01 05:00:00
#> 3 JFK MIA 2013-01-01 05:00:00
#> 4 JFK BQN 2013-01-01 05:00:00
#> 5 LGA ATL 2013-01-01 06:00:00
#> # ℹ 244,024 more rows
# Full join
left |>
f_full_join(right)
#> # A tibble: 336,776 × 14
#> origin dest time_hour sec min hour mday mon year wday
#> <chr> <chr> <dttm> <dbl> <int> <int> <int> <int> <int> <int>
#> 1 EWR IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 2 LGA IAH 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 3 JFK MIA 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 4 JFK BQN 2013-01-01 05:00:00 0 0 5 1 0 113 2
#> 5 LGA ATL 2013-01-01 06:00:00 0 0 6 1 0 113 2
#> # ℹ 336,771 more rows
#> # ℹ 4 more variables: yday <int>, isdst <int>, zone <chr>, gmtoff <int>
And a benchmark comparing fastplyr and dplyr joins
mark(
fastplyr_left_join = f_left_join(left, right, by = "time_hour"),
dplyr_left_join = left_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_left_join 6.33ms 6.65ms 145. 18MB 66.1
#> 2 dplyr_left_join 21.44ms 22.93ms 43.9 45MB 87.9
mark(
fastplyr_inner_join = f_inner_join(left, right, by = "time_hour"),
dplyr_inner_join = inner_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_inner_join 4.78ms 5.23ms 182. 22.2MB 56.6
#> 2 dplyr_inner_join 15.93ms 17.23ms 58.1 37.9MB 83.9
mark(
fastplyr_anti_join = f_anti_join(left, right, by = "time_hour"),
dplyr_anti_join = anti_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_anti_join 2.18ms 2.43ms 390. 3.76MB 25.5
#> 2 dplyr_anti_join 11.22ms 12.15ms 80.1 21.8MB 33.9
mark(
fastplyr_semi_join = f_semi_join(left, right, by = "time_hour"),
dplyr_semi_join = semi_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_semi_join 3.41ms 3.63ms 261. 7.8MB 32.3
#> 2 dplyr_semi_join 11.91ms 12.72ms 77.3 21.9MB 43.7
mark(
fastplyr_full_join = f_full_join(left, right, by = "time_hour"),
dplyr_full_join = full_join(left, right, by = "time_hour")
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_full_join 7.23ms 7.99ms 113. 19.3MB 41.2
#> 2 dplyr_full_join 20.4ms 21.79ms 44.9 44.6MB 117.
f_slice and the other f_slice_ functions are very fast for many groups.
grouped_flights |>
f_slice(1)
#> # A tibble: 52,807 × 19
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 7 614 615 -1 812 855
#> 2 2013 1 8 612 615 -3 901 855
#> 3 2013 1 9 615 615 0 NA 855
#> 4 2013 1 25 1530 1250 160 1714 1449
#> 5 2013 2 24 609 615 -6 835 855
#> # ℹ 52,802 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
grouped_flights |>
f_slice_head(3)
#> # A tibble: 125,770 × 19
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 7 614 615 -1 812 855
#> 2 2013 1 13 612 615 -3 853 855
#> 3 2013 2 3 617 615 2 902 855
#> 4 2013 1 8 612 615 -3 901 855
#> 5 2013 1 22 614 615 -1 857 855
#> # ℹ 125,765 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
A quick benchmark to prove the point
mark(
fastplyr_slice = grouped_flights |>
f_slice_head(n = 3),
dplyr_slice = grouped_flights |>
slice_head(n = 3)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice 20.09ms 24.8ms 39.1 23.8MB 9.78
#> 2 dplyr_slice 2.38s 2.38s 0.421 26.6MB 19.4
In dplyr, to work with group IDs you must use the mutate() + cur_group_id() paradigm. In fastplyr you can just use add_group_id(), which is blazing fast.
## Unique ID for each group
grouped_flights |>
add_group_id() |>
f_select(group_id)
#> ℹ Adding missed group variables:
#> carrier, tailnum, origin, dest
#> # A tibble: 336,776 × 5
#> # Groups: carrier, tailnum, origin, dest [52,807]
#> carrier tailnum origin dest group_id
#> <chr> <chr> <chr> <chr> <int>
#> 1 UA N14228 EWR IAH 35951
#> 2 UA N24211 LGA IAH 36937
#> 3 AA N619AA JFK MIA 8489
#> 4 B6 N804JB JFK BQN 15462
#> 5 DL N668DN LGA ATL 20325
#> # ℹ 336,771 more rows
Another benchmark
mark(
fastplyr_group_id = grouped_flights |>
add_group_id() |>
f_select(all_of(group_vars(grouped_flights)), group_id),
dplyr_group_id = grouped_flights |>
mutate(group_id = cur_group_id()) |>
select(all_of(group_vars(grouped_flights)), group_id)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_group_id 1.99ms 2.37ms 380. 3.35MB 13.9
#> 2 dplyr_group_id 236.33ms 237.57ms 4.18 3.24MB 20.9
Based closely on tidyr::expand, f_expand() can cross join multiple vectors and data frames.
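Before the benchmarks, a quick hedged sketch of the crossing behaviour (not evaluated here; it assumes f_expand follows tidyr::expand’s interface as stated above, and the column choices are arbitrary).

# Cross join the unique values of two columns
flights |> f_expand(origin, dest)

# Cross a column with an external vector
flights |> f_expand(origin, month = 1:12)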
mark(
fastplyr_expand = flights |>
f_group_by(origin, tailnum) |>
f_expand(month = 1:12),
tidyr_expand = flights |>
group_by(origin, tailnum) |>
expand(month = 1:12),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_expand 15.6ms 20.86ms 40.6 11.3MB 9.66
#> 2 tidyr_expand 4.14s 4.14s 0.242 81MB 3.86
# Using `.cols` in `f_expand()` is very fast!
mark(
fastplyr_expand = flights |>
f_group_by(origin, dest) |>
f_expand(.cols = c("year", "month", "day")),
tidyr_expand = flights |>
group_by(origin, dest) |>
expand(year, month, day),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_expand 11.6ms 13.3ms 71.1 16.8MB 9.88
#> 2 tidyr_expand 168.8ms 176.6ms 5.71 66.7MB 5.71
Finding duplicate rows is a very common dataset operation and there is a dedicated function, f_duplicates(), to do exactly this.
flights |>
f_duplicates(time_hour)
#> # A tibble: 329,840 × 1
#> time_hour
#> <dttm>
#> 1 2013-01-01 05:00:00
#> 2 2013-01-01 05:00:00
#> 3 2013-01-01 05:00:00
#> 4 2013-01-01 05:00:00
#> 5 2013-01-01 06:00:00
#> # ℹ 329,835 more rows
Benchmark against a common dplyr strategy for finding duplicates
mark(
fastplyr_duplicates = flights |>
f_duplicates(time_hour, .both_ways = TRUE, .add_count = TRUE, .keep_all = TRUE),
dplyr_duplicates = flights |>
add_count(time_hour) |>
filter(n > 1)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_duplicates 11.7ms 12.4ms 79.3 45.1MB 73.6
#> 2 dplyr_duplicates 52.3ms 54.7ms 18.5 59.5MB 37.0
In the worst-case scenarios, f_filter() is about the same speed as filter(), and in the best case it is much faster and more efficient. This is especially true for large data where small subsets of the data are returned.
full <- new_tbl(x = rnorm(5e07))
# A worst case scenario
mark(
fastplyr_filter = full |>
f_filter(abs(x) > 0),
dplyr_filter = full |>
filter(abs(x) > 0)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_filter 859ms 859ms 1.16 1.12GB 1.16
#> 2 dplyr_filter 667ms 667ms 1.50 1.68GB 3.00
# Best case scenario - filter results in small subset
mark(
fastplyr_filter = full |>
f_filter(x > 4),
dplyr_filter = full |>
filter(x > 4)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_filter 169ms 179ms 5.57 191MB 0
#> 2 dplyr_filter 354ms 366ms 2.73 763MB 2.73
Binding columns in particular is much faster, but binding rows is also substantially faster.
mark(
fastplyr_bind_cols = f_bind_cols(grouped_flights, grouped_flights),
dplyr_bind_cols = suppressMessages(
bind_cols(grouped_flights, grouped_flights)
),
check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_bind_cols 34.6µs 38.1µs 23596. 25.7KB 2.36
#> 2 dplyr_bind_cols 184.6ms 192.1ms 5.22 981.8KB 5.22
mark(
fastplyr_bind_rows = f_bind_rows(grouped_flights, grouped_flights),
dplyr_bind_rows = bind_rows(grouped_flights, grouped_flights)
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_bind_rows 49.6ms 52.4ms 19.0 86MB 0
#> 2 dplyr_bind_rows 160ms 162ms 6.17 158MB 3.09
For computing grouped quantiles, a typical tidy approach might use a mixture of reframe() and enframe(), which is a perfectly tidy and neat solution.
probs <- seq(0, 1, 0.25)
mtcars <- as_tbl(mtcars)
mtcars |>
group_by(cyl) |>
reframe(enframe(quantile(mpg, probs), "quantile", "mpg"))
#> # A tibble: 15 × 3
#> cyl quantile mpg
#> <dbl> <chr> <dbl>
#> 1 4 0% 21.4
#> 2 4 25% 22.8
#> 3 4 50% 26
#> 4 4 75% 30.4
#> 5 4 100% 33.9
#> # ℹ 10 more rows
fastplyr, though, has a dedicated function for quantile calculation, tidy_quantiles(), which requires less code to type.
# Wide
mtcars |>
tidy_quantiles(mpg, .by = cyl, pivot = "wide")
#> # A tibble: 3 × 6
#> cyl p0 p25 p50 p75 p100
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 4 21.4 22.8 26 30.4 33.9
#> 2 6 17.8 18.6 19.7 21 21.4
#> 3 8 10.4 14.4 15.2 16.2 19.2
# Long
mtcars |>
tidy_quantiles(mpg, .by = cyl, pivot = "long")
#> # A tibble: 15 × 3
#> cyl .quantile mpg
#> <dbl> <fct> <dbl>
#> 1 4 p0 21.4
#> 2 4 p25 22.8
#> 3 4 p50 26
#> 4 4 p75 30.4
#> 5 4 p100 33.9
#> # ℹ 10 more rows
Not only can you choose how to pivot as shown above, but you can also calculate quantiles for multiple variables.
multiple_quantiles <- mtcars |>
tidy_quantiles(across(where(is.numeric)), pivot = "long")
multiple_quantiles
#> # A tibble: 5 × 12
#> .quantile mpg cyl disp hp drat wt qsec vs am gear carb
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 p0 10.4 4 71.1 52 2.76 1.51 14.5 0 0 3 1
#> 2 p25 15.4 4 121. 96.5 3.08 2.58 16.9 0 0 3 2
#> 3 p50 19.2 6 196. 123 3.70 3.32 17.7 0 0 4 2
#> 4 p75 22.8 8 326 180 3.92 3.61 18.9 1 1 4 4
#> 5 p100 33.9 8 472 335 4.93 5.42 22.9 1 1 5 8
# The .quantile column is a convenient factor
multiple_quantiles$.quantile
#> [1] p0 p25 p50 p75 p100
#> Levels: p0 p25 p50 p75 p100
tidy_quantiles() is of course fast when many groups are involved.
mark(
fastplyr_quantiles = flights |>
tidy_quantiles(dep_delay, pivot = "long",
.by = c(year, month, day, origin)),
dplyr_quantiles = flights |>
group_by(year, month, day, origin) |>
reframe(enframe(quantile(dep_delay, seq(0, 1, 0.25), na.rm = TRUE))),
check = FALSE
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_quantiles 20.9ms 21.4ms 45.8 4.21MB 2.08
#> 2 dplyr_quantiles 155.4ms 155.4ms 6.43 24.98MB 19.3
Let’s run some more benchmarks for fun, this time including tidytable, which fastplyr is very similar to, as it also uses a tidy frontend, but with a data.table backend.
n_rows <- 10^7
n_groups <- 10^6
tbl <- new_tbl(x = rnorm(n_rows))
tbl <- tbl |>
mutate(y = as.character(round(x, 6)),
g = sample.int(n_groups, n_rows, TRUE))
tbl
#> # A tibble: 10,000,000 × 3
#> x y g
#> <dbl> <chr> <int>
#> 1 1.29 1.285351 433366
#> 2 -1.61 -1.613842 887462
#> 3 -0.787 -0.787209 550879
#> 4 -0.490 -0.489809 875660
#> 5 0.393 0.393453 550619
#> # ℹ 9,999,995 more rows
For this we will be using the .by argument from each package. Because fastplyr still sorts the groups by default, here we will set an internal option to use the alternative grouping algorithm, which returns groups in order of first appearance. This will likely be revisited at some point. To read about the differences, see ?collapse::GRP.
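As a small sketch of the two grouping types (collapse is fastplyr’s backend, so it is available; the expected results are noted as comments rather than run output):

v <- c("b", "a", "b")

# Sorted groups: "a" becomes group 1, so this should return 2 1 2
collapse::GRP(v, sort = TRUE)$group.id

# Groups by order of first appearance: "b" becomes group 1, giving 1 2 1
collapse::GRP(v, sort = FALSE)$group.id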
library(tidytable)
#> Warning: tidytable was loaded after dplyr.
#> This can lead to most dplyr functions being overwritten by tidytable functions.
#> Warning: tidytable was loaded after tidyr.
#> This can lead to most tidyr functions being overwritten by tidytable functions.
#>
#> Attaching package: 'tidytable'
#> The following objects are masked from 'package:fastplyr':
#>
#> crossing, desc, nesting
#> The following objects are masked from 'package:dplyr':
#>
#> across, add_count, add_tally, anti_join, arrange, between,
#> bind_cols, bind_rows, c_across, case_match, case_when, coalesce,
#> consecutive_id, count, cross_join, cume_dist, cur_column, cur_data,
#> cur_group_id, cur_group_rows, dense_rank, desc, distinct, filter,
#> first, full_join, group_by, group_cols, group_split, group_vars,
#> if_all, if_any, if_else, inner_join, is_grouped_df, lag, last,
#> lead, left_join, min_rank, mutate, n, n_distinct, na_if, nest_by,
#> nest_join, nth, percent_rank, pick, pull, recode, reframe,
#> relocate, rename, rename_with, right_join, row_number, rowwise,
#> select, semi_join, slice, slice_head, slice_max, slice_min,
#> slice_sample, slice_tail, summarise, summarize, tally, top_n,
#> transmute, tribble, ungroup
#> The following objects are masked from 'package:purrr':
#>
#> map, map_chr, map_dbl, map_df, map_dfc, map_dfr, map_int, map_lgl,
#> map_vec, map2, map2_chr, map2_dbl, map2_df, map2_dfc, map2_dfr,
#> map2_int, map2_lgl, map2_vec, pmap, pmap_chr, pmap_dbl, pmap_df,
#> pmap_dfc, pmap_dfr, pmap_int, pmap_lgl, pmap_vec, walk
#> The following objects are masked from 'package:tidyr':
#>
#> complete, crossing, drop_na, expand, expand_grid, extract, fill,
#> nest, nesting, pivot_longer, pivot_wider, replace_na, separate,
#> separate_longer_delim, separate_rows, separate_wider_delim,
#> separate_wider_regex, tribble, uncount, unite, unnest,
#> unnest_longer, unnest_wider
#> The following objects are masked from 'package:tibble':
#>
#> enframe, tribble
#> The following objects are masked from 'package:stats':
#>
#> dt, filter, lag
#> The following object is masked from 'package:base':
#>
#> %in%
tidy_tbl <- as_tidytable(tbl)
# Set an internal option so that all grouping uses the non-sorted type
options(.fastplyr.order.groups = FALSE)
tidytable::setDTthreads(1) # Single-threaded for fair comparison
mark(
fastplyr_slice = tbl |>
f_slice(3:5, .by = g),
tidytable_slice = tidy_tbl |>
slice(3:5, .by = g),
check = FALSE,
min_iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice 703.81ms 731.68ms 1.32 136MB 0.439
#> 2 tidytable_slice 5.91s 6.15s 0.161 188MB 1.94
mark(
fastplyr_slice_head = tbl |>
f_slice_head(n = 3, .by = g),
tidytable_slice_head = tidy_tbl |>
slice_head(n = 3, .by = g),
fastplyr_slice_tail = tbl |>
f_slice_tail(n = 3, .by = g),
tidytable_slice_tail = tidy_tbl |>
slice_tail(n = 3, .by = g),
check = FALSE,
min_iterations = 3
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 4 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_slice_head 549.82ms 553.13ms 1.78 187MB 0
#> 2 tidytable_slice_head 1.64s 1.68s 0.594 187MB 1.39
#> 3 fastplyr_slice_tail 563.74ms 569.14ms 1.76 191MB 0
#> 4 tidytable_slice_tail 2.87s 3.09s 0.325 187MB 1.84
Here we’ll calculate the mean of x for each group of g. Both tidytable and fastplyr have optimisations for mean() when it involves groups. tidytable internally uses data.table’s ‘GForce’ mean function, which is basically a dedicated C function to calculate means for many groups.
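As a hedged aside, you can watch data.table’s GForce optimisation fire by turning on verbose output. This small sketch is illustrative only; the exact verbose text varies by data.table version.

dt <- data.table::data.table(g = c(1L, 1L, 2L), x = c(1, 2, 3))
options(datatable.verbose = TRUE)
dt[, mean(x), by = g] # the verbose log should mention GForce ('gmean')
options(datatable.verbose = FALSE)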
mark(
fastplyr_sumarise = tbl |>
f_summarise(mean = mean(x), .by = g),
tidytable_sumarise = tidy_tbl |>
summarise(mean = mean(x), .by = g, .sort = FALSE),
check = FALSE,
min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_sumarise 215ms 231ms 4.34 57.2MB 0
#> 2 tidytable_sumarise 592ms 606ms 1.65 305.3MB 0.825
Benchmarking more statistical functions
mark(
fastplyr_sumarise2 = tbl |>
f_summarise(n = dplyr::n(), mean = mean(x), min = min(x), max = max(x), .by = g),
tidytable_sumarise2 = tidy_tbl |>
summarise(n = n(), mean = mean(x), min = min(x), max = max(x),
.by = g, .sort = FALSE),
check = FALSE,
min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_sumarise2 325ms 331ms 3.02 72.5MB 1.51
#> 2 tidytable_sumarise2 687ms 699ms 1.43 320.6MB 0.715
mark(
fastplyr_count = tbl |>
f_count(y, g),
tidytable_count = tidy_tbl |>
count(y, g),
check = FALSE,
min_iterations = 3
)
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 fastplyr_count 533.85ms 536.56ms 1.86 229MB 0.932
#> 2 tidytable_count 2.56s 2.56s 0.390 496MB 0.195
It’s clear that both fastplyr and tidytable are fast, and each has its strengths and weaknesses.