This is running dplyr and dtplyr on the "rquery Modes Example".
First we define the functions and data we used in the rquery modes example example.
library(rquery) library(data.table) library(microbenchmark)
library(dplyr) packageVersion('dplyr') library(dtplyr) packageVersion('dtplyr')
set.seed(2019) n_rows <- 1000000 d_large <- data.frame( x = rnorm(n = n_rows), y = rnorm(n = n_rows), g = sample(paste0('v_', seq_len(n_rows/10)), size = n_rows, replace = TRUE), stringsAsFactors = FALSE )
ops <- local_td(d_large) %.>% # Describe table for later operations extend(., # add a new column ratio := y / x) %.>% extend(., # rank the rows by group and order simple_rank := row_number(), partitionby = 'g', orderby = 'ratio', reverse = 'ratio') %.>% extend(., # mark the rows we want choice := simple_rank == 1) d_large %.>% ops %.>% order_rows(., 'g') %.>% select_rows(., choice) %.>% head(.) %.>% knitr::kable(.)
f_compiled <- function(dat) { dat %.>% ops # use pre-compiled pipeline } f_immediate <- function(dat) { dat %.>% extend(., # add a new column ratio := y / x) %.>% extend(., # rank the rows by group and order simple_rank := row_number(), partitionby = 'g', orderby = 'ratio', reverse = 'ratio') %.>% extend(., # mark the rows we want choice := simple_rank == 1) } f_wrapped <- function(dat) { dat %.>% wrap %.>% # wrap data in a description extend(., # add a new column ratio := y / x) %.>% extend(., # rank the rows by group and order simple_rank := row_number(), partitionby = 'g', orderby = 'ratio', reverse = 'ratio') %.>% extend(., # mark the rows we want choice := simple_rank == 1) %.>% ex # signal construction done, and execute }
The dplyr version of the pipeline is similar, except the window functions are not a single step- but a 4 stage block.
f_dplyr <- function(dat) { dat %>% mutate( # add a new column ratio := y / x) %>% group_by( # rank the rows by group and order g) %>% arrange( -ratio) %>% mutate( simple_rank := row_number()) %>% ungroup() %>% # end of rank block mutate( # mark the rows we want choice := simple_rank == 1) }
We are using the most current CRAN versions of each (dtplyr is currently being re-engineered to try to also cut down the number conversions).
Above we see a key difference between rquery and dplyr: rquery grouped and window functions are single operators in rquery, but are driven by annotations between steps in dplyr.
dtplyr seems to error-out on this problem, meaning the automatic translations from dplyr to data.table are not sufficient to our task.
f_dplyr(data.table(d_large))
We can try to re-write the dtplyr pipeline as follows. It appears switching from := to = and replacing row_number() with .I helps.
f_dtplyr <- function(dat) { data.table(dat) %>% mutate( # add a new column ratio = y / x) %>% group_by( # rank the rows by group and order g) %>% arrange( -ratio) %>% mutate( simple_rank = 1:.N) %>% ungroup() %>% # end of rank block mutate( # mark the rows we want choice = simple_rank == 1) } res_dtplyr <- f_dtplyr(d_large) res_dtplyr %.>% order_rows(., 'g') %.>% select_rows(., choice) %.>% head(.) %.>% knitr::kable(.)
And we can also time data.table itself (without the translation overhead, though we are adding in the time to convert the data.frame).
f_data_table = function(dat) { dat <- data.table(dat) dat[ , ratio := y / x ][order(-ratio) , simple_rank := 1:.N, by = list(g) ][ , choice := simple_rank == 1] } res_dt <- f_data_table(d_large) res_dt %.>% order_rows(., 'g') %.>% select_rows(., choice) %.>% head(.) %.>% knitr::kable(.)
timings <- microbenchmark( rquery_compiled = f_compiled(d_large), rquery_immediate = f_immediate(d_large), rquery_wrapped = f_wrapped(d_large), dplyr = f_dplyr(d_large), dtplyr = f_dtplyr(d_large), data.table = f_data_table(d_large), times = 10L ) print(timings)
For these short pipelines the extra copying in rquery immediate mode and dtplyr are not causing big problems compared to the overall translation overhead.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.