knitr::opts_chunk$set(tidy = FALSE, comment = "#>")
dfCount()
After some basic testing (not extremely thorough), I believe that dfCount() performs much faster than its base R equivalent table() on large datasets, especially when the data is numeric. The analysis was done with the microbenchmark package to compare the two functions on a few different datasets.
library(rsalad)
library(dplyr)
library(microbenchmark)

# Prepare all the datasets to test on
fDat <- nycflights13::flights
largeIntDat <- data.frame(col = rep(1:25, 100000))
largeCharDat <- data.frame(col = rep(letters[1:25], 100000))
smallDat <- data.frame(col = rep(1:25, 100))

# Run the benchmarking
m <- microbenchmark(
  dfCount(fDat, "day"), table(fDat$day),
  dfCount(fDat, "dest"), table(fDat$dest),
  dfCount(largeIntDat, "col"), table(largeIntDat$col),
  dfCount(largeCharDat, "col"), table(largeCharDat$col),
  dfCount(smallDat, "col"), table(smallDat$col),
  times = 10
)
knitr::kable(summary(m) %>% select(expr, min, mean, median, max, neval))
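Before reading the timings, it can be worth confirming that the two approaches actually produce the same counts. The sketch below is only an illustration: it uses dplyr::count() as a stand-in for the dplyr-based approach (not dfCount() itself, whose exact output columns are not shown here), and the names tab, cnt, same_values and same_counts are made up for this example.

```r
library(dplyr)

# Small illustrative dataset (same shape as smallDat in the benchmark)
smallDat <- data.frame(col = rep(1:25, 100))

# Base R: a table of counts, named by the distinct values
tab <- table(smallDat$col)

# dplyr stand-in: one row per distinct value, counts in column `n`
cnt <- dplyr::count(smallDat, col)

# Compare the two results value by value
same_values <- all(as.character(cnt$col) == names(tab))
same_counts <- all(cnt$n == as.vector(tab))
```

If both flags are TRUE, any timing difference reflects how the counting is done rather than what is counted.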
Every pair of rows corresponds to counting the same data using dfCount() vs table(). The results show that:
- dfCount() was faster in all 4 large datasets
- dfCount() was an order of magnitude faster in both cases when the data was numeric
- dfCount() was slower on the very small dataset

After performing this analysis, I've realized that the likely cause of the speed boost is dfCount() relying on dplyr. After making that realization, I found that dplyr also has a count() function, which performs as fast as dfCount(), further supporting the hypothesis that the speed boost was thanks to dplyr. However, I still want to include this function in the package because it took a lot of hard work (and documentation!), and it also has a few differences from dplyr::count(). For example, dplyr::count() does not sort by default, which I find to be the less desired behaviour, and dplyr::count() does not have a standard-evaluation version.
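To make the sorting difference concrete, here is a minimal sketch on a toy data frame. It assumes a dplyr release that supports tidy evaluation (`!!` together with rlang::sym()); the helper count_by_name() is hypothetical, written only for this illustration, and is not how dfCount() is implemented.

```r
library(dplyr)

dat <- data.frame(col = c("b", "a", "b", "c", "b", "a"))

# Default: rows are ordered by the grouping column, not by frequency
by_key <- dplyr::count(dat, col)

# sort = TRUE orders the rows by descending count instead
by_freq <- dplyr::count(dat, col, sort = TRUE)

# Hypothetical helper: takes the column name as a string and
# always returns the counts sorted from most to least frequent
count_by_name <- function(df, col_name) {
  dplyr::count(df, !!rlang::sym(col_name), sort = TRUE)
}

res <- count_by_name(dat, "col")
```

Here by_key lists the values alphabetically, while by_freq and res put the most frequent value first, which is the behaviour I prefer as a default.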