vignettes/dfCountPerf.md

Performance of dfCount()

After some basic (not especially thorough) testing, I believe that dfCount() is much faster than the equivalent table() call on large datasets, especially when the data is numeric. I used the microbenchmark package to compare the two functions on a few different datasets.

library(rsalad)
library(dplyr)
library(microbenchmark)
# Prepare all the datasets to test on
fDat <- nycflights13::flights
largeIntDat <- data.frame(col = rep(1:25, 100000))
largeCharDat <- data.frame(col = rep(letters[1:25], 100000))
smallDat <- data.frame(col = rep(1:25, 100))

# Run the benchmarking
m <-
  microbenchmark(
    dfCount(fDat, "day"), table(fDat$day),
    dfCount(fDat, "dest"), table(fDat$dest),
    dfCount(largeIntDat, "col"), table(largeIntDat$col),
    dfCount(largeCharDat, "col"), table(largeCharDat$col),
    dfCount(smallDat, "col"), table(smallDat$col),
    times = 10
  )
expr                                  min         mean       median          max  neval
dfCount(fDat, "day")            17.592022    26.477759    23.361650    47.942357     10
table(fDat$day)                124.502858   176.668992   178.783502   220.172421     10
dfCount(fDat, "dest")           20.467399    27.618526    26.889988    35.597483     10
table(fDat$dest)                28.812769    47.353390    46.714503    64.513612     10
dfCount(largeIntDat, "col")    142.112473   179.890952   172.354499   269.982579     10
table(largeIntDat$col)        1072.657027  1564.270936  1489.253169  2379.238431     10
dfCount(largeCharDat, "col")   109.545406   182.959669   202.074962   244.528265     10
table(largeCharDat$col)        200.472889   268.210916   278.629599   330.212632     10
dfCount(smallDat, "col")         2.376538     3.811807     3.696206     5.809396     10
table(smallDat$col)              1.081611     1.798311     1.881731     2.662426     10

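Every pair of rows corresponds to counting the same data using dfCount() vs table(). Going by the medians, the results show that dfCount() is roughly 8 times faster than table() on the numeric columns (fDat$day and largeIntDat), about 1.4 to 1.7 times faster on the character columns (fDat$dest and largeCharDat), and about half as fast as table() on the small dataset.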
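
One way to read the benchmark output is to divide each table() median by the matching dfCount() median. The snippet below is a minimal base-R sketch of that arithmetic. The medians are copied by hand from the output above, and the `medians` data frame and its column names are made up for this illustration rather than produced by microbenchmark.

```r
# Median timings copied by hand from the benchmark output above
# (the data frame and its column names are illustrative only)
medians <- data.frame(
  data    = c("fDat$day", "fDat$dest", "largeIntDat", "largeCharDat", "smallDat"),
  dfCount = c(23.361650, 26.889988, 172.354499, 202.074962, 3.696206),
  table   = c(178.783502, 46.714503, 1489.253169, 278.629599, 1.881731)
)

# A ratio above 1 means dfCount() was faster than table() on that data;
# a ratio below 1 means table() was faster
medians$speedup <- round(medians$table / medians$dfCount, 2)

print(medians)
```

With only 10 runs per expression, these ratios are rough indications rather than precise measurements.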

After performing this analysis, I realized that the speed boost is most likely due to dfCount() relying on dplyr. I then found that dplyr also has a count() function, which is just as fast as dfCount(); this further supports the hypothesis that the speed boost comes from dplyr. However, I still want to include this function in the package because it took a lot of hard work (and documentation!), and it also has a few differences from dplyr::count(). For example, dplyr::count() does not sort by default, which I find to be the less desirable behaviour, and dplyr::count() does not have a standard-evaluation version.
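
To make the sorting difference concrete, and to give a way of checking the dplyr hypothesis, here is a small sketch that uses only dplyr and microbenchmark. It does not call dfCount(). The `toy` and `intDat` data frames are hypothetical examples, and `intDat` is deliberately smaller than largeIntDat so that the code runs quickly.

```r
library(dplyr)
library(microbenchmark)

# Hypothetical toy data, not part of the benchmark above
toy <- data.frame(col = c("b", "a", "b", "c", "b", "a"))

# Default (sort = FALSE): rows follow the order of the values in `col`
unsorted <- count(toy, col)

# sort = TRUE: rows are ordered by descending count `n`
sorted <- count(toy, col, sort = TRUE)

print(unsorted)
print(sorted)

# Same shape as largeIntDat, but 10 times smaller (an assumption made
# here to keep the run short)
intDat <- data.frame(col = rep(1:25, 10000))

# Compare dplyr::count() with table() directly
bench <- microbenchmark(
  count(intDat, col),
  table(intDat$col),
  times = 10
)
print(bench)
```

If dplyr::count() beats table() here by a margin similar to the one dfCount() showed, that would be consistent with dplyr being the source of the speed boost, although timings will vary by machine and package version.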



daattali/rsalad documentation built on Oct. 28, 2019, 12:16 p.m.