dfCount()
After some basic (not extremely thorough) testing, I believe that dfCount() performs much faster than its base R equivalent, table(), on large datasets, especially when the data is numeric. The analysis used the microbenchmark package to compare the two functions on a few different datasets.
library(rsalad)
library(dplyr)
library(microbenchmark)
# Prepare all the datasets to test on
fDat <- nycflights13::flights
largeIntDat <- data.frame(col = rep(1:25, 100000))
largeCharDat <- data.frame(col = rep(letters[1:25], 100000))
smallDat <- data.frame(col = rep(1:25, 100))
# Run the benchmarking
m <-
  microbenchmark(
    dfCount(fDat, "day"), table(fDat$day),
    dfCount(fDat, "dest"), table(fDat$dest),
    dfCount(largeIntDat, "col"), table(largeIntDat$col),
    dfCount(largeCharDat, "col"), table(largeCharDat$col),
    dfCount(smallDat, "col"), table(smallDat$col),
    times = 10
  )
| expr                         | min         | mean        | median      | max         | neval |
|------------------------------|-------------|-------------|-------------|-------------|-------|
| dfCount(fDat, "day")         | 17.592022   | 26.477759   | 23.361650   | 47.942357   | 10    |
| table(fDat$day)              | 124.502858  | 176.668992  | 178.783502  | 220.172421  | 10    |
| dfCount(fDat, "dest")        | 20.467399   | 27.618526   | 26.889988   | 35.597483   | 10    |
| table(fDat$dest)             | 28.812769   | 47.353390   | 46.714503   | 64.513612   | 10    |
| dfCount(largeIntDat, "col")  | 142.112473  | 179.890952  | 172.354499  | 269.982579  | 10    |
| table(largeIntDat$col)       | 1072.657027 | 1564.270936 | 1489.253169 | 2379.238431 | 10    |
| dfCount(largeCharDat, "col") | 109.545406  | 182.959669  | 202.074962  | 244.528265  | 10    |
| table(largeCharDat$col)      | 200.472889  | 268.210916  | 278.629599  | 330.212632  | 10    |
| dfCount(smallDat, "col")     | 2.376538    | 3.811807    | 3.696206    | 5.809396    | 10    |
| table(smallDat$col)          | 1.081611    | 1.798311    | 1.881731    | 2.662426    | 10    |
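A summary table with the columns shown above could be produced along these lines. This is only a sketch: it benchmarks two base R calls on a small made-up vector (`x` and `m_small` are illustrative names, not part of the original analysis), and the choice of milliseconds is an assumption, since the unit of the results above is not stated.

```r
library(microbenchmark)

# Small illustrative vector; not one of the datasets benchmarked above
x <- rep(1:25, 1000)

# Time two base R ways of counting values, 10 runs each
m_small <- microbenchmark(table(x), tabulate(x), times = 10)

# summary() returns a data frame; `unit` picks the time unit (assumed "ms" here)
out <- summary(m_small, unit = "ms")

# Keep only the columns shown in the table above
out[, c("expr", "min", "mean", "median", "max", "neval")]
```

Timings vary between runs and machines, so the numbers from this sketch are not comparable to the results reported here.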
Every pair of rows corresponds to counting the same data using dfCount() vs table(). The results show that:

- dfCount() was faster on all 4 large datasets
- dfCount() was close to an order of magnitude faster (roughly 7 to 9 times, by median) in both cases where the data was numeric
- dfCount() was slower (about 2 times, by median) on the very small dataset

After performing this analysis, I realized that the likely cause of the speed boost is that dfCount() relies on dplyr. I then found that dplyr also has a count() function, which performs as fast as dfCount(); this further supports the hypothesis that the speed boost is thanks to dplyr. However, I still want to include this function in the package because it took a lot of hard work (and documentation!), and it also has a few differences from dplyr::count(). For example, dplyr::count() does not sort by default, which I find to be the less desirable behaviour, and dplyr::count() does not have a standard-evaluation version.