Description Usage Arguments Details Value AUTO Author(s) See Also Examples
Approximates the number of distinct combinations for the given attributes.
1 | BloomFilter(data, exponent = 16, inputs = AUTO, outputs = count)
|
data |
an object of class |
exponent |
the size of the hash is 2^n bits. |
inputs |
which attributes of |
outputs |
the desired column names of the result. |
This GLA functions similarly to CountDistinct
but it
uses a Bloom Filter to check for repeats rather than a full hash. By
doing so, it requires much less space at the cost of accuracy. The
error can be minimized by ensuring that exponent is set correctly.
The number of bits should be at least a tenth of the number of distinct elements. If not, then it is likely that hash will completely fill up. If this happens, the count of the data is returned as a worst case estimate. Further increasing the number of bits serves to minimize the worst case eror. If there are 10 bits per distinct element, then the estimate is guaranteed to be within 1% of the true.
In addition to saving space, the Bloom Filter is potentially much
faster than CountDistinct for queries with a large number of distinct
values. Merging states is done via bitwise OR on the Bloom Filter data
structure and therefore the run time is practically independent of the
size of the hashes, as it dominated by the time spent processing each
tuple. In theory, the runtime is O(n + k), where n is the
number of rows in data
and k is the number of distinct
elements.
An object of class "data"
with a single attribute. Upon
conversion to a data frame, it will contain a single row.
In the case of inputs = AUTO
, all attributes of the data are
used.
outputs
is not allowed to be AUTO
.
Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC
CountDistinct
for a similarly functioning GLA.
1 2 3 4 5 6 7 8 9 10 | ## number of bits is too small, count returned
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 20, inputs = l_partkey)
result <- as.data.frame(agg)
## number of bits is sufficient.
## result is 19, 999, 077. True value is 20, 000, 000.
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 24, inputs = l_partkey)
result <- as.data.frame(agg)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.