Description Usage Arguments Details Value Author(s) References See Also Examples
BloomFilter
approximates the number of distinct combinations of the
given inputs.
1 | BloomFilter(data, inputs, outputs, exponent = 16)
|
data |
A |
inputs |
The expressions whose distinct combinations are counted. |
outputs |
The column name of the result. |
exponent |
The size of the filter is 2^{exponent} bits. |
This GLA functions similarly to CountDistinct
but it uses a
Bloom filter to check for repeats rather than a full hash table. By doing so
it requires much less space at the cost of accuracy. The error can be
minimized by ensuring that exponent is set correctly.
The number of bits should be at least a tenth of the number of distinct
elements. If not, then it is likely that hash will completely fill up. If
this happens, the count of the data is returned as a worst case estimate.
Further increasing the number of bits serves to minimize the worst case eror.
If there are 10 bits per distinct element, then the estimate is guaranteed to
be within 1% of the true value.
In addition to saving space, the Bloom Filter is potentially much faster than
CountDistinct for queries with a large number of distinct values. Merging
states is done via bitwise OR on the Bloom Filter data structure and
therefore the run time is practically independent of the size of the hashes,
as it dominated by the time spent processing each tuple. In theory, the
runtime is O(n + k), where n is the number of rows in data
and k is the number of distinct elements.
A waypoint
containing a single row and column whose
name is given by outputs
.
Jon Claus, jonterainsights@gmail.com, Tera Insights, LLC.
hrefhttp://en.wikipedia.org/wiki/Bloom_filterWikipedia for more information regarding the Bloom filter.
CountDistinct
for a similar GLA.
1 2 3 4 5 6 7 8 9 | ## number of bits is too small, count returned
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 20, inputs = l_partkey)
result <- as.data.frame(agg)
## number of bits is sufficient.
## result is 19 999 077. True value is 20 000 000.
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 24, inputs = l_partkey)
result <- as.data.frame(agg)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.