BloomFilter: Counting using a Bloom Filter

Description Usage Arguments Details Value Author(s) References See Also Examples

Description

BloomFilter approximates the number of distinct combinations of the given inputs.

Usage

1
BloomFilter(data, inputs, outputs, exponent = 16)

Arguments

data

A waypoint object.

inputs

The expressions whose distinct combinations are counted.

outputs

The column name of the result.

exponent

The size of the filter is 2^{exponent} bits.

Details

This GLA functions similarly to CountDistinct but it uses a Bloom filter to check for repeats rather than a full hash table. By doing so it requires much less space at the cost of accuracy. The error can be minimized by ensuring that exponent is set correctly. The number of bits should be at least a tenth of the number of distinct elements. If not, then it is likely that hash will completely fill up. If this happens, the count of the data is returned as a worst case estimate. Further increasing the number of bits serves to minimize the worst case eror. If there are 10 bits per distinct element, then the estimate is guaranteed to be within 1% of the true value. In addition to saving space, the Bloom Filter is potentially much faster than CountDistinct for queries with a large number of distinct values. Merging states is done via bitwise OR on the Bloom Filter data structure and therefore the run time is practically independent of the size of the hashes, as it dominated by the time spent processing each tuple. In theory, the runtime is O(n + k), where n is the number of rows in data and k is the number of distinct elements.

Value

A waypoint containing a single row and column whose name is given by outputs.

Author(s)

Jon Claus, jonterainsights@gmail.com, Tera Insights, LLC.

References

hrefhttp://en.wikipedia.org/wiki/Bloom_filterWikipedia for more information regarding the Bloom filter.

See Also

CountDistinct for a similar GLA.

Examples

1
2
3
4
5
6
7
8
9
## number of bits is too small, count returned
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 20, inputs = l_partkey)
result <- as.data.frame(agg)
## number of bits is sufficient.
## result is 19 999 077. True value is 20 000 000.
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 24, inputs = l_partkey)
result <- as.data.frame(agg)

tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.