bloomFilter: Counting with a Bloom Filter

Description Usage Arguments Details Value AUTO Author(s) See Also Examples

Description

Approximates the number of distinct combinations for the given attributes.

Usage

1
BloomFilter(data, exponent = 16, inputs = AUTO, outputs = count)

Arguments

data

an object of class "data".

exponent

the size of the hash is 2^n bits.

inputs

which attributes of data to perform the GLA on.

outputs

the desired column names of the result.

Details

This GLA functions similarly to CountDistinct but it uses a Bloom Filter to check for repeats rather than a full hash. By doing so, it requires much less space at the cost of accuracy. The error can be minimized by ensuring that exponent is set correctly.

The number of bits should be at least a tenth of the number of distinct elements. If not, then it is likely that hash will completely fill up. If this happens, the count of the data is returned as a worst case estimate. Further increasing the number of bits serves to minimize the worst case eror. If there are 10 bits per distinct element, then the estimate is guaranteed to be within 1% of the true.

In addition to saving space, the Bloom Filter is potentially much faster than CountDistinct for queries with a large number of distinct values. Merging states is done via bitwise OR on the Bloom Filter data structure and therefore the run time is practically independent of the size of the hashes, as it dominated by the time spent processing each tuple. In theory, the runtime is O(n + k), where n is the number of rows in data and k is the number of distinct elements.

Value

An object of class "data" with a single attribute. Upon conversion to a data frame, it will contain a single row.

AUTO

In the case of inputs = AUTO, all attributes of the data are used.

outputs is not allowed to be AUTO.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC

See Also

CountDistinct for a similarly functioning GLA.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
## number of bits is too small, count returned
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 20, inputs = l_partkey)
result <- as.data.frame(agg)

## number of bits is sufficient.
## result is 19, 999, 077. True value is 20, 000, 000.
data <- Read(lineitem100g)
agg <- BloomFilter(data, exponent = 24, inputs = l_partkey)
result <- as.data.frame(agg)

tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.