CountDistinct: Count Distinct Combinations
In tera-insights/gtBase: R Interface for the Grokit System


Counts the number of distinct combinations for the given expressions.

CountDistinct(data, inputs, outputs = count)

`data`	A `waypoint` object.
`inputs`	The expressions whose distinct combinations are counted.
`outputs`	The column name of the result.

This GLA counts the number of distinct combinations of the given inputs by fully hashing every distinct combination. It therefore requires O(k) space, where k is the number of distinct combinations. The run time is O(n + k), where n is the number of rows in `data`. The second term comes from having to merge hashes between different states. When the number of distinct combinations is large, this merge step causes a significant slowdown; the BloomFilter GLA is recommended for such queries.
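The accumulate-and-merge pattern described above can be sketched in base R. This is only an illustration under stated assumptions, not the Grokit implementation: the helper names `combo_keys`, `accumulate`, and `merge_states` are hypothetical, each state is modeled as a character vector of unique keys, and R's `unique()` stands in for the hash set.

```r
# Illustrative sketch only; none of these helpers are part of gtBase.

# Build one string key per row from all input columns.
# The unit separator "\x1f" is assumed not to occur in the data.
combo_keys <- function(chunk) {
  do.call(paste, c(unname(as.list(chunk)), sep = "\x1f"))
}

# Fold a chunk of rows into a state: work proportional to the rows seen.
accumulate <- function(state, chunk) {
  unique(c(state, combo_keys(chunk)))
}

# Merge two partial states: work proportional to the distinct keys they hold.
merge_states <- function(a, b) {
  unique(c(a, b))
}

rows <- data.frame(
  a = c(1, 1, 2, 2, 3, 1),
  b = c("x", "x", "x", "y", "y", "x")
)

# Two partial states, as if built by separate workers.
state1 <- accumulate(character(0), rows[1:3, ])
state2 <- accumulate(character(0), rows[4:6, ])

# The final count is the size of the merged state.
total <- length(merge_states(state1, state2))
print(total)
```

Each state holds at most k keys, which is where the O(k) space bound comes from, and every merge has to touch the keys of both states, which is the source of the k term in the run time.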

A waypoint containing a single row and a single column whose name is given by `outputs`.

Jon Claus, <jonterainsights@gmail.com>, Tera Insights, LLC.

BloomFilter for a similar GLA.

## result is equal to the total number of tuples; there are no repetitions
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = c(l_tax, l_quantity, l_partkey))
result <- as.data.frame(agg)

## result is equal to the number of possible values for l_partkey as given
## in the TPC-H specification
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = l_partkey)
result <- as.data.frame(agg)