CountDistinct: Count Distinct Combinations

Description

Counts the number of distinct combinations for the given expressions.

Usage

1

Arguments

data

A waypoint object.

inputs

The expressions whose distinct combinations are counted.

outputs

The column name of the result.

Details

This GLA counts the number of distinct combinations of the given inputs by hashing every distinct combination it encounters. It therefore requires O(k) space, where k is the number of distinct combinations. The run time is O(n + k), where n is the number of rows in data; the second term comes from merging hashes between different states. Because of this merge cost, a large number of distinct values leads to a significant slowdown, and BloomFilter is recommended for such queries.
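
The space and merge costs described above can be illustrated with a minimal sketch in plain R. This is not the gtBase implementation: the helper names (new_state, add_rows, merge_states, count_distinct) are hypothetical, and an R environment stands in for the hash table. Each partial state holds one key per distinct combination (the O(k) space), and merging two states touches every key of one of them (the O(k) merge term).

```r
# Hypothetical helpers for illustration only; none of these are part of gtBase.

# A partial state is a hashed environment used as a set of keys.
new_state <- function() new.env(hash = TRUE, parent = emptyenv())

# Add a chunk of rows: each row's combination of values becomes one key.
add_rows <- function(state, rows) {
  # Join the column values of each row with a separator unlikely to occur in data.
  keys <- do.call(paste, c(unname(as.list(rows)), sep = "\x1f"))
  for (k in keys) assign(k, TRUE, envir = state)
  invisible(state)
}

# Merge state b into state a; the cost is proportional to the keys held in b.
merge_states <- function(a, b) {
  for (k in ls(b, all.names = TRUE, sorted = FALSE)) assign(k, TRUE, envir = a)
  invisible(a)
}

# The distinct count is the number of keys stored in the state.
count_distinct <- function(state) length(ls(state, all.names = TRUE, sorted = FALSE))

# Two chunks processed independently, then merged.
chunk1 <- data.frame(tax = c(0.01, 0.02, 0.01), qty = c(5, 5, 5))
chunk2 <- data.frame(tax = c(0.02, 0.03), qty = c(5, 7))
s1 <- add_rows(new_state(), chunk1)
s2 <- add_rows(new_state(), chunk2)
total <- count_distinct(merge_states(s1, s2))
```

Here the combination (0.02, 5) appears in both chunks and is stored only once after the merge, so total counts it a single time.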

Value

A waypoint containing a single row and a single column, whose name is given by outputs.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights, LLC.

See Also

BloomFilter for a similar GLA.

Examples

## result is equal to the total number of tuples, since there are no repetitions
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = c(l_tax, l_quantity, l_partkey))
result <- as.data.frame(agg)

## result is equal to the number of possible values for l_partkey as given
## in the TPC-H specification
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = l_partkey)
result <- as.data.frame(agg)

tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.