countDistinct: Count Distinct Elements
In tera-insights/gtBase: R Interface for the Grokit System


Counts the number of distinct combinations for the given attributes.

CountDistinct(data, inputs = AUTO, outputs = count)

`data`	an object of class `"data"`.
`inputs`	the attributes of `data` on which to perform the GLA.
`outputs`	the desired column name of the result.

This GLA counts the number of distinct combinations of the given inputs by fully hashing those combinations. As such, it requires O(k) space, where k is the number of distinct combinations. The run time is O(n + k), where n is the number of rows in `data`; the second term comes from merging hashes between different states. A large number of distinct values therefore causes a significant slowdown, and BloomFilter is recommended for such queries.
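The hashing and merging described above can be illustrated with a minimal base R sketch. It is not the Grokit implementation: the helpers `new_state`, `add_rows`, `merge_states` and `count_distinct` are hypothetical names introduced here, and an environment stands in for the hash of distinct combinations.

```r
# Hypothetical helpers, not part of gtBase. A partial state is an environment
# used as a hash set, keyed by the pasted-together input values of each row.
new_state <- function() new.env(hash = TRUE)

add_rows <- function(state, df) {
  # Build one key per row; "\r" is an assumed separator absent from the data.
  keys <- do.call(paste, c(unname(as.list(df)), sep = "\r"))
  for (key in keys) assign(key, TRUE, envir = state)
  state
}

merge_states <- function(a, b) {
  # Merging visits every key held in b, which is where the O(k) term arises.
  for (key in ls(b, all.names = TRUE)) assign(key, TRUE, envir = a)
  a
}

count_distinct <- function(state) length(ls(state, all.names = TRUE))

rows <- data.frame(x = c(1, 1, 2, 2, 3), y = c("a", "a", "b", "c", "a"))

# Two overlapping chunks play the role of two states built independently.
left  <- add_rows(new_state(), rows[1:3, ])
right <- add_rows(new_state(), rows[3:5, ])

total <- count_distinct(merge_states(left, right))  # 4 distinct (x, y) pairs
```

Each state holds one entry per distinct combination it has seen, so memory grows with k rather than with n, and every merge must touch all keys of the state being folded in.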

An object of class `"data"` with exactly one element. Upon conversion to a data frame, it will contain a single row.

In the case of inputs = AUTO, all attributes of the data are used.
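As a rough base R analogue of these semantics, the sketch below runs on an ordinary in-memory data frame rather than a Grokit `"data"` object. The function `count_distinct_df` and the `toy` data are hypothetical and only illustrate the single-row result and the use of all attributes when no inputs are given.

```r
# Base R analogue only; CountDistinct itself is evaluated by Grokit.
count_distinct_df <- function(df, inputs = names(df), output = "count") {
  # Defaulting `inputs` to every column mirrors the documented AUTO behaviour.
  n <- nrow(unique(df[, inputs, drop = FALSE]))
  # Return a single-row data frame whose only column is named by `output`.
  setNames(data.frame(n), output)
}

toy <- data.frame(part = c(1, 1, 2, 2), tax = c(0.1, 0.1, 0.1, 0.2))

all_cols <- count_distinct_df(toy)  # distinct (part, tax) pairs: 3
one_col  <- count_distinct_df(toy, inputs = "part", output = "n_parts")  # 2
```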

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC

BloomFilter for a similarly functioning GLA.

## result is equal to the total number of tuples, as there are no repetitions
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = c(l_tax, l_quantity, l_partkey))
result <- as.data.frame(agg)

## result is equal to the number of possible values for l_partkey as given
## in the specifications of TPC-H
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = l_partkey)
result <- as.data.frame(agg)