countDistinct: Count Distinct Elements

Description

Counts the number of distinct combinations for the given attributes.

Usage

CountDistinct(data, inputs = AUTO, outputs = count)

Arguments

data

an object of class "data".

inputs

which attributes of data to perform the GLA on.

outputs

the desired column name of the result.

Details

This GLA counts the number of distinct combinations of the given inputs by hashing every distinct combination in full. As such, it requires O(k) space, where k is the number of distinct combinations. The run time is O(n + k), where n is the number of rows in data. The second term comes from merging the hashes held by different states. Because of this merge cost, a large number of distinct values leads to a significant slowdown; BloomFilter is recommended for such queries.
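The cost model above can be illustrated with a minimal base R sketch. It is not the gtBase implementation: the helpers new_state, add_chunk, merge_states and count_distinct are hypothetical names introduced here, and an R environment stands in for the hash set that each state would hold. Adding rows touches each row once (the n term), and merging walks the distinct keys of the other state (the k term).

```r
# Hypothetical helpers for illustration only; none of these are part of gtBase.
# Each "state" is an environment used as a hash set of row keys.
new_state <- function() new.env(hash = TRUE, parent = emptyenv())

# Add every row of a chunk to a state: one hash insert per row.
# Simplification: rows are keyed by pasting their values together, so values
# that themselves contain "\r" could collide.
add_chunk <- function(state, chunk) {
  keys <- do.call(paste, c(list("k"), unname(as.list(chunk)), sep = "\r"))
  for (key in keys) assign(key, TRUE, envir = state)
  invisible(state)
}

# Merge one state into another: one hash insert per distinct key in `from`.
merge_states <- function(into, from) {
  for (key in ls(from, all.names = TRUE)) assign(key, TRUE, envir = into)
  invisible(into)
}

# The answer is the number of keys held by the final state.
count_distinct <- function(state) length(ls(state, all.names = TRUE))

# Two chunks of made-up data, processed independently and then merged.
chunk1 <- data.frame(tax = c(0.1, 0.1, 0.2), qty = c(1, 1, 2))
chunk2 <- data.frame(tax = c(0.2, 0.3), qty = c(2, 3))

s1 <- add_chunk(new_state(), chunk1)
s2 <- add_chunk(new_state(), chunk2)
total <- count_distinct(merge_states(s1, s2))
print(total)
```

Each state stores every distinct key it has seen, which is why space grows with k and why merging becomes expensive when most rows are distinct.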

Value

An object of class "data" with exactly one row. Upon conversion to a data frame, it will contain a single row.

AUTO

In the case of inputs = AUTO, all attributes of the data are used.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights LLC

See Also

BloomFilter for a similarly functioning GLA.

Examples

## result is equal to the total number of tuples, no repetitions
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = c(l_tax, l_quantity, l_partkey))
result <- as.data.frame(agg)

## result is equal to the number of possible values for l_partkey as given
## in the specifications of TPC-H
data <- Read(lineitem100g)
agg <- CountDistinct(data, inputs = l_partkey)
result <- as.data.frame(agg)
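
For a quick sanity check without a loaded database, the same semantics can be approximated in base R on a small in-memory data frame. This is only a sketch under stated assumptions: the toy table and its values are made up, count_distinct_df is a hypothetical helper rather than a gtBase function, and defaulting inputs to all columns merely mimics the documented behaviour of inputs = AUTO.

```r
# Toy stand-in for a lineitem-like table; the values are invented.
toy <- data.frame(
  l_partkey  = c(1, 1, 2, 2, 3),
  l_quantity = c(5, 5, 5, 7, 7),
  l_tax      = c(0.1, 0.1, 0.1, 0.1, 0.2)
)

# Hypothetical helper: count distinct combinations of the chosen columns and
# return a one-row data frame with a single column named `count`.
# By default all columns are used, mimicking inputs = AUTO.
count_distinct_df <- function(df, inputs = names(df)) {
  data.frame(count = nrow(unique(df[, inputs, drop = FALSE])))
}

all_cols <- count_distinct_df(toy)
one_col  <- count_distinct_df(toy, inputs = "l_partkey")
print(all_cols)
print(one_col)
```

Unlike CountDistinct, this helper takes column names as character strings and holds the whole table in memory, so it is suitable only for small checks.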

tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.