univariate: Univariate Statistics


Description

These GLAs compute various univariate statistics separately for each input.

Usage

Sum(data, inputs, outputs)

Mean(data, inputs = AUTO, outputs = AUTO)

Min(data, inputs = AUTO, outputs = AUTO)

Max(data, inputs = AUTO, outputs = AUTO)

Median(data, inputs = AUTO, outputs = result, number.bins = 1000,
  sort.threshold = 1000)

Arguments

data

A waypoint.

outputs

The usual way to specify the outputs. If both outputs and names for the inputs are given, a warning is issued and outputs takes precedence.

number.bins

The number of bins to use in the binning algorithm.

sort.threshold

The maximum number of items on which to manually sort.

inputs

A named list of expressions, with the names being used as the corresponding outputs. These expressions are outputted in addition to those used to specify the extremities.

If no name is given and the corresponding expression is simply an attribute, then that attribute is used as the name. Otherwise an error is thrown, as there is no reason to include an extra input if the corresponding output column cannot be referenced later.

Details

The result of each GLA is a waypoint with one column per input and a single row whose value is the specified univariate statistic for the corresponding expression.

With the exception of finding the median, all of these aggregates are fairly straightforward, require O(k) space, and run in O(n \cdot k) time, where k is the number of inputs and n is the number of tuples.
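The single-pass behaviour described above can be illustrated with a small sketch. This is plain Python written for illustration only, not the gtBase implementation; the function name univariate_stats and the row layout (each tuple holding k numeric values) are assumptions. It keeps one running value per input and per statistic, which is where the O(k) space and O(n \cdot k) time come from.

```python
from math import inf


def univariate_stats(rows, k):
    """Compute sum, mean, min and max per column in one pass.

    Hypothetical helper: `rows` is any iterable of length-k numeric tuples.
    Only O(k) running state is kept, independent of the number of rows.
    """
    sums = [0.0] * k
    mins = [inf] * k
    maxs = [-inf] * k
    n = 0
    for row in rows:
        n += 1
        for j, x in enumerate(row):
            sums[j] += x
            if x < mins[j]:
                mins[j] = x
            if x > maxs[j]:
                maxs[j] = x
    if n == 0:
        raise ValueError("at least one tuple is required")
    means = [s / n for s in sums]
    # One value per input for each statistic, mirroring the single-row result.
    return {"sum": sums, "mean": means, "min": mins, "max": maxs}


rows = [(1, 10.0), (2, 20.0), (3, 60.0)]
stats = univariate_stats(rows, 2)
```

Each entry of stats holds one value per input, analogous to the single-row waypoint with one column per input.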

The median algorithm relies on an iterative binning algorithm, based on the Tibshirani paper. This algorithm takes two parameters: the number of bins to use (b) and the threshold at which to sort (t). During the first iteration, the range of the input is found. This interval is then split into b equal parts. Each input value is then placed into its bin, and the bin that must contain the median is sub-divided into b equal parts. This recursive sub-division continues until fewer than t elements are in the bin that contains the median. These elements are then sorted and the median is output. As such, this algorithm requires O(k \cdot b) space and runs in O(k \cdot (n \cdot \log_b n + t \log t)) time.
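The binning procedure can be sketched for a single input as follows. This is an illustrative Python sketch under stated assumptions, not the gtBase code: the names binned_select and binned_median are hypothetical, the data is an in-memory list that is re-scanned on every pass, and the selected bin is narrowed to the smallest and largest values it actually contains so that membership tests in the next pass are exact. The even-length convention (averaging the two middle values) is also an assumption, since the documentation does not specify it.

```python
def binned_select(values, rank, number_bins=1000, sort_threshold=1000):
    """Return the value at 0-based position `rank` of sorted(values)."""
    if number_bins < 2:
        raise ValueError("number_bins must be at least 2")
    lo, hi = min(values), max(values)  # first pass: range of the input
    below = 0                          # count of values strictly less than lo
    inside = len(values)               # count of values in [lo, hi]
    # Sub-divide until fewer than sort_threshold values remain in the bin.
    while inside >= sort_threshold and lo < hi:
        counts = [0] * number_bins
        bin_min = [None] * number_bins
        bin_max = [None] * number_bins
        span = hi - lo
        for v in values:
            if lo <= v <= hi:
                i = min(int((v - lo) / span * number_bins), number_bins - 1)
                counts[i] += 1
                if bin_min[i] is None or v < bin_min[i]:
                    bin_min[i] = v
                if bin_max[i] is None or v > bin_max[i]:
                    bin_max[i] = v
        # Walk the bins in order to find the one holding the target rank.
        for i, c in enumerate(counts):
            if rank < below + c:
                lo, hi, inside = bin_min[i], bin_max[i], c
                break
            below += c
    # Few enough candidates (or all equal): sort them and pick directly.
    candidates = sorted(v for v in values if lo <= v <= hi)
    return candidates[rank - below]


def binned_median(values, number_bins=1000, sort_threshold=1000):
    """Median via binned selection; averages the middle pair if n is even."""
    n = len(values)
    if n == 0:
        raise ValueError("median of empty input")
    upper = binned_select(values, n // 2, number_bins, sort_threshold)
    if n % 2 == 1:
        return upper
    lower = binned_select(values, n // 2 - 1, number_bins, sort_threshold)
    return (lower + upper) / 2


# A permutation of 0..100, so the median is 50.
data = [(i * 37) % 101 for i in range(101)]
result = binned_median(data, number_bins=4, sort_threshold=5)
```

Small values of number_bins and sort_threshold are used here only to force several sub-division passes on a tiny input; with the defaults of 1000, inputs this small would be sorted directly.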

Value

A waypoint with a single row. See ‘details’ for more information.

Author(s)

Jon Claus, <jonterainsights@gmail.com>, Tera Insights, LLC.

References

Tibshirani, http://www.stat.cmu.edu/~ryantibs/papers/median.pdf, for details regarding the binning algorithm.


tera-insights/gtBase documentation built on May 31, 2019, 8:35 a.m.