getTableSummary: Compute columnwise statistics on Aster table.

Description Usage Arguments Details Value See Also Examples

Description

For table compute column statistics in Aster and augment data frame structure obtained with sqlColumns with columns containing computed statistics.

Usage

1
2
3
getTableSummary(channel, tableName, include = NULL, except = NULL,
  modeValue = FALSE, percentiles = c(5, 10, 25, 50, 75, 90, 95, 100),
  where = NULL, mock = FALSE, parallel = FALSE)

Arguments

channel

object as returned by odbcConnect.

tableName

name of the table in Aster.

include

a vector of column names to include. Output never contains attributes other than in the list.

except

a vector of column names to exclude. Output never contains attributes from the list.

modeValue

logical indicates if mode values should be computed. Default is FALSE.

percentiles

list of percentiles (integers between 0 and 100) to collect (always collects 25th and 75th for IQR calculation). There is no penalty in specifying more percentiles as they get calculated in a single call for each column - no matter how many different values are requested. When FALSE then percentiles calculations are skipped and result wil have no percentile and IQR columns.

where

SQL WHERE clause limiting data from the table (use SQL as if in WHERE clause but omit keyword WHERE).

mock

logical: if TRUE returns pre-computed table statistics for tables pitching or batting, only.

parallel

logical: enable parallel calls to Aster database. This option requires parallel backend enabled and registered (see in examples). Parallel execution requires ODBC channel obtained without explicit password: either with odbcConnect(dsn) or odbcDriverConnect calls, but not with odbcConnect(dsn, user, password).

Details

Computes columns statistics for all or specified table columns and adds them to the data frame with basic ODBC table metadata obtained with sqlColumns. Computed statistics include counts of all, non-null, distinct values; statistical summaries of maximum, minimum, mean, standard deviation, median (50th percentile), mode (optional), interquartile range, and desired percentiles. Each computed statistic adds a column to ODBC metadata data frame.

Value

data frame returned by sqlColumns with additional columns:

total_count

total row count - the same for each table column

distinct_count

distinct values count

not_null_count

not null count

minimum

minimum value (numerical data types only)

maximum

maximum value (numerical data types only)

average

mean (numerical data types only)

deviation

standard deviation (numerical data types only)

percentiles

defaults: 0,5,10,25,50,75,90,95,100. Always adds percentiles 25, 50 (median), 75

IQR

interquartile range is the 1st Quartile subtracted from the 3rd Quartile

minimum_str

minimum string value (character data types only)

maximum_str

maximum string value (character data types only)

mode

mode value (optional)

mode_count

mode count (optional)

See Also

sqlColumns

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

pitchingInfo = getTableSummary(channel=conn, 'pitching_enh')
# list all table columns
pitchingInfo$COLUMN_NAME

# compute statistics on subset of baseball data after 1999
battingInfo = getTableSummary(channel=conn, 'batting_enh', 
                              where='yearid between 2000 and 2013')
                              
# compute statistics for certain columns including each percentile from 1 to 99
pitchingInfo = getTableSummary(channel=conn, 'pitching_enh',
                              include=c('h', 'er', 'hr', 'bb', 'so'),
                              percentiles=seq(1,99))
# list data frame column names to see all computed statistics
names(pitchingInfo)
                             
# compute statitics on all numeric columns except certain columns
teamInfo = getTableSummary(channel=conn, 'teams_enh', 
                   include=getNumericColumns(sqlColumns(conn, 'teams_enh')),
                   except=c('lgid', 'teamid', 'playerid', 'yearid', 'decadeid'))
}

teradata-aster-field/toaster documentation built on May 31, 2019, 8:36 a.m.