computeCorrelations: Compute correlation between pairs of columns.

Description Usage Arguments Value See Also Examples

View source: R/computeCorrelations.R

Description

Compute global correlation between all pairs of numeric columns in table. Result includes all pairwise combinations of numeric columns in the table, with optionally limiting columns to those in the parameter include or/and excluding columns defined by parameter except. Limit computation on the table subset defined with where. Use output='matrix' to produce results in matrix format (compatible with function cor).

Usage

1
2
3
computeCorrelations(channel, tableName, tableInfo, include = NULL,
  except = NULL, where = NULL, by = NULL, output = c("data.frame",
  "matrix"), test = FALSE)

Arguments

channel

connection object as returned by odbcConnect

tableName

database table name

tableInfo

pre-built summary of data to use (must have with test=TRUE)

include

a vector of column names to include. Output never contains attributes other than in the list. When missing all columns from tableInfo included.

except

a vector of column names to exclude. Output never contains attributes from the list.

where

specifies criteria to satisfy by the table rows before applying computation. The creteria are expressed in the form of SQL predicates (inside WHERE clause).

by

vector of column names to group results by one or more table columns for faceting or alike (optional).

output

Default output is a data frame of column pairs with correlation coefficient (melt format). To return correlation matrix compatible with function cor use 'matrix' .

test

logical: if TRUE show what would be done, only (similar to parameter test in RODBC functions like sqlQuery and sqlSave).

Value

data frame with columns:

Note that while number of correlations function computes is choose(N, 2), where N is number of table columns specified, resulting data frame contains twice as many rows by duplicating each correlation value with swaped column names (1st column to 2d and 2d to 1st positions). This makes resulting data frame symmetrical with respect to column order in pairs and is necessary to correctly visualize correlation matrix with createBubblechart.

See Also

createBubblechart and showData.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

cormat = computeCorrelations(channel=conn, "pitching_enh", sqlColumns(conn, "pitching_enh"), 
                             include = c('w','l','cg','sho','sv','ipouts','h','er','hr','bb',
                                         'so','baopp','era','whip','ktobb','fip'),
                             where = "decadeid = 2000", test=FALSE)
# remove duplicate correlation values (no symmetry)
cormat = cormat[cormat$metric1 < cormat$metric2, ]
createBubblechart(cormat, "metric1", "metric2", "value", label=NULL, fill="sign")

# Grouped by columns
cormatByLg = computeCorrelations(channel=conn, "pitching_enh", 
                                 include=c('w','sv','h','er','hr','bb','so'),
                                 by=c('lgid','decadeid'), 
                                 where = "decadeid >= 1990")
                                 
createBubblechart(cormatByLg, "metric1", "metric2", "value", 
                  label=NULL, fill="sign", facet=c('decadeid','lgid'), 
                  title="Correlations by Leagues and Decades",
                  defaultTheme = theme_wsj(), legendPosition = 'none')
}

toaster documentation built on May 30, 2017, 3:51 a.m.