computeSample: Randomly sample data from the table.

Description Usage Arguments Details Examples

View source: R/computeSample.R

Description

Draws a sample of rows from the table randomly. The function offers two sampling approaches and three stratum strategies. Sampling approaches by

Stratum strategies:

Usage

1
2
3
computeSample(channel, tableName, sampleFraction, sampleSize,
  conditionColumn = NULL, conditionStratum = NULL, conditionValues = NULL,
  include = NULL, except = NULL, where = NULL, ..., test = FALSE)

Arguments

channel

connection object as returned by odbcConnect

tableName

table name

sampleFraction

one or more sample fractions to use in the sampling of data. Multipe sampling fractions are applicable only in combination with the arguments conditionColumn and conditionValues when present. In this case number of fractions in sampleFraction and number of values in conditionValues must be the same.

sampleSize

total sample size (applies only when sampleFraction is missing). This may too be a vector of total values when used in combination with the arguments conditionColumn and conditionValues. In this case number of sizes in sampleSize and number of values in conditionValues must be the same.

conditionColumn

if you use this argument, you must also use the conditionValues argument. Either both are used, or neither is used. Values in a particular column conditionColumn are used as sampling conditions directly and its data type must be of a group-able type. Only those values listed in conditionValues are used for sampling with the rest ignored. Also, see conditionStratum.

conditionStratum

if you use this argument, you must also use the conditionValues argument. When defined it is used in place of conditionColumn. conditionStratum should define a SQL expression (usually using CASE function but not necessarily). Resulting sample data frame will contain a column named stratum just as if conditionColumn = 'stratum' was used. Arguments conditionColumn and conditionStratum are mutually exclusive: the former is ignored if both are defined.

conditionValues

see argument conditionColumn and conditionStratum.

include

a vector of column names to include. Output never contains attributes other than in the list.

except

a vector of column names to exclude. Output never contains attributes from the list.

where

specifies criteria to satisfy by the table rows before applying computation. The creteria are expressed in the form of SQL predicates (inside WHERE clause).

...

additional arguments to be passed to sqlQuery for more control over performance and data type conversion. By default, stringsAsFactors is set to FALSE.

test

logical: if TRUE show what would be done, only (similar to parameter test in RODBC functions like sqlQuery and sqlSave).

Details

The sampling can be applied to the entire table or can be refined with either conditionColumn or conditionStratum. In each case a subset of the table defined with where argument may apply too. The resulting stratum models are:

No columns returned as character and not excluded by as.is are converted to factors by default, i.e. stringsAsFactors = FALSE when calling sqlQuery if not specified when calling this function.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
if(interactive()){
# initialize connection to Lahman baseball database in Aster 
conn = odbcDriverConnect(connection="driver={Aster ODBC Driver};
                         server=<dbhost>;port=2406;database=<dbname>;uid=<user>;pwd=<pw>")

batters = computeSample(conn, "batting", sampleFraction=0.01)
dim(batters)

pitchersAL = computeSample(conn, "pitching", sampleSize=1000, where="lgid = 'AL'")
dim(ptichersAL)

battersByDecadesSingleSize = computeSample(conn, tableName = 'batting_enh', 
                                           sampleSize=1000, 
                                           conditionColumn = 'decadeid', 
                                           conditionValues = c(1990,2000,2010))
dim(battersByDecadesSingleSize)

battersByDecades = computeSample(conn, tableName = 'batting_enh',
                                 sampleFraction = c(0.01,0.01,0.02), 
                                 conditionColumn = 'decadeid', conditionValues = c(1990,2000,2010))
dim(battersByDecades)

battersByOddEvenYears = computeSample(channel=NULL, tableName = 'batting_enh',
                                      sampleFraction = c(0.01,0.02),
                                      include = c('decadeid','g','ab','r','h'),
                                      conditionStratum = "yearid % 2", 
                                      conditionValues = c('0','1'),
                                      where = "lgid = 'NL'")
dim(battersByOddEvenYears)

battersBeforeAfter1960 = computeSample(channel=NULL, tableName = 'batting_enh',
                                       sampleSize = c(200, 200), 
                                       conditionStratum = "CASE WHEN yearid <- 1960 THEN 'before'
                                                                ELSE 'after'
                                                          END", 
                                       conditionValues = c('before','after'))
dim(battersBeforeAfter1960)
}

toaster documentation built on May 30, 2017, 3:51 a.m.