Gets a stratified sample of data from Causata

Description

Extracts a stratified sample of data from Causata by splitting records into two strata and resampling each stratum at its own rate.

Usage

GetStratifiedSample(connect, query, stratification.variable, 
                    stratification.variable.name, 
                    stratification.value=0)

Arguments

connect

Causata connect object - used to resample at the stratified sampling rates.

query

Causata query object - used to resample at the stratified sampling rates. Note that the Limit must be defined.

stratification.variable

A vector of values on which to base the stratification.

stratification.variable.name

The name of the Causata variable that is used as the basis of stratification.

stratification.value

The value of stratification.variable that determines the stratum for a record: records whose value matches it form one stratum, and all other records form the other. Defaults to 0.

Details

This function gets a stratified sample of data from Causata. The population is split into two strata based on whether the stratification.variable value for a record matches stratification.value. Sampling rates are then calculated for the two strata, where the rate for the larger stratum, strata.A, is:

sample.rate.A = sqrt((# records in strata.B) / (# records in strata.A))

New queries are run to resample the Causata data at these sample rates.
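
The rate calculation described above can be sketched in plain R without a server connection. This is only an illustration of the formula, not the package's internal code: the helper name StratumSampleRates is hypothetical, and the rate of 1 for the smaller stratum is an assumption, since this page only states the rate for the larger stratum.

```r
# Hypothetical helper, not part of the Causata package.
# Splits a vector into two strata and applies the documented formula
# sample.rate.A = sqrt(n.B / n.A), where A is the larger stratum.
StratumSampleRates <- function(stratification.variable, stratification.value) {
  # TRUE where the record matches the stratification value
  in.stratum <- !is.na(stratification.variable) &
    stratification.variable == stratification.value
  n.match <- sum(in.stratum)
  n.other <- sum(!in.stratum)
  stopifnot(n.match > 0, n.other > 0)  # both strata must be non-empty

  n.A <- max(n.match, n.other)  # larger stratum
  n.B <- min(n.match, n.other)  # smaller stratum
  list(n.A = n.A,
       n.B = n.B,
       sample.rate.A = sqrt(n.B / n.A),
       sample.rate.B = 1)  # ASSUMPTION: smaller stratum is kept in full
}

# Made-up data: 400 non-purchasers and 100 purchasers
rates <- StratumSampleRates(c(rep("false", 400), rep("true", 100)), "true")
rates$sample.rate.A  # sqrt(100 / 400) = 0.5
```

With these made-up counts the larger stratum would be resampled at half its original rate, which narrows the imbalance between the strata without equalizing them.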

Value

Returns a list with two elements as follows:

df

A dataframe of sampled data containing all of the variables found in query.

weights

A vector of weight values. The weights are the inverse of the probability of selecting a record in the sample.
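
Because the weights are inverse selection probabilities, they can be used to correct estimates for the unequal sampling rates. The sketch below uses made-up data and base R only; the column names and the selection probabilities (0.5 and 1) are assumptions chosen for illustration, not values returned by GetStratifiedSample.

```r
# Made-up sample: 200 records drawn from the larger stratum with
# probability 0.5, and 100 records from the smaller stratum with
# probability 1 (both probabilities are assumptions for this sketch).
sampled <- data.frame(
  has.purchase   = c(rep(FALSE, 200), rep(TRUE, 100)),
  selection.prob = c(rep(0.5, 200), rep(1, 100))
)

# Weights are the inverse of the selection probability
weights <- 1 / sampled$selection.prob

# The unweighted proportion overstates the smaller stratum (100 / 300)
unweighted <- mean(sampled$has.purchase)

# The weighted proportion recovers the population share:
# (100 * 1) / (200 * 2 + 100 * 1) = 0.2
weighted <- weighted.mean(as.numeric(sampled$has.purchase), weights)
table(weights)
```

Any model or summary fitted to the sampled data frame would need the same weights to reflect the original population.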

Author(s)

Suzanne Weller <support@causata.com>

See Also

Connect, Query, Limit.

Examples

# create some variables to query for
variables <- c('customer-id', 'total-spend')

# create a stratified sample given an initial query
# The commands below are commented out since they require an actual server connection
#connection <- Connect(hostname="server.causata.com",
#  username="user@gmail.com", password="enw8Q!mN")
#query <- Query() + Limit(500)
#df <- GetData(connection, query)

# The commands below are commented out since they require an actual server connection
#sampled.data <- GetStratifiedSample(connection, query, 
#  df[['has.purchase__Next.30.Days']], 'has.purchase__Next.30.Days', "true")
#table(sampled.data$weights)
