A Story

If you are an experienced data scientist with a technical background, feel free to skim the earlier sections and start at Whizbang 3.0.

Whizbang 1.0: A prototype

Arnold is a data scientist at Ploopli. He develops a new program called Whizbang 1.0 that queries a database and creates a variety of tables in it. The Whizbang data pipeline has two steps:

whizbang.r

# Create a table with a random sample of users
query <- '
  CREATE TABLE arnold.wb_proto_user_sample AS (
    SELECT *
    FROM ploopli.user_hits
    WHERE hash(userid)%42 = 1
  )
'
ploopli_run_database_query(query)


# Create 3 features for each user: n_hits, min_datetime, max_datetime
query <- '
  CREATE TABLE arnold.wb_proto_user_sample_features AS (
    SELECT userid,
      count(userid) as n_hits,
      min(date_time) as min_datetime,
      max(date_time) as max_datetime
    FROM arnold.wb_proto_user_sample
  )
'
ploopli_run_database_query(query)

After testing his prototype, Arnold is fired up. Whizbang is going to take the world by storm. Arnold shares Whizbang with his boss, who takes a sip of coffee, looks at the code, and sprays coffee all over his VT-100 out of sheer amazement. His boss recommends Arnold work with Barb immediately to develop Whizbang 2.0.

Whizbang 2.0: Templated queries

Barb takes a look at the code and comes back almost immediately. She can't run the code because it has Arnold's dataset hard-coded in all of the queries.

It's clear that if Arnold and Barb are to collaborate on the development of Whizbang 2.0, they will need to separate parameters like tablenames and variables from the logic of their queries.

Arnold and Barb put their heads together and think:

"Wouldn't it be nice if we could each define our own configuration and then dynamically generate the queries when we run Whizbang?"

Arnold discovers the mustache templating language and realizes that he can use the whisker R package to template the queries in Whizbang. Now Whizbang contains templated queries like:

arnold_config.r

config <- data.frame(
    dataset_id = 'arnold'
)

whizbang.r

# Create a table with a random sample of users
query <- '
  CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample AS (
    SELECT *
    FROM ploopli.user_hits
    WHERE hash(userid)%42 = 1
  )
'
ploopli_run_database_query(whisker.render(query,config))


# Create 3 features for each user: n_hits, min_datetime, max_datetime
query <- '
  CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_features AS (
    SELECT userid,
      count(userid) as n_hits,
      min(date_time) as min_datetime,
      max(date_time) as max_datetime
    FROM {{{dataset_id}}}.wb_proto_user_sample
  )
'
ploopli_run_database_query(whisker.render(query,config))

When Arnold and Barb now run Whizbang 2.0, their tables happily live in their own datasets.
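
For example, Barb only needs her own configuration file. A hypothetical barb_config.r, mirroring arnold_config.r:

barb_config.r

# Barb's own dataset_id (hypothetical file, mirrors arnold_config.r)
config <- data.frame(
    dataset_id = 'barb'
)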

Code that separates configuration from logic is easy to share and means the code only changes if its logic changes.

Whizbang 3.0: Query-driven pipelines

Whizbang 2.0 is successfully deployed and starts making waves around the globe. Orders for VT-100s rise as bosses everywhere spray coffee in disbelief. Euphoric with the success of Whizbang 2.0, Arnold's boss calls a meeting:

Multiple teams at Ploopli need Whizbang to create daily random samples of users. Further, users are now assigned a region code. The number of region codes is currently 3, but will grow as Ploopli grows. Whizbang needs to create separate daily sample tables for each region.

Arnold and Barb put their heads together and think:

"Wouldn't it be nice if we could get region codes dynamically via a database query and then dynamically run Whizbang for each region on a particular date?"

Barb finds the condusco R package. With condusco, she creates a pipeline called create_daily_region_user_sample and runs it once for each region returned by a query.

whizbang/create_daily_region_user_sample.r

create_daily_region_user_sample <- function(config){
  # Create a table with a random sample of users
  query <- '
    CREATE TABLE {{{dataset_id}}}.wb_proto_user_sample_{{{region}}}_{{{date_suffix}}} AS (
      SELECT *
      FROM ploopli.user_hits
      WHERE hash(userid)%42 = 1
        AND region = "{{{region}}}"
        AND DATE_FORMAT(date_time, "%Y%m%d") = "{{{date_suffix}}}"
    )
  '
  ploopli_run_database_query(whisker.render(query,config))
}

barb_run.r

run_pipeline_dbi(
  create_daily_region_user_sample,
  #invocation query
  '
    SELECT "barb" as dataset_id,
      DATE_FORMAT(CURRENT_DATE(), "%Y%m%d") as date_suffix,
      region
    FROM (
      SELECT region
      FROM ploopli.user_hits
      GROUP BY 1
    )
  ',
  dbi_connection_to_ploopli
)

The run_pipeline_dbi function runs the query provided to it, called the invocation query, and retrieves the result. For each row of the result, it calls the create_daily_region_user_sample pipeline, passing that row to the pipeline as a one-row dataframe.
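
Conceptually, the runner is just a loop over the result of the invocation query. Here is a minimal sketch of that behavior using DBI (not condusco's actual implementation):

# A minimal sketch of an invocation-query runner; condusco's real
# run_pipeline_dbi also handles details like converting JSON fields
run_pipeline_sketch <- function(pipeline, invocation_query, con){
  rows <- DBI::dbGetQuery(con, invocation_query)  # run the invocation query
  for (i in seq_len(nrow(rows))) {
    # pass each row of the result to the pipeline as a one-row dataframe
    pipeline(rows[i, , drop = FALSE])
  }
}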

When Arnold and Barb are ready to put the pipeline into production, they don't have to change the pipeline definition. They simply change their invocation query, setting the dataset_id to 'prod'. When new regions are added, the pipeline automatically creates new tables for those regions.
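
For example, a hypothetical prod_run.r differs from barb_run.r only in its invocation query:

prod_run.r

run_pipeline_dbi(
  create_daily_region_user_sample,
  #invocation query: identical to Barb's, except dataset_id is now "prod"
  '
    SELECT "prod" as dataset_id,
      DATE_FORMAT(CURRENT_DATE(), "%Y%m%d") as date_suffix,
      region
    FROM (
      SELECT region
      FROM ploopli.user_hits
      GROUP BY 1
    )
  ',
  dbi_connection_to_ploopli
)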

With condusco, Arnold and Barb can iteratively execute any pipeline function, not just queries.

"condusco lets you iteratively run the same process for each of the results of a query"

Whizbang 3.1: Query-driven query generation

Having discovered condusco, Barb and Arnold decide to expand their feature-generation pipeline. Ploopli users visit dozens of categories of Ploopli products, and Arnold has an idea for dynamically generating features: for each user, count the number of hits in each of the top 10 product categories.

Arnold can accomplish this with condusco and whisker. condusco converts any field of the invocation query's result that contains valid JSON into an object before passing it to the pipeline, and whisker can expand repeated sections of that object into the templated query, so Arnold can dynamically generate a set of features for each user. Let's take a look:

whizbang/create_top_n_features.r

create_top_n_features <- function(config){
  # Create one column per top_features entry, counting hits where
  # the given fieldname equals that entry's val
  query <- '
    CREATE TABLE {{{dataset_id}}}.user_features_top_{{{n}}}_{{{fieldname}}} AS (
      SELECT
        {{#top_features}}
          SUM(CASE WHEN {{{fieldname}}} = "{{{val}}}" THEN 1 ELSE 0 END) as n_hits_{{{val}}},
        {{/top_features}}
        userid
      FROM {{{source_table}}}
      GROUP BY userid
    );
  '
  ploopli_run_database_query(whisker.render(query,config))
}

arnold_run.r

run_pipeline_dbi(
  create_top_n_features,
  #invocation query
  '
    SELECT
      "arnold" as dataset_id,
      "product_category" as fieldname,
      10 as n,
      "[" || GROUP_CONCAT("{ ""val"": """ || product_category ||  """ }") || "]" AS top_features,
      "arnold.user_sample" as source_table
    FROM (
      SELECT
        product_category,
        count(product_category) as n_hits
      FROM ploopli.user_hits
      GROUP BY 1
      ORDER BY 2 DESC
      LIMIT 10
    )
    GROUP BY 1,2,3
  ',
  dbi_connection_to_ploopli
)

In his invocation query, Arnold creates a JSON string called top_features representing an array of 10 objects, each with a single attribute val holding one product_category. When he runs the pipeline, condusco detects that top_features is valid JSON and converts it into an object before passing it to create_top_n_features, where whisker.render() expands the object into the templated query.

"condusco automatically converts JSON strings into objects for use by the pipeline"

Bonus

Since Arnold created the create_top_n_features pipeline with a fieldname option, it can now be used to create the top n features for any field in the source dataset, not only product_category. Multiple users can now call the create_top_n_features pipeline for similar feature generation tasks without having to reinvent the wheel.
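
For instance, assuming ploopli.user_hits also had a device_type column (a hypothetical field), Barb could reuse the pipeline unchanged in a hypothetical barb_device_run.r:

barb_device_run.r

run_pipeline_dbi(
  create_top_n_features,
  #invocation query: same shape as arnold_run.r, but device_type is a hypothetical field
  '
    SELECT
      "barb" as dataset_id,
      "device_type" as fieldname,
      10 as n,
      "[" || GROUP_CONCAT("{ ""val"": """ || device_type ||  """ }") || "]" AS top_features,
      "barb.user_sample" as source_table
    FROM (
      SELECT
        device_type,
        count(device_type) as n_hits
      FROM ploopli.user_hits
      GROUP BY 1
      ORDER BY 2 DESC
      LIMIT 10
    )
    GROUP BY 1,2,3
  ',
  dbi_connection_to_ploopli
)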

Epilogue

Arnold and Barb started out with a prototype designed for one user and gradually added functionality: they separated configuration from logic with templated queries, drove their pipelines from invocation queries, and used query results to dynamically generate new queries.

Many coffee-splattered VT-100s later, Arnold and Barb are able to consider releasing a library of generalized pipelines that can be used across Ploopli.


