
R Project Pitch / Analytical Projects Initiative (API) 2018-19

Reframe as "a prototype/pilot analysis development workflow using R, with an application to using machine learning to match two datasets that we need for research and data development." R and its tools made this possible. (And a decent amount of R expertise already exists at StatCan for various reasons: previous jobs, school, side projects, general interest.)

[Cite the opinionated analysis development workflow from Hilary Parker (@hspter), especially these slides and paper.]

Reproducible and auditable
  • Executable analysis scripts
  • Defined dependencies
  • Watchers for changed code and data
  • Version control (individual)
  • Code review

Accurate
  • Modular, tested code
  • Assertive testing of data, assumptions and results
  • Code review

Collaborative
  • Version control (collaborative)
  • Issue tracking

From Jesse Maegan: "I’ve found it to be most helpful to pilot a workflow or initiative with clear indicators of success as well as scalability and sustainability. Working with a smaller group before an org-wide roll-out helps you suss out (and address) any potential problems, demonstrate success, and get more organizational buy-in.

Ultimately you want people to want to adopt this workflow, not be told that they have to."

Outline

  1. Underlying philosophy: transparency, reproducibility, modernization, discoverability, knowledge transfer
  2. Business objective: match two datasets (a non-standard application of machine learning and matching, also known as record linkage, entity resolution or document retrieval), then do the analysis.
  3. Deliverable: an R package that encapsulates data development, analysis and evaluation for this project, including documentation and code (see the package-setup sketch after this list).
  4. Timeline: Soon-ish.
  5. Budget: mainly just me.
  6. Stakeholders: see below.
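
A minimal sketch of setting up the proposed package skeleton, using the usethis and devtools packages; the package and vignette names below are placeholders, not final decisions.

```r
# Hypothetical package setup for the deliverable
usethis::create_package("matchtools")
usethis::use_git()                       # version control from the start
usethis::use_testthat()                  # unit test infrastructure
usethis::use_vignette("tcod-br-match")   # long-form documentation of the match
devtools::document()                     # generate function documentation
devtools::check()                        # run tests and R CMD check
```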

More:

Scope? Deliverable? Timeline? Budget? Stakeholders? (IT, users, developers, us, data people, and people who can use the code in the future.)

Scope: a single package, a single project, but also a very visible example of openness and reproducibility that could provide a template for a broader workflow.

Introduction

Match the Business Register (BR) to the Trucking Commodity Origin and Destination (TCOD) survey. The first goal is to get it to work. After that, the focus is on philosophy: transparency, reproducibility, modernization, discoverability and knowledge transfer, so the lessons, skills and code can be applied to more than just one project.

Research objective

We have shipment records with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the BR so we can infer supply chains, transportation logistics, intra-firm vs. inter-firm trade, and more. The basic obstacle is that a large subset of shipments on the TCOD have no shipper or receiver names, which means matching must rely on addresses alone. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD that has both names and addresses, then apply that model to predict matches for the out-of-sample subset of the TCOD without names.
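
As a rough illustration of the evaluation step, a minimal sketch follows. It assumes a hypothetical table of candidate TCOD-BR pairs with address-similarity features and a name-based match label; the column names, features and the simple logistic-regression model are placeholders, not the final method.

```r
library(dplyr)

# candidate_pairs: one row per (TCOD shipment, BR firm) candidate pair, with
# address-similarity features and, where names exist, a 0/1 name-based label
labelled   <- filter(candidate_pairs, has_names)
unlabelled <- filter(candidate_pairs, !has_names)

# Fit an address-only match model on the labelled subset
fit <- glm(is_match ~ postal_code_exact + street_similarity + city_exact,
           data = labelled, family = binomial())

# Score the out-of-sample subset that has no shipper or receiver names
unlabelled$match_prob <- predict(fit, newdata = unlabelled, type = "response")
```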

The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and they inform the redevelopment of transportation surveys.

Data sources

The BR and the TCOD, along with postal code files, geography files and other supporting data.

Methodology

Name and address standardization; machine learning; and matching, using both name-and-address and address-only models.
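
For illustration only, here is a minimal sketch of the kind of address standardization assumed above; the specific rules and abbreviations are placeholders, not the production cleaning rules.

```r
# Hypothetical address standardization: uppercase, strip punctuation,
# normalize a few common abbreviations, collapse whitespace
standardize_address <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\bSTREET\\b", "ST", x)
  x <- gsub("\\bAVENUE\\b", "AVE", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

standardize_address("123 Main Street, Suite 4")
#> [1] "123 MAIN ST SUITE 4"
```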

Results

A dataset of matches between the TCOD and the BR, which serves as an input into the research and survey projects described above.

Non-research objective (aka process improvement)

One metric I use to judge the success of a finished project: how much have I improved the tools and processes I used to do it? In other words, if I started the project over from scratch, how much faster could I finish it? Most research and analytics projects I've been involved in have resulted in little or no process improvement, and I would guess our research workflows have not changed much in many years. [The usual cycle: ask around to find out where the data is, request it, find little or no documentation, struggle through strange variable definitions (imgeocode in the IPTF, anyone?), concordances, classifications and missing variables, then make the best of it by writing code that does one specific thing and that no one can ever find again. Try to get extra Stata packages installed, find they don't work, do the same for R, shuffle data back and forth between SAS, .csv and .dta, and so on.]

This project is a prototype of a workflow that aims to improve the process of data development and analysis, mainly through transparency and reproducibility.

A list of the important non-research objectives:

  1. improve code and documentation; knowing you'll re-use code helps you think harder about future users (including and especially yourself)
    • write clear and concise code
    • write clear documentation at the time you write the code
    • write code that can be adapted and used in similar situations
    • write code that follows a common style guide to improve readability and harmonization with existing code
    • write code with demos and unit tests to validate each step of the analysis (see the testing sketch after this list)
    • clearly organize project, code, documentation and data
  2. improve discoverability and knowledge transfer
    • make code, documentation, classifications and concordances (not microdata) publicly available
    • easy for users to discover similar methods, data, code, rules, and overall data development strategies
    • aids user/developer understanding (they can see the details of the analysis, concordance rules, filtering rules)
  3. improve the code improvement process
    • make it easy to collaborate with other developers to fix bugs and incorporate improvements
    • once a bug is fixed, make it easy for users to update across all existing projects (instead of manually searching / finding bugs from copy/pasted code snippets someone emailed to you)
    • make it easy for others to test/deploy code with different situations
    • automate the development/analysis update process (add new years of data, apply to different datasets)
  4. reproducibility
    • anyone should be able to test code and analysis
    • anyone with access to microdata should be able to reproduce results, or see how minor changes could affect analysis
    • anyone should be able to apply the code to different situations
    • anyone should be able to look at, understand, re-run, improve, adapt, and/or take over a project from the original developer
  5. (flexibility?)
    • shouldn't be tied to one language, etc.?
    • although I propose R for this, I want to instill a focus on openness and reproducibility that could be translated to whichever language/approach is best for any individual project.
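
To make the testing point in item 1 concrete, here is a minimal sketch using the testthat package; the function, file path and expectations are hypothetical placeholders rather than the project's actual tests.

```r
library(testthat)

test_that("standardize_address() uppercases and abbreviates", {
  expect_equal(standardize_address("123 Main Street"), "123 MAIN ST")
})

test_that("match output is one row per shipment with valid probabilities", {
  matches <- readRDS("output/matches.rds")  # placeholder path
  expect_false(any(duplicated(matches$shipment_id)))
  expect_true(all(matches$match_prob >= 0 & matches$match_prob <= 1))
})
```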

(Tangential benefits: employee professional development, learning language- and programming-specific skills, making it easier to learn from others' code, and making it easier to showcase accomplishments publicly.)

Background and general motivation for these principles

This data development and analysis philosophy isn't specific to one language or field; most of it could be accomplished in almost any language. Many businesses, academics and governments already incorporate some or all of these points, and modern tech companies take it a step further: Airbnb's internal rbnb package for standardizing data science, Twitter's equivalents, Google's open-source contributions, and so on.

The goal is to emulate these organizations' focus on openness, which arguably matters even more for a government agency, given the increasing emphasis on openness and transparency and the adoption of digital services across departments (see CDS-SNC). Things should be open: classifications, concordances, and anything else that isn't proprietary should at least be available and discoverable internally (e.g., although we can't release the postal code conversion file publicly, we should have easier access to the file and its documentation, or give ourselves a data workflow to get a historically accurate version for our own purposes).

For this proposal

For this project, I propose to use R, RStudio and RMarkdown (discussed below), together with the version control, code review and testing practices outlined above.

There may be opportunities to develop an internal package repository like CRAN (https://cran.r-project.org/) using drat or similar.
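
As a rough sketch of how that could work (drat is a real package, but the package name and repository paths below are placeholders for illustration):

```r
# Publish a built package to an internal, CRAN-like repository with drat
drat::insertPackage("matchtools_0.1.0.tar.gz",
                    repodir = "N:/packages/statcan-drat")

# Colleagues then install and update it like any CRAN package
install.packages("matchtools", repos = "file:///N:/packages/statcan-drat")
```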

Why R?

R is a free, open-source statistical programming language with a large ecosystem of packages for data development, analysis and reporting.

Why RStudio?

RStudio is a free, open-source integrated development environment (IDE) for R.

Why RMarkdown?

RMarkdown combines narrative text, code and results in a single document, so reports and analyses can be regenerated directly from the source.
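
For example (the file name is a placeholder), the whole analysis report can be rebuilt from code and data with a single call:

```r
# Re-run the analysis and regenerate the report from source
rmarkdown::render("analysis/tcod-br-match.Rmd")
```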

Why these together?

They are built to work together. RStudio's chief scientist, Hadley Wickham, led development of many of the most popular R packages (ggplot2, dplyr and the rest of the tidyverse).

He has also written very helpful books and tutorials for beginner and advanced R users (e.g., R for Data Science, Advanced R).

Integrated, easy to learn, easy to advance.


