r-api-pitch-2.md

R Analysis Workflow Pilot

A pilot of a data development and analysis workflow using R, RStudio and RMarkdown, applied to a project that uses machine learning to match two datasets. The proposal has two parts: the data development project itself (described in the section Statistical methodology), and the workflow supporting that project (described in the section Workflow and implementation).

Project summary

Deliverables:

  1. a matched dataset of shipments and firms
  2. reproducible workflow, code and documentation
  3. recommendations and lessons from implementing the workflow

Schedule:

2018-2019 Proposal timeline: April 1st-June 30th, including some data development that's already underway.

Budget:

| Group and level | Name | Home div. | # days | $ |
|-----------------|-----------|---------------|--------|------|
| EC-04 | Jesse Tweedle | EAD | 45 | (in kind) |
| EC-?? | Claudiu Motoc | EAD | 3 | (in kind) |
| ??-?? | Data Science Working Grp | BSMD | 3 | (in kind) |
| EC-?? | Mark Brown | EAD | 3 | (in kind) |
| EC-?? | Sarah Swan | EAD | 3 | (in kind) |
| TOTAL | | | | ?? |

Addressing API Scoring Guide

Building analytical capacity:

Addressing User Information Needs:

Experimentation and the application of leading-edge methods:

Statistical methodology

Introduction

We have shipments with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the BR, so that we can infer supply chains, transportation logistics information, intra-firm vs. inter-firm trade, and more. The basic obstacle is that a large subset of shipments on the TCOD do not have shipper or receiver names, so I can rely only on addresses to match them. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD with both names and addresses, and then apply that model to predict matches for the out-of-sample subset of the TCOD without names.

The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and inform the re-development of transportation surveys.

Data sources

Methodology

We combine a standard matching methodology with a non-standard out-of-sample match prediction method. The steps are:

A. Standard matching methodology:

  1. BR, TCOD: process and filter each dataset accordingly.
  2. Create subsets to match: T_na (TCOD records with names and addresses), T_a (TCOD records with addresses only), and BR.
  3. Fuzzy blocking on each subset to create candidate matches: Q_na = T_na x BR, and Q_a = T_a x BR.
  4. Predict matches on Q_na using names and addresses (model M_na)
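Steps A2 and A3 can be sketched in base R as follows. This is a minimal illustration, not the project's implementation. The toy data, the column names, and the helpers `block_pairs()` and `string_sim()` are all hypothetical. Exact blocking on the first three characters of the postal code stands in for the fuzzy blocking described above, and a normalised edit distance stands in for whatever similarity features the real models use.

```r
# Hypothetical toy data: column names and values are illustrative only.
tcod <- data.frame(
  shipment_id = 1:3,
  name        = c("ACME TRUCKING", NA, "NORTHERN FREIGHT"),
  address     = c("12 MAIN ST", "12 MAIN STREET", "99 LAKE RD"),
  postal_code = c("K1A0B1", "K1A0B1", "P3E2C6"),
  stringsAsFactors = FALSE
)
br <- data.frame(
  firm_id     = c(101, 102, 103),
  name        = c("ACME TRUCKING INC", "MAIN ST BAKERY", "NORTHERN FREIGHT LTD"),
  address     = c("12 MAIN ST", "14 MAIN ST", "99 LAKE ROAD"),
  postal_code = c("K1A0B1", "K1A0B2", "P3E2C6"),
  stringsAsFactors = FALSE
)

# Step A2: split the shipments by whether a name is available.
t_na <- tcod[!is.na(tcod$name), ]  # names and addresses
t_a  <- tcod[is.na(tcod$name), ]   # addresses only

# Step A3: blocking. Only pair records that share a blocking key
# (assumed here: the first three characters of the postal code).
block_pairs <- function(shipments, firms) {
  shipments$block <- substr(shipments$postal_code, 1, 3)
  firms$block     <- substr(firms$postal_code, 1, 3)
  merge(shipments, firms, by = "block", suffixes = c("_tcod", "_br"))
}

# Normalised edit-distance similarity in [0, 1]; 1 means identical strings.
string_sim <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b)
  1 - unname(d) / pmax(nchar(a), nchar(b))
}

q_na <- block_pairs(t_na, br)  # candidate pairs with names and addresses
q_a  <- block_pairs(t_a, br)   # candidate pairs with addresses only

# Similarity features that a match model could use.
q_na$name_sim    <- string_sim(q_na$name_tcod, q_na$name_br)
q_na$address_sim <- string_sim(q_na$address_tcod, q_na$address_br)
q_a$address_sim  <- string_sim(q_a$address_tcod, q_a$address_br)
```

Blocking keeps the candidate sets far smaller than the full cross product of shipments and firms, which matters once the inputs are the size of the TCOD and the BR.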

B. Out-of-sample matching predictions:

  1. Predict matches on Q_na using only addresses (model M_a).
  2. Evaluate both models on Q_na using the F1 score (the harmonic mean of precision and recall).
  3. Use M_a model to predict matches on Q_a
  4. Merge matches from Q_na and Q_a back onto TCOD
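The evaluation logic in part B can be sketched as below, under several assumptions. The proposal does not name a model, so logistic regression (`glm`) is used only as a stand-in. The candidate pairs are simulated. The helpers `f1_score()` and `predict_match()`, the 0.5 threshold and the train/test split are hypothetical choices for illustration.

```r
set.seed(1)

# Simulated candidate pairs standing in for Q_na (all values hypothetical).
n <- 500
is_match <- rbinom(n, 1, 0.3)
q_na <- data.frame(
  is_match    = is_match,
  name_sim    = pmin(pmax(rnorm(n, 0.4 + 0.4 * is_match, 0.15), 0), 1),
  address_sim = pmin(pmax(rnorm(n, 0.5 + 0.3 * is_match, 0.20), 0), 1)
)

# Hold out part of Q_na so both models are scored on pairs they did not see.
train_idx <- sample(n, 350)
train <- q_na[train_idx, ]
test  <- q_na[-train_idx, ]

# M_na uses names and addresses; M_a uses addresses only.
m_na <- glm(is_match ~ name_sim + address_sim, family = binomial, data = train)
m_a  <- glm(is_match ~ address_sim,            family = binomial, data = train)

# F1 is the harmonic mean of precision and recall.
f1_score <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Turn predicted probabilities into 0/1 match decisions.
predict_match <- function(model, newdata, threshold = 0.5) {
  as.integer(predict(model, newdata = newdata, type = "response") >= threshold)
}

f1_na <- f1_score(test$is_match, predict_match(m_na, test))
f1_a  <- f1_score(test$is_match, predict_match(m_a, test))

# Step B3: apply the address-only model to pairs that have no names (Q_a).
q_a <- data.frame(address_sim = c(0.95, 0.40))
q_a$match_prob <- predict(m_a, newdata = q_a, type = "response")
```

Comparing `f1_a` with `f1_na` on the same held-out pairs gives a rough sense of how much accuracy is lost when names are unavailable, which is the quantity the out-of-sample predictions on Q_a depend on.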

Results

A dataset that gives potential matches between the TCOD and BR, along with the model-implied certainty of each match. A match that uses name and address information is likely to be a certain match, while a match that uses only an address is less certain, especially if two or more businesses are registered at that address. We would like the user of the resulting matched dataset to be able to access and understand this uncertainty.
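One way the uncertainty could be exposed to users is sketched below. The column names (`match_prob`, `used_name`, `n_candidates`) and the values are hypothetical and are not the project's actual output format.

```r
# Hypothetical matched output: one row per candidate (shipment, firm) pair.
matches <- data.frame(
  shipment_id = c(1, 2, 2, 3),
  firm_id     = c(101, 101, 102, 103),
  match_prob  = c(0.97, 0.55, 0.40, 0.88),
  used_name   = c(TRUE, FALSE, FALSE, TRUE)
)

# Number of candidate firms competing for each shipment.
matches$n_candidates <- ave(matches$firm_id, matches$shipment_id, FUN = length)

# Keep the most probable candidate per shipment, but retain the probability
# and the candidate count so users can apply their own threshold.
ordered <- matches[order(matches$shipment_id, -matches$match_prob), ]
best    <- ordered[!duplicated(ordered$shipment_id), ]
```

In this toy example shipment 2 has two address-only candidates with similar probabilities, which is exactly the case a user would want to be able to see and treat with caution.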

Workflow and implementation

The workflow for this project is: R for the code, RStudio for the integrated development environment (IDE), and RMarkdown for the documentation/communication.

The data scientist Hilary Parker suggested that a data development/analysis workflow should satisfy three conditions: it should be (a) reproducible and auditable, (b) accurate, and (c) transparent. The R/RStudio/RMarkdown combination, when properly implemented, can satisfy all three. (Although current Statcan analysts have these goals in mind when writing SAS and Stata projects, the R toolkit puts them front and centre and makes them much easier to achieve.)

Why R for this project and workflow?

The R/RStudio/RMarkdown workflow puts these goals first. Writing an R package with these developer tools requires the developer to put organization, documentation and testing first, so the workflow can be reproducible, accurate and transparent.

R

R, the statistical programming language itself, is powerful and open source, and is one of the standard languages used in data science and statistics. R has several packages for testing: the package testthat is written by RStudio and integrates well with their development tools, and the data testing package assertr is written by the organization rOpenSci.
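As a small illustration of what a unit test looks like with testthat, the sketch below tests a hypothetical address-cleaning helper. `clean_address()` is invented for this example and is not a function from the project.

```r
library(testthat)

# Hypothetical helper: standardise an address string before matching.
clean_address <- function(x) {
  x <- toupper(trimws(x))           # drop outer whitespace, upper-case
  x <- gsub("[[:punct:]]", "", x)   # remove punctuation
  gsub("\\s+", " ", x)              # collapse repeated whitespace
}

test_that("clean_address standardises case, punctuation and spacing", {
  expect_equal(clean_address("  12 main   st. "), "12 MAIN ST")
  expect_equal(clean_address(c("a", "B")), c("A", "B"))
})
```

In a package, tests like this would normally live under `tests/testthat/` and run together, so a change that breaks address cleaning is caught before it affects the match.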

RStudio

RStudio is an integrated development environment (IDE) that makes R more user-friendly, with built-in help, shortcuts and syntax highlighting. Most importantly, it supports project organization, package development, documentation, testing and package sharing.

RMarkdown

The final step of an analysis is communicating results. Normally, one would write code that outputs results to an Excel table, and then write a separate report detailing the data processing and methodology. With RMarkdown, you can write an executable script that also includes descriptions and explanations of each step of the methodology. With confidential data, you can choose which output to display and present the final copy as a static document. With non-confidential (or synthetic) data, you can provide a dynamic, interactive document that a reader can use to walk through the code and results step by step.

Conclusion

One half of the proposal is a data development project: matching the TCOD and BR using machine learning. The other half is the workflow supporting that project. The deliverables include the matched dataset, the reproducible workflow and documentation, and recommendations and lessons from implementing the workflow. The code should include functions that can be easily shared and re-used in other projects.

The focus of this workflow is communication; work is only useful if it's communicated, and communicating the workflow and code is becoming as important as communicating the result itself, if we're interested in improving the process of analysis and research.



tweed1e/matchtools documentation built on May 29, 2019, 10:51 a.m.