
R Project Pitch / Analytical Projects Initiative (API) 2018-19

Reframe as "a prototype/pilot analysis development workflow using R, with an application to using machine learning to match two datasets that we need for research and data development." R and its tools made this possible. (And a decent amount of R expertise already exists at StatCan for various reasons: previous jobs, school, side projects, general interest.)

[Cite the opinionated analysis development workflow from Hilary Parker (@hspter), especially these slides and paper.]

Reproducible and auditable
  • Executable analysis scripts
  • Defined dependencies
  • Watchers for changed code and data
  • Version control (individual)
  • Code review

Accurate
  • Modular, tested code
  • Assertive testing of data, assumptions and results
  • Code review

Collaborative
  • Version control (collaborative)
  • Issue tracking

From Jesse Maegan: "I’ve found it to be most helpful to pilot a workflow or initiative with clear indicators of success as well as scalability and sustainability. Working with a smaller group before an org-wide roll-out helps you suss out (and address) any potential problems, demonstrate success, and get more organizational buy-in.

Ultimately you want people to want to adopt this workflow, not be told that they have to."

Outline

  1. Underlying philosophy: transparency, reproducibility, modernization, discoverability, knowledge transfer
  2. Business objective: match two datasets (a non-standard application of machine learning and matching, also known as record linkage, entity resolution or document retrieval), then do the analysis.
  3. Deliverable: an R package that encapsulates data development, analysis and evaluation for this project, including documentation and code (see the package-setup sketch after this list).
  4. Timeline: Soon-ish.
  5. Budget: mainly just me.
  6. Stakeholders: see below.
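
A minimal sketch of setting up the proposed package skeleton, using the usethis and devtools packages; the package and vignette names below are placeholders, not final decisions.

```r
# Hypothetical package setup for the deliverable
usethis::create_package("matchtools")
usethis::use_git()                       # version control from the start
usethis::use_testthat()                  # unit test infrastructure
usethis::use_vignette("tcod-br-match")   # long-form documentation of the match
devtools::document()                     # generate function documentation
devtools::check()                        # run tests and R CMD check
```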

More:

Scope? Deliverable? Timeline? Budget? Stakeholders? (IT, users, developers, us, data people, and people who can use the code in the future.)

Scope: a single package, a single project, but also a very visible example of openness and reproducibility that could provide a template for a broader workflow.

Introduction

Match the Business Register (BR) to the Trucking Commodity Origin and Destination (TCOD) survey. The first goal is to get it to work. After that, the focus is on philosophy: transparency, reproducibility, modernization, discoverability and knowledge transfer, so the lessons, skills and code can be applied to more than just one project.

Research objective

We have shipment records with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the BR so we can infer supply chains, transportation logistics, intra-firm vs. inter-firm trade, and more. The basic obstacle is that a large subset of shipments on the TCOD have no shipper or receiver names, which means matching must rely on addresses alone. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD that has both names and addresses, then apply that model to predict matches for the out-of-sample subset of the TCOD without names.
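
As a rough illustration of the evaluation step, a minimal sketch follows. It assumes a hypothetical table of candidate TCOD-BR pairs with address-similarity features and a name-based match label; the column names, features and the simple logistic-regression model are placeholders, not the final method.

```r
library(dplyr)

# candidate_pairs: one row per (TCOD shipment, BR firm) candidate pair, with
# address-similarity features and, where names exist, a 0/1 name-based label
labelled   <- filter(candidate_pairs, has_names)
unlabelled <- filter(candidate_pairs, !has_names)

# Fit an address-only match model on the labelled subset
fit <- glm(is_match ~ postal_code_exact + street_similarity + city_exact,
           data = labelled, family = binomial())

# Score the out-of-sample subset that has no shipper or receiver names
unlabelled$match_prob <- predict(fit, newdata = unlabelled, type = "response")
```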

The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and they inform the redevelopment of transportation surveys.

Data sources

The BR and the TCOD, along with postal code files, geography files and other supporting data.

Methodology

Name and address standardization; machine learning; and matching, using both name-and-address and address-only models.
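
For illustration only, here is a minimal sketch of the kind of address standardization assumed above; the specific rules and abbreviations are placeholders, not the production cleaning rules.

```r
# Hypothetical address standardization: uppercase, strip punctuation,
# normalize a few common abbreviations, collapse whitespace
standardize_address <- function(x) {
  x <- toupper(x)
  x <- gsub("[[:punct:]]", " ", x)
  x <- gsub("\\bSTREET\\b", "ST", x)
  x <- gsub("\\bAVENUE\\b", "AVE", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

standardize_address("123 Main Street, Suite 4")
#> [1] "123 MAIN ST SUITE 4"
```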

Results

A dataset of matches between the TCOD and the BR, which serves as an input into the research and survey projects described above.

Non-research objective (aka process improvement)

One metric I use to judge the success of a finished project: how much have I improved the tools and processes I used to do it? In other words, if I started the project over from scratch, how much faster could I finish it? Most research and analytics projects I've been involved in have resulted in little or no process improvement, and I would guess our research workflows have not changed much in many years. [The usual cycle: ask around to find out where the data is, request it, find little or no documentation, struggle through strange variable definitions (imgeocode in the IPTF, anyone?), concordances, classifications and missing variables, then make the best of it by writing code that does one specific thing and that no one can ever find again. Try to get extra Stata packages installed, find they don't work, do the same for R, shuffle data back and forth between SAS, .csv and .dta, and so on.]

This project is a prototype of a workflow that aims to improve the process of data development and analysis, mainly through transparency and reproducibility.

A list of the important non-research objectives:

  1. improve code and documentation; knowing you'll re-use code helps you think harder about future users (including and especially yourself)
    • write clear and concise code
    • write clear documentation at the time you write the code
    • write code that can be adapted and used in similar situations
    • write code that follows a common style guide to improve readability and harmonization with existing code
    • write code with demos and unit tests to validate each step of the analysis (see the testing sketch after this list)
    • clearly organize project, code, documentation and data
  2. improve discoverability and knowledge transfer
    • make code, documentation, classifications and concordances (not microdata) publicly available
    • easy for users to discover similar methods, data, code, rules, and overall data development strategies
    • aids user/developer understanding (they can see the details of the analysis, concordance rules, filtering rules)
  3. improve the code improvement process
    • make it easy to collaborate with other developers to fix bugs and incorporate improvements
    • once a bug is fixed, make it easy for users to update across all existing projects (instead of manually searching / finding bugs from copy/pasted code snippets someone emailed to you)
    • make it easy for others to test/deploy code with different situations
    • automate the development/analysis update process (add new years of data, apply to different datasets)
  4. reproducibility
    • anyone should be able to test code and analysis
    • anyone with access to microdata should be able to reproduce results, or see how minor changes could affect analysis
    • anyone should be able to apply the code to different situations
    • anyone should be able to look at, understand, re-run, improve, adapt, and/or take over a project from the original developer
  5. (flexibility?)
    • shouldn't be tied to one language, etc.?
    • although I propose R for this, I want to instill a focus on openness and reproducibility that could be translated to whichever language/approach is best for any individual project.
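
To make the testing point in item 1 concrete, here is a minimal sketch using the testthat package; the function, file path and expectations are hypothetical placeholders rather than the project's actual tests.

```r
library(testthat)

test_that("standardize_address() uppercases and abbreviates", {
  expect_equal(standardize_address("123 Main Street"), "123 MAIN ST")
})

test_that("match output is one row per shipment with valid probabilities", {
  matches <- readRDS("output/matches.rds")  # placeholder path
  expect_false(any(duplicated(matches$shipment_id)))
  expect_true(all(matches$match_prob >= 0 & matches$match_prob <= 1))
})
```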

(Tangential benefits: employee professional development, learning language- and programming-specific skills, making it easier to learn from others' code, and making it easier to showcase accomplishments publicly.)

Background and general motivation for these principles

This data development and analysis philosophy isn't specific to one language or field; most of it could be accomplished in almost any language. Many businesses, academics and governments already incorporate some or all of these points, and modern tech companies take it a step further: Airbnb's internal rbnb package for standardizing data science, Twitter's equivalents, Google's open-source contributions, and so on.

The goal is to emulate these organizations' focus on openness, which arguably matters even more for a government agency, given the increasing emphasis on openness and transparency and the adoption of digital services across departments (see CDS-SNC). Things should be open: classifications, concordances, and anything else that isn't proprietary should at least be available and discoverable internally (e.g., although we can't release the postal code conversion file publicly, we should have easier access to the file and its documentation, or give ourselves a data workflow to get a historically accurate version for our own purposes).

For this proposal

For this project, I propose to use R, RStudio and RMarkdown (discussed below), together with the version control, code review and testing practices outlined above.

There may be opportunities to develop an internal package repository like CRAN (https://cran.r-project.org/) using drat or similar.
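
As a rough sketch of how that could work (drat is a real package, but the package name and repository paths below are placeholders for illustration):

```r
# Publish a built package to an internal, CRAN-like repository with drat
drat::insertPackage("matchtools_0.1.0.tar.gz",
                    repodir = "N:/packages/statcan-drat")

# Colleagues then install and update it like any CRAN package
install.packages("matchtools", repos = "file:///N:/packages/statcan-drat")
```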

Why R?

R is a free, open-source statistical programming language with a large ecosystem of packages for data development, analysis and reporting.

Why RStudio?

RStudio is a free, open-source integrated development environment (IDE) for R.

Why RMarkdown?

RMarkdown combines narrative text, code and results in a single document, so reports and analyses can be regenerated directly from the source.
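
For example (the file name is a placeholder), the whole analysis report can be rebuilt from code and data with a single call:

```r
# Re-run the analysis and regenerate the report from source
rmarkdown::render("analysis/tcod-br-match.Rmd")
```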

Why these together?

They are built to work together. RStudio's chief scientist, Hadley Wickham, led development of many of the most popular R packages (ggplot2, dplyr and the rest of the tidyverse).

He has also written very helpful books and tutorials for beginner and advanced R users (e.g., R for Data Science, Advanced R).

Integrated, easy to learn, easy to advance.


