Reframe as "a prototype/pilot of a development analysis workflow using R, with an application to using ML to match two datasets (which we need for research and data development)." R and its tools helped to accomplish this. (And a decent amount of R expertise already exists in Statcan for various reasons: previous jobs, school, side projects, general interest.)
[Cite the opinionated analysis development workflow from Hilary Parker / @hspter, especially her slides and paper.]
Tools: validate or assertr, packages to check properties of a dataset; testthat, for unit-testing code.
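A minimal sketch of assertive data checks in base R (validate and assertr provide richer grammars for this; the dataset and column names here are hypothetical):

```r
# Hypothetical shipment data; in practice this would be a TCOD extract.
shipments <- data.frame(
  shipment_id   = 1:3,
  origin_postal = c("K1A0B1", "M5V3L9", "H3B4W8"),
  weight_kg     = c(120, 45.5, 800)
)

# Assertive checks: fail loudly and early if assumptions are violated.
stopifnot(
  !any(duplicated(shipments$shipment_id)),        # unique keys
  all(shipments$weight_kg > 0),                   # positive weights
  all(grepl("^[A-Z][0-9][A-Z][0-9][A-Z][0-9]$",   # Canadian postal code shape
            shipments$origin_postal))
)
```

Packages like assertr let these checks live inside a pipeline instead of as standalone statements.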
Reproducible and auditable: executable analysis scripts; defined dependencies; watchers for changed code and data; version control (individual); code review.
Accurate: modular, tested code; assertive testing of data, assumptions and results; code review.
Collaborative: version control (collaborative); issue tracking.
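As one illustration of "watchers for changed code and data", a minimal base R sketch that reruns an analysis only when an input file's checksum changes (dedicated tools like GNU Make or the drake package do this properly; file names here are hypothetical):

```r
# Toy input file so the sketch is self-contained.
writeLines("id,weight\n1,120", "shipments.csv")

input <- "shipments.csv"
stamp <- "shipments.md5"   # stores the checksum from the last run

current  <- unname(tools::md5sum(input))
previous <- if (file.exists(stamp)) readLines(stamp) else ""

if (!identical(current, previous)) {
  message("Input changed; rerunning analysis...")
  # source("analysis.R")   # hypothetical analysis script
  writeLines(current, stamp)
}
```

On the next run with an unchanged file, the checksum matches and the expensive step is skipped.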
From Jesse Maegan: "I’ve found it to be most helpful to pilot a workflow or initiative with clear indicators of success as well as scalability and sustainability. Working with a smaller group before an org-wide roll-out helps you suss out (and address) any potential problems, demonstrate success, and get more organizational buy-in.
Ultimately you want people to want to adopt this workflow, not be told that they have to."
Scope? Deliverable? Timeline? Budget? Stakeholders (IT, users, developers, us, data people, and people who can use the code in the future).
Scope: a single package, a single project. But a very visible example of openness and reproducibility that can provide a template for a workflow...?
Match the BR (Business Register) and the TCOD (Trucking Commodity Origin and Destination survey). First goal is to get it to work. After that, focus on philosophy: transparency, reproducibility, modernization, discoverability, knowledge transfer. So the lessons/skills/code can be applied to more than just one project.
We have shipments with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the BR so we can infer supply chains, transportation logistics information, intra-firm vs. inter-firm trade, and more. The basic obstacle in this project is that a large subset of shipments on the TCOD do not have shipper or receiver names, which means I'll rely only on the addresses to match. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD with both names and addresses, and apply that model to predict matches for the out-of-sample subset of the TCOD without names.
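A hedged sketch of that evaluate-then-extrapolate step in base R, with simulated data standing in for the TCOD-BR candidate pairs and a plain logistic regression standing in for whatever model is ultimately chosen (e.g. glmnet or xgboost):

```r
set.seed(1)

# Toy stand-in for candidate matches: each row is a shipment-to-BR-firm
# pair with hypothetical address-similarity features.
pairs <- data.frame(
  addr_sim     = runif(200),           # address string similarity
  postal_match = rbinom(200, 1, 0.5),  # same postal code?
  has_name     = rbinom(200, 1, 0.6)   # does the shipment carry a firm name?
)
# On the named subset we can construct true match labels (simulated here).
pairs$is_match <- rbinom(200, 1,
                         plogis(-2 + 4 * pairs$addr_sim + pairs$postal_match))

labeled   <- pairs[pairs$has_name == 1, ]  # names available: training labels
unlabeled <- pairs[pairs$has_name == 0, ]  # address-only shipments

# Fit an address-only model on the labeled subset...
fit <- glm(is_match ~ addr_sim + postal_match,
           data = labeled, family = binomial)

# ...then score the out-of-sample, address-only subset.
unlabeled$match_prob <- predict(fit, newdata = unlabeled, type = "response")
```

The key design point is that the labeled (name-bearing) subset is used only to train and evaluate a model whose features are available everywhere, so the fitted model can be applied to the name-free shipments.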
The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and inform the re-development of transportation surveys.
The BR and the TCOD, along with postal code files, geography files, and other reference data.
Name and address standardization. Machine learning. Matching, using name and address models.
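As a toy illustration of the address standardization step in base R (the abbreviation dictionary is illustrative only; production standardization would use a much fuller mapping):

```r
standardize_address <- function(x) {
  x <- toupper(trimws(x))
  x <- gsub("[[:punct:]]", " ", x)    # drop punctuation
  x <- gsub("\\bSTREET\\b", "ST", x)  # common abbreviations (illustrative)
  x <- gsub("\\bAVENUE\\b", "AVE", x)
  x <- gsub("\\bROAD\\b", "RD", x)
  gsub("\\s+", " ", x)                # collapse whitespace
}

standardize_address("123  Main Street, Ottawa")
# "123 MAIN ST OTTAWA"
```

Standardizing both sides (TCOD addresses and BR addresses) before computing similarity features makes string comparisons far more reliable.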
A dataset of matches between the TCOD and BR. Inputs into many things.
One metric I use to judge the success of a finished project: how much have I improved the tools and processes I used to do it? In other words, if I started the project over from scratch, how much faster could I finish it? Most research/analytics projects I've been involved in have resulted in little or no process improvement, and I would guess our research workflows have not changed much in many years. [A typical project: ask around to find where the data is, request it, find little or no documentation, struggle through strange variable definitions (imgeocode in IPTF, anyone?), concordances and classifications, missing variables, and so on; then try to make the best of it, writing code to do one specific thing that no one can ever find again. Try to get extra Stata packages installed, find they don't work, do the same for R, shuffle data back and forth between SAS, .csv, and .dta, and so on.]
This project is a prototype of a workflow that aims to improve the process of data development and analysis, mainly through transparency and reproducibility.
(Tangential benefits: employee professional development, learning specific language and programming skills, and making it easier to learn from others' code and to showcase accomplishments publicly.)
This data development/analysis philosophy isn't specific to one language or field; most of it could be accomplished in any language, and many businesses, academics and governments already incorporate some or all of these points. But modern tech companies take it one step further: Airbnb's internal rbnb package to standardize data science, Google's open source contributions, and similar efforts at Twitter and elsewhere.
The goal is to emulate these organizations' focus on openness, which is arguably even more important for a government agency, given the increasing emphasis on openness and transparency and the adoption of digital services across departments (see CDS-SNC). Classifications, concordances, and anything else that isn't proprietary should at least be available and discoverable internally. For example, although we can't release the postal code conversion file publicly, we should have easier access to the file and its documentation, or a data workflow that gives us a historically accurate version for our own purposes.
For this project, I propose to use:
There may be opportunities to develop an internal package repository like CRAN (https://cran.r-project.org/) using drat or similar.
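For instance, drat can publish packages to a shared directory that users then add to their repository list (this assumes the drat package is installed; the paths and package name below are hypothetical):

```r
# Publish a built package into an internal drat repository on a shared drive.
drat::insertPackage("ourpkg_0.1.0.tar.gz",
                    repodir = "//shared-drive/r-repo")

# Users point R at the internal repo alongside CRAN...
options(repos = c(internal = "file:////shared-drive/r-repo",
                  CRAN     = "https://cran.r-project.org"))

# ...and install internal packages the usual way.
install.packages("ourpkg")
```

This keeps internal packages discoverable through the standard install.packages() mechanism rather than ad hoc file sharing.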
R is an open source statistical programming language.
high level like Stata, but much easier to program in
well built, well developed and supported tools
data manipulation (the tidyverse), graphics / data visualization (ggplot2), and machine learning (glmnet, xgboost, caret, etc.), all built to work together. RStudio's chief scientist led development of the most popular R packages (ggplot2, dplyr and the tidyverse).
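A small taste of the pipeline style these packages encourage, here in base R with the native pipe (R >= 4.2 for the `_` placeholder; the tidyverse equivalent chains dplyr::filter, group_by and summarise):

```r
# Mean mpg by cylinder count for higher-powered cars in the built-in
# mtcars dataset, written as a left-to-right pipeline.
mtcars |>
  subset(hp > 100) |>
  aggregate(mpg ~ cyl, data = _, FUN = mean)
```

Each step reads in the order it happens, which is a large part of why this style is easy to learn and review.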
He also wrote very helpful books and tutorials on beginner and advanced R for data science.
Integrated, easy to learn, easy to advance.
leveraging a huge existing user base that's willing to help
Popularity stats: R is a large and fast-growing data science language on Stack Overflow (https://stackoverflow.blog/2017/05/09/introducing-stack-overflow-trends/, http://r4stats.com/articles/popularity/); there's also an R vs. SAS graph of job postings on Indeed, I think: https://i1.wp.com/r4stats.com/wp-content/uploads/2017/02/Fig-1c-R-v-SAS-2017-02-18.png