A pilot development analysis workflow using R, RStudio and RMarkdown, with an application using machine learning to match two datasets. The proposal has two parts: the data development project itself (described in section Statistical methodology), and the workflow supporting that project (described in section Workflow and implementation).
2018-2019 Proposal timeline: April 1st-June 30th, including some data development that's already underway.
| Group and level | Name | Home div. | # days | $ |
|-----------------|------|-----------|--------|---|
| EC-04 | Jesse Tweedle | EAD | 45 | (in kind) |
| EC-?? | Claudiu Motoc | EAD | 3 | (in kind) |
| ??-?? | Data Science Working Grp | BSMD | 3 | (in kind) |
| EC-?? | Mark Brown | EAD | 3 | (in kind) |
| EC-?? | Sarah Swan | EAD | 3 | (in kind) |
| TOTAL | | | | ?? |
We have shipments with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the BR, so we can infer supply chains, transportation logistics information, intra-firm vs. inter-firm trade, and more. The basic obstacle in this project is that a large subset of shipments on the TCOD do not have shipper or receiver names, which means I can rely only on addresses to match them. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD with both names and addresses, and apply that model to predict matches for the out-of-sample subset of the TCOD without names.
The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and inform the re-development of transportation surveys.
We combine a standard matching methodology with a non-standard out-of-sample match prediction method. The steps are:

1. Start from three datasets: `T_na` (TCOD shipments with names and addresses), `T_a` (TCOD shipments with addresses only), and `BR`.
2. Form the candidate pairs `Q_na = T_na × BR` and `Q_a = T_a × BR`.
3. Match `Q_na` using names and addresses (model `M_na`).
4. Match `Q_na` using only addresses (model `M_a`).
5. Use the `M_a` model to predict matches on `Q_a`.
6. Merge `Q_na` and `Q_a` back onto TCOD.

The result is a dataset that gives potential matches between the TCOD and BR, along with the model-implied certainty of each match. A match that uses name and address information is likely to be a certain match, while a match that uses only an address is less certain, especially if two or more businesses are registered at that address. We would like the user of the resulting matched dataset to be able to access and understand this uncertainty.
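The steps above can be sketched in base R on toy data. This is only an illustration under stated assumptions, not the project's implementation: all records are invented, the helper `string_sim()` is hypothetical, a simple threshold rule stands in for model `M_na`, and a logistic regression stands in for whichever machine learning method is used for `M_a`.

```r
# Toy stand-ins for the inputs; every value below is invented for illustration.
T_na <- data.frame(                      # TCOD shipments with names and addresses
  ship_id = 1:3,
  name    = c("Acme Freight", "Borealis Foods", "Cedar Mills"),
  address = c("100 Main St", "25 King St W", "9 Mill Rd"),
  stringsAsFactors = FALSE
)
T_a <- data.frame(                       # TCOD shipments with addresses only
  ship_id = 4:5,
  address = c("100 Main Street", "77 Lake Ave"),
  stringsAsFactors = FALSE
)
BR <- data.frame(                        # two firms share "100 Main St"
  firm_id    = 1:4,
  br_name    = c("Acme Freight Inc", "Borealis Foods Ltd", "Cedar Mills", "Delta Tools"),
  br_address = c("100 Main St", "25 King Street West", "9 Mill Rd", "100 Main St"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: similarity in [0, 1] from the Levenshtein edit distance.
string_sim <- function(a, b) {
  d <- mapply(function(x, y) utils::adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

# Candidate pairs: Q_na = T_na x BR and Q_a = T_a x BR (by = NULL is a cross join).
Q_na <- merge(T_na, BR, by = NULL)
Q_a  <- merge(T_a,  BR, by = NULL)
Q_na$addr_sim <- string_sim(Q_na$address, Q_na$br_address)
Q_na$name_sim <- string_sim(Q_na$name,    Q_na$br_name)
Q_a$addr_sim  <- string_sim(Q_a$address,  Q_a$br_address)

# Stand-in for M_na (assumption): a threshold rule on names and addresses labels Q_na.
Q_na$match <- as.integer(Q_na$name_sim > 0.7 & Q_na$addr_sim > 0.5)

# M_a: address-only model fitted on the labelled pairs, then applied to Q_a.
M_a <- glm(match ~ addr_sim, family = binomial(), data = Q_na)
Q_a$p_match <- predict(M_a, newdata = Q_a, type = "response")

# Candidate matches for the name-less shipments, most likely first.
Q_a[order(Q_a$ship_id, -Q_a$p_match), c("ship_id", "firm_id", "addr_sim", "p_match")]
```

In this toy example, shipment 4 receives the same match probability for firms 1 and 4 because both are registered at the same address. That is the kind of uncertainty the matched dataset is meant to expose to its users.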
The workflow for this project is: R for the code, RStudio for the integrated development environment (IDE), and RMarkdown for the documentation/communication.
The data scientist Hilary Parker suggested that a data development/analysis workflow should be (a) reproducible and auditable, (b) accurate, and (c) transparent. The R/RStudio/RMarkdown combination, when properly implemented, satisfies these conditions. (Although current Statcan analysts have these goals in mind when writing SAS and Stata projects, the R toolkit puts them front and centre and makes them much easier to achieve.)
Writing an R package with these developer tools requires the developer to put organization, documentation and testing first, which is what makes the workflow reproducible, accurate and transparent.
R, the statistical programming language itself, is powerful and open source, and is one of the standard languages used in data science and statistics. R has several packages for testing; the package `testthat` is written by RStudio and integrates well with their development tools, and the data testing package `assertr` is written by the organization rOpenSci.
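As a rough illustration of how the two packages divide the work, the sketch below uses `testthat` to test a function and `assertr` to check a dataset. The helper `clean_address()`, the `pairs` data frame and its values are hypothetical and appear only for this example. Each package is used only if it is installed.

```r
# Hypothetical helper: normalise an address string before matching.
clean_address <- function(x) {
  toupper(gsub("\\s+", " ", trimws(x)))
}

# Unit test of the code with testthat (skipped if the package is not installed).
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("clean_address trims, collapses spaces and upper-cases", {
    testthat::expect_equal(clean_address("  100  main st "), "100 MAIN ST")
    testthat::expect_equal(clean_address(character(0)), character(0))
  })
}

# Made-up candidate pairs with made-up match probabilities.
pairs <- data.frame(
  ship_id = c(4L, 4L, 5L),
  address = clean_address(c("100 main street", "100 main street", "77 lake ave")),
  p_match = c(0.62, 0.62, 0.08),
  stringsAsFactors = FALSE
)

# Checks of the data with assertr (also skipped if not installed).
# Each call returns the data unchanged when the check passes and stops otherwise.
if (requireNamespace("assertr", quietly = TRUE)) {
  pairs <- assertr::assert(pairs, assertr::not_na, ship_id, address)
  pairs <- assertr::assert(pairs, assertr::within_bounds(0, 1), p_match)
  pairs <- assertr::verify(pairs, nchar(address) > 0)
}
```

Tests of the code guard against programming errors, while checks on the data guard against unexpected inputs; a matching project plausibly needs both.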
RStudio is an integrated development environment (IDE) for R. Its GUI makes R more user-friendly (with help, shortcuts and syntax highlighting) and, most importantly, it supports project organization, package development, documentation, testing and package sharing.
The final step of an analysis is communicating results. Normally, one would write code that outputs results to an Excel table, and then write a separate report detailing the data processing and methodology. With RMarkdown, you can write an executable script that also includes descriptions and explanations of each step of the methodology. With confidential data, you can choose which output is displayed and present the final copy as a static document; with non-confidential (or synthetic) data, you can provide a dynamic, interactive document that a reader can use to walk through each step of the code and results.
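To make the choice of displayed output concrete, the sketch below writes a small RMarkdown file from R. The file name, title, chunk labels and scores are all hypothetical. The chunk options `include = FALSE` and `echo = FALSE` are the standard knitr options for hiding a chunk entirely or hiding only its code.

```r
# Three backticks, built in code so this sketch nests cleanly inside a document.
fence <- strrep("`", 3)

# A minimal, hypothetical RMarkdown report: prose plus two executable chunks.
rmd_lines <- c(
  "---",
  "title: \"Match summary (illustrative)\"",
  "output: html_document",
  "---",
  "",
  "Candidate pairs are scored with the address-only model.",
  "",
  # include = FALSE: the chunk runs, but neither its code nor its output is shown.
  paste0(fence, "{r load-data, include = FALSE}"),
  "pairs <- data.frame(p_match = c(0.9, 0.6, 0.1))  # made-up scores",
  fence,
  "",
  # echo = FALSE: the output is shown, but the code that produced it is hidden.
  paste0(fence, "{r summary, echo = FALSE}"),
  "summary(pairs$p_match)",
  fence
)

rmd_path <- file.path(tempdir(), "match_report.Rmd")
writeLines(rmd_lines, rmd_path)

# Rendering needs the rmarkdown package and pandoc, so it is left commented out:
# rmarkdown::render(rmd_path)
```

With confidential data, a chunk that touches the microdata could use `include = FALSE`, so that only aggregate output reaches the static document.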
One half of the proposal is a data development project: matching the TCOD and BR using machine learning. The other half is the workflow of that project. The deliverables include the matched dataset, the reproducible workflow and documentation, along with recommendations and lessons from implementing the workflow. The code should include functions that can be easily shared and re-used in other projects.
The focus of this workflow is communication; work is only useful if it's communicated, and communicating the workflow and code is becoming as important as communicating the result itself, if we're interested in improving the process of analysis and research.