r-api-pitch-2.md
In tweed1e/matchtools: Tools For Matching Firms From Different Datasets

R Analysis Workflow Pilot

Project summary

A pilot development analysis workflow using R, RStudio and RMarkdown, with an application using machine learning to match two datasets. The proposal has two parts: the data development project itself (described in section Statistical methodology), and the workflow supporting that project (described in section Workflow and implementation).

Deliverables:

a matched dataset of shipments and firms
reproducible workflow, code and documentation
recommendations and lessons from implementing the workflow

Schedule:

2018-2019 Proposal timeline: April 1st-June 30th, including some data development that's already underway.

April 1, 2018: continue data development that is already underway
June 29, 2018: deliver workflow code and documentation, and dataset
July 31, 2018: deliver report on lessons learned from implementing workflow

Budget:

| Group and level | Name | Home div. | # days | $ |
|-----------------|------|-----------|--------|---|
| EC-04 | Jesse Tweedle | EAD | 45 | (in kind) |
| EC-?? | Claudiu Motoc | EAD | 3 | (in kind) |
| ??-?? | Data Science Working Grp | BSMD | 3 | (in kind) |
| EC-?? | Mark Brown | EAD | 3 | (in kind) |
| EC-?? | Sarah Swan | EAD | 3 | (in kind) |
| TOTAL | | | | ?? |

Addressing API Scoring Guide

Building analytical capacity:

I propose a well-defined project (the matching methodology, workflow, and resulting dataset) that directly addresses current or emerging priorities relevant to Statistics Canada users (by providing a dataset that identifies goods shipments between firms).
This project will help build analytical expertise within Statistics Canada as well as among external user communities by providing easily sharable R code and documentation to implement a machine learning matching algorithm.
This project will promote collaboration between program areas as well as with external partners by combining expertise in the Economic Analysis Division (EAD), as well as IT and the Data Science Working Group in the Business Statistics Methodology Division (BSMD).
This project supports the professional development of project analysts by adding new tools (R, RStudio, RMarkdown) to our research toolkits, and by putting priority on developing programming and code documentation skills.

Addressing User Information Needs:

The dataset directly addresses knowledge and data gaps relevant to a user community, including ourselves (for intra-firm trade research), Transport Canada (identifying inter-modal shipments), and Innovation, Science, and Economic Development Canada (identifying manufacturing supply chains).
The documentation and workflow deliverables address information needs related to quality assurance and promote a richer understanding of the strengths and weaknesses of the data. They support internal and external users by documenting the R code that creates the dataset, while the dataset itself includes certainty measures on the matches.
This project evaluates the fitness-for-use of the data for specific analytical purposes by providing certainty measures on the matches that analysts can investigate for each application of the dataset.
This project creates a new analytical database (linked files) of trucking shipments matched to origin and destination firms.
The project uses the TCOD in a way that substantially enhances its analytical relevance, by identifying which firms are shipping and receiving the shipments.

Experimentation and the application of leading-edge methods:

This project applies leading-edge methods (machine learning application to matching, as well as documentation and workflow using the leading-edge tools in R/RStudio/Rmarkdown) to new data sources to enhance the analytical strength of these data (the TCOD database of shipments, allowing the identification of firm origins and destinations).
This project's approach to code and analysis dissemination via RMarkdown develops a new way of disseminating analytical information to stakeholders that aids in its interpretation and enhances its relevance. The code and documentation help users in and outside Statistics Canada interpret and use the resulting data and research.

Statistical methodology

Introduction

We have shipments with information on origin and destination firms (sometimes names, usually addresses). I want to match these data to firms on the Business Register (BR), so we can infer supply chains, transportation logistics information, intra-firm vs. inter-firm trade, and more. The basic obstacle in this project is that a large subset of shipments on the Transport Commodity Origin and Destination Survey (TCOD) do not have shipper or receiver names, which means I can rely only on the addresses to match them. I use machine learning techniques to evaluate an address-only match model on the subset of the TCOD with both names and addresses, and then apply that model to predict matches for the out-of-sample subset of the TCOD without names.

The results of this project are inputs into economic research projects and transportation projects (Transport division and Transport Canada), and inform the redevelopment of transportation surveys.

Data sources

Business Register (BR): 2013/01 GSUF, a dataset of firm names, street addresses, postal codes and industries.
Transport Commodity Origin and Destination Survey (TCOD): a dataset of trucking shipments; includes names, street addresses and postal codes of shippers and consignees. Focus on year 2012.
Postal code conversion file, 2011 and 2001 versions.
FSA (forward sortation area) shapefile, from the Statcan geo website.

Methodology

We combine a standard matching methodology with a non-standard out-of-sample match prediction method. The steps are:

A. Standard matching methodology:

Process and filter the BR and TCOD datasets.
Create subsets to match: T_na (TCOD shipments with names and addresses), T_a (TCOD shipments with addresses only), and BR.
Use fuzzy blocking on each subset to create candidate matches: Q_na = T_na × BR, and Q_a = T_a × BR.
Predict matches on Q_na using names and addresses (model M_na).
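The subset and blocking steps above can be sketched in base R. This is only an illustration under stated assumptions: the toy data, all column names, and the helper `make_candidates()` are hypothetical, and exact blocking on the first three postal code characters (the FSA) stands in for whatever fuzzy blocking the project actually uses. Similarity is a normalised edit distance from base R's `adist()`.

```r
# Toy stand-ins for the processed datasets; all column names are hypothetical.
tcod <- data.frame(
  shipment_id = 1:3,
  name        = c("ACME STEEL LTD", "NORTHERN FOODS", NA),
  address     = c("12 MAIN ST", "400 KING ST W", "12 MAIN STREET"),
  postal_code = c("K1A0B1", "M5V1K4", "K1A0B1"),
  stringsAsFactors = FALSE
)
br <- data.frame(
  firm_id     = c(101, 102, 103),
  name        = c("ACME STEEL LIMITED", "NORTHERN FOODS INC", "MAIN ST BAKERY"),
  address     = c("12 MAIN ST", "400 KING STREET WEST", "14 MAIN ST"),
  postal_code = c("K1A0B1", "M5V1K4", "K1A0B6"),
  stringsAsFactors = FALSE
)

# Split the TCOD into shipments with names (T_na) and without names (T_a).
t_na <- tcod[!is.na(tcod$name), ]
t_a  <- tcod[is.na(tcod$name), ]

# Blocking key (an assumption): the first three postal code characters.
fsa <- function(postal_code) substr(postal_code, 1, 3)

# Similarity in [0, 1] from edit distance; 1 means identical strings.
string_sim <- function(a, b) {
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b)
  unname(1 - d / pmax(nchar(a), nchar(b)))
}

# Candidate pairs: every shipment paired with every BR firm in the same block.
make_candidates <- function(shipments, firms) {
  shipments$fsa <- fsa(shipments$postal_code)
  firms$fsa     <- fsa(firms$postal_code)
  pairs <- merge(shipments, firms, by = "fsa", suffixes = c("_tcod", "_br"))
  pairs$address_sim <- string_sim(pairs$address_tcod, pairs$address_br)
  pairs$name_sim    <- string_sim(pairs$name_tcod, pairs$name_br)  # NA if no name
  pairs
}

q_na <- make_candidates(t_na, br)  # candidates with names and addresses
q_a  <- make_candidates(t_a, br)   # candidates with addresses only
```

Blocking keeps the candidate sets small: each shipment is compared only with firms in the same block rather than with the whole BR.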

B. Out-of-sample matching predictions:

Predict matches on Q_na using only addresses (model M_a).
Evaluate both models using the F1 score (the harmonic mean of precision and recall).
Use the M_a model to predict matches on Q_a.
Merge the matches from Q_na and Q_a back onto the TCOD.
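A minimal sketch of the out-of-sample steps, under several assumptions that are not in the proposal: the candidate pairs are simulated, a hand-labelled `is_match` column is assumed to exist for Q_na, and logistic regression (`glm`) stands in for whichever machine learning model is actually used. The helper names `f1_score()` and `predict_match()` and the 0.5 threshold are hypothetical.

```r
set.seed(42)
n <- 400

# Simulated Q_na: candidate pairs with a hypothetical hand-labelled outcome.
q_na <- data.frame(is_match = rbinom(n, 1, 0.3))
clip01 <- function(x) pmin(pmax(x, 0), 1)
q_na$name_sim    <- clip01(rnorm(n, ifelse(q_na$is_match == 1, 0.85, 0.35), 0.20))
q_na$address_sim <- clip01(rnorm(n, ifelse(q_na$is_match == 1, 0.80, 0.45), 0.20))

# Hold out part of Q_na so both models are scored on pairs they did not see.
train   <- q_na[1:300, ]
holdout <- q_na[301:400, ]

# M_na uses names and addresses; M_a uses addresses only.
m_na <- glm(is_match ~ name_sim + address_sim, family = binomial, data = train)
m_a  <- glm(is_match ~ address_sim, family = binomial, data = train)

# F1 score: harmonic mean of precision and recall.
f1_score <- function(actual, predicted) {
  tp <- sum(predicted == 1 & actual == 1)
  fp <- sum(predicted == 1 & actual == 0)
  fn <- sum(predicted == 0 & actual == 1)
  if (tp == 0) return(0)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  2 * precision * recall / (precision + recall)
}

# Turn predicted probabilities into 0/1 match decisions.
predict_match <- function(model, data, threshold = 0.5) {
  as.integer(predict(model, newdata = data, type = "response") >= threshold)
}

f1_na <- f1_score(holdout$is_match, predict_match(m_na, holdout))
f1_a  <- f1_score(holdout$is_match, predict_match(m_a, holdout))

# Apply the address-only model to Q_a, where no names are available.
q_a <- data.frame(address_sim = c(0.95, 0.60, 0.20))
q_a$match_prob <- predict(m_a, newdata = q_a, type = "response")
```

Comparing `f1_na` with `f1_a` on the same held-out pairs indicates how much accuracy is lost by dropping names, which is the information needed before trusting M_a on Q_a.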

Results

A dataset that gives potential matches between the TCOD and BR, along with the model-implied certainty of each match. A match that uses name and address information is likely to be a certain match, while a match that uses only an address is less certain, especially if two or more businesses are registered at that address. We would like the user of the resulting matched dataset to be able to access and understand this uncertainty.
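One way a user might work with the certainty measures is sketched below. Everything here is hypothetical: the column names, the toy probabilities, and the 0.5 and 0.8 thresholds are illustrative choices, not part of the proposed dataset.

```r
# Hypothetical scored candidate pairs; match_prob is the model-implied certainty.
candidates <- data.frame(
  shipment_id = c(1, 1, 2, 3, 3),
  firm_id     = c(101, 103, 102, 101, 104),
  match_prob  = c(0.97, 0.05, 0.88, 0.55, 0.52),
  used_name   = c(TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Keep the highest-probability firm for each shipment.
ordered <- candidates[order(candidates$shipment_id, -candidates$match_prob), ]
best    <- ordered[!duplicated(ordered$shipment_id), ]

# Count how many firms are plausible for each shipment (illustrative cutoff 0.5).
plausible <- aggregate(match_prob ~ shipment_id, data = candidates,
                       FUN = function(p) sum(p >= 0.5))
names(plausible)[2] <- "n_plausible"

# Flag shipments where more than one firm is plausible, e.g. a shared address.
best <- merge(best, plausible, by = "shipment_id")
best$ambiguous <- best$n_plausible > 1

# An analyst with a strict application keeps only confident, unambiguous matches.
confident <- best[best$match_prob >= 0.8 & !best$ambiguous, ]
```

Keeping the probabilities and the ambiguity flag in the delivered dataset lets each application choose its own cutoff instead of inheriting one fixed decision.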

Workflow and implementation

The workflow for this project is: R for the code, RStudio for the integrated development environment (IDE), and RMarkdown for the documentation/communication.

Why R for this project and workflow?

The data scientist Hilary Parker suggested a data development/analysis workflow should satisfy three conditions: (a) reproducible and auditable, (b) accurate, and (c) transparent. The R/RStudio/RMarkdown combination, when properly implemented, achieves these conditions. (Although current Statcan analysts have these goals in mind when writing SAS and Stata projects, the R toolkit puts them front and centre and makes them much easier to achieve.)

The R/RStudio/RMarkdown workflow puts these goals first. Writing an R package with these developer tools requires the developer to put organization, documentation and testing first, so the workflow can be reproducible, accurate and transparent.

R

R, the statistical programming language itself, is powerful and open source, and is one of the standard languages used in data science and statistics. R has several packages for testing; the package testthat is written by RStudio and integrates well with their development tools, and the data testing package assertr is maintained under the organization rOpenSci.
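As a rough illustration of the two kinds of testing mentioned above, the sketch below checks a function with testthat and checks data with a base R assertion written in the spirit of assertr. The functions `clean_postal_code()` and `check_postal_codes()` are hypothetical examples, not functions from the package, and the testthat block runs only if testthat is installed.

```r
# Hypothetical helper: normalise a postal code to six upper-case characters.
clean_postal_code <- function(x) {
  toupper(gsub("[^A-Za-z0-9]", "", x))
}

# A data check in the spirit of assertr, written in base R so it runs anywhere:
# stop with an informative error if any value is not a valid postal code.
check_postal_codes <- function(x) {
  ok <- grepl("^[A-Z][0-9][A-Z][0-9][A-Z][0-9]$", x)
  if (!all(ok)) {
    stop("invalid postal codes: ", paste(x[!ok], collapse = ", "))
  }
  invisible(x)
}

cleaned <- clean_postal_code(c("k1a 0b1", "M5V-1K4"))
check_postal_codes(cleaned)

# A unit test with testthat, run only when the package is available.
if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("postal codes are cleaned and validated", {
    testthat::expect_equal(clean_postal_code("k1a 0b1"), "K1A0B1")
    testthat::expect_error(check_postal_codes("12345"))
  })
}
```

Function tests guard the code against regressions, while data checks guard each run against unexpected input; a matching pipeline benefits from both.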

RStudio

RStudio is an integrated development environment (IDE) for R. Its GUI makes R more user-friendly (with help, shortcuts, and code syntax highlighting) and, most importantly, it supports project organization, package development, documentation, testing and package sharing.

RMarkdown

The final step of an analysis is communicating results. Normally, one would write code that outputs results in an Excel table, and then write a separate report detailing the data processing and methodology. With RMarkdown, you can write an executable script that also includes descriptions and explanations of each step of the methodology. With confidential data, you can choose which output is displayed and present the final copy as a static document; with non-confidential (or synthetic) data, you can provide a dynamic, interactive document that a reader can use to walk through each step of the code and results.
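To make the idea of choosing which output is displayed concrete, the sketch below writes a minimal RMarkdown file from R. The file name, title and chunk contents are hypothetical; `include=FALSE` and `echo=FALSE` are standard knitr chunk options. The chunk delimiter is built with `strrep()` only so the example is easy to embed in another document.

```r
fence <- strrep("`", 3)  # the three-backtick chunk delimiter

# Lines of a minimal RMarkdown document mixing narrative and executable chunks.
rmd_lines <- c(
  "---",
  'title: "Matching TCOD shipments to BR firms"',
  "output: html_document",
  "---",
  "",
  "Candidate pairs are blocked on postal code before they are scored.",
  "",
  paste0(fence, "{r load-data, include=FALSE}"),
  "# include=FALSE runs the chunk but shows neither its code nor its output,",
  "# which suits steps that touch confidential microdata.",
  "pairs <- data.frame(match_prob = c(0.97, 0.55, 0.88))",
  fence,
  "",
  paste0(fence, "{r summary, echo=FALSE}"),
  "# echo=FALSE hides the code but keeps the aggregate output.",
  "summary(pairs$match_prob)",
  fence
)

# Write the document to a temporary location.
rmd_path <- file.path(tempdir(), "matching-report.Rmd")
writeLines(rmd_lines, rmd_path)
```

With the rmarkdown package and pandoc installed, calling `rmarkdown::render(rmd_path)` would turn this file into a static HTML report in which only the summary output is visible.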

Conclusion

One half of the proposal is a data development project: matching the TCOD and BR using machine learning. The other half is the workflow supporting that project. The deliverables include the matched dataset, the reproducible workflow and documentation, and recommendations and lessons from implementing the workflow. The code should include functions that can be easily shared and re-used in other projects.

The focus of this workflow is communication; work is only useful if it's communicated, and communicating the workflow and code is becoming as important as communicating the result itself, if we're interested in improving the process of analysis and research.

tweed1e/matchtools documentation built on May 29, 2019, 10:51 a.m.
