data-raw/README.md

Instructions

To process the data in this package from source, first run process_raw.py to download, decompress, and extract the relevant columns of data for each vital statistics file. Then run process_reduced.R.

Definition Challenges

The birth certificate records that we're working with span decades, and have gone through numerous changes throughout that time. Some information that used to be available is no longer included in the public data sets (e.g. state of occurrence), some new information has been added (e.g. delivery method), and many of the fields have undergone transformations over time (e.g. place of delivery used to include "En route or born on arrival (BOA)", but this value was dropped from the records in 1988). None of this is terribly problematic when analysis is performed on only one or two years of records, but spanning the entire length of these public records requires more complex solutions.

Minor fluctuations, such as a change in "tape" position or a change in field label without a change in values, are addressed through data-raw/dictionary.json and data-raw/process.py. More complex changes (e.g. what do we do now that the code for Oregon changed from 43 to OR?) are handled via logic contained in data-raw/process_reduced.R.
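
As an illustration of the first kind of change, a field's "tape" position can be looked up by year. This is only a sketch: the layout below is hypothetical and does not reflect the actual structure of dictionary.json, and the field name, years, and positions are made up for the example.

```python
# Hypothetical layout: each field lists (first_year, last_year, start, end),
# with 1-based, inclusive column positions as in a fixed-width record layout.
FIELD_POSITIONS = {
    "birth_month": [
        (1968, 1988, 84, 85),
        (1989, 2002, 172, 173),
    ],
}


def position_for(field, year, layout=FIELD_POSITIONS):
    """Return a Python slice for `field` in a record from `year`."""
    for first_year, last_year, start, end in layout[field]:
        if first_year <= year <= last_year:
            # Convert 1-based inclusive positions to a 0-based slice.
            return slice(start - 1, end)
    raise KeyError(f"no position recorded for {field!r} in {year}")
```

With a lookup like this, the same field name can be read from any year's file even though its position on the record has moved.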

Technical Challenges

The birth certificate data sets from 1968 to 2014 amount to over 5 GB when compressed. Decompressing all of this data at once is problematic on the typical workstation, and even after aggressive pruning of columns, loading hundreds of millions of records directly into memory (a requirement for analysis in R) will exhaust the memory of most workstations.

This issue is solved via a multi-step data processing pipeline that incrementally decompresses the raw birth record data, prunes columns, and then reduces rows after equivalent values are mapped across years. The result is a data set that can easily be shared, but is still rich enough to perform meaningful analysis.
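
One way to picture the row reduction is collapsing identical records into a single row with a count. The sketch below assumes that is what "reducing rows" means here. The package does this step in R, so the function name and the sample values are illustrative only.

```python
from collections import Counter


def reduce_rows(records):
    """Collapse identical records into tuples with a trailing count."""
    counts = Counter(tuple(record) for record in records)
    return [(*record, n) for record, n in counts.items()]


# Made-up records: (year, state, sex)
sample = [
    ("1970", "OR", "F"),
    ("1970", "OR", "F"),
    ("1970", "WA", "M"),
]
reduced = reduce_rows(sample)
```

The fewer columns that are kept, the more records become identical, which is why pruning columns first makes this reduction effective.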

The pipeline is not executed during the R package build; it must be run separately, as a precursor step, to generate the data sets that are used in the package. This can take many hours to complete, and relies on Python 3, along with several Linux-based utilities.

If you wish to make changes to the births data that are generated by this process and shipped with the package, please contact the author for support.

Tool Challenges

Because some of the archives provided by the CDC use proprietary compression, the native zipfile package in Python is not able to unzip all of the data sets. Instead, the subprocess package is used to make an external call to the Linux 7z utility. If your environment doesn't support this, you'll need to decompress all of the files manually, or modify the scripts to use your local decompression tool. In addition to the archive handling done from Python, a Linux call is made from R to decompress and read files directly into memory with the zcat utility. You can try to figure all this out on your own, or ask the author for help.
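
A minimal sketch of calling 7z through subprocess is shown below. It is not the code in process_raw.py, and the function names are hypothetical. It assumes the 7z executable is on the PATH.

```python
import shutil
import subprocess


def build_7z_command(archive, out_dir):
    """Build the argument list for extracting `archive` into `out_dir`."""
    # "x" extracts with full paths, "-o<dir>" sets the output directory
    # (no space after -o), and "-y" answers yes to any prompts.
    return ["7z", "x", str(archive), f"-o{out_dir}", "-y"]


def extract(archive, out_dir):
    """Extract `archive` with the external 7z utility."""
    if shutil.which("7z") is None:
        raise RuntimeError("7z not found on PATH; install p7zip-full")
    # check=True raises CalledProcessError if 7z exits with an error.
    subprocess.run(build_7z_command(archive, out_dir), check=True)
```

Passing the command as a list avoids shell quoting problems with unusual file names.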

Environments

Linux

The publishers of the vital statistics data have varied their compression scheme over the years, which has caused some challenges in automating the extraction of files. Presently, the most reliable solution seems to be using the 7zip extraction tool from the Linux command line. This requires the 7zip tool to be installed in your Linux environment:

sudo apt install p7zip-full

Python

Python performs four main functions in this process:

  1. Obtain raw data sets from the CDC FTP servers

  2. Unpack raw data sets (using external system utilities)

  3. Read raw data set line-by-line, and write only the fields that we are interested in keeping, as specified by the data-raw/dictionary.json file

  4. Compress the much reduced data set
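
Steps 3 and 4 can be sketched roughly as follows. The field specification here is a hypothetical stand-in for whatever dictionary.json actually contains, and the function names are not taken from process_raw.py.

```python
import csv
import gzip


def extract_fields(lines, fields):
    """Yield the selected fields from each fixed-width input line.

    `fields` maps a field name to a (start, end) pair of 1-based,
    inclusive column positions.
    """
    slices = [slice(start - 1, end) for start, end in fields.values()]
    for line in lines:
        yield [line[s].strip() for s in slices]


def write_reduced(lines, fields, out_path):
    """Write a header and the reduced rows to a gzip-compressed CSV."""
    with gzip.open(out_path, "wt", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(list(fields))
        writer.writerows(extract_fields(lines, fields))
```

Because `lines` can be an open file object, records are processed one at a time and the full raw file never has to be held in memory.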

While all of the scripts necessary to perform these functions are included in this repository, they are not necessarily configured in a way that is easy to reproduce (e.g. load into an R package). To perform raw data processing on this project, it is recommended that you download the latest Python 3 Anaconda distribution, and install each of the packages imported by process_raw.py using the conda utility.

R

At this point, R is used to read and transform the annual data sets, stitch them together, and further reduce the records. This involves extensive business logic, contained in process_reduced.R.



Mikuana/vitalstatistics documentation built on May 7, 2019, 4:57 p.m.