To process the data in this package from source, first run `process_raw.py` to download, decompress, and extract the relevant columns of data from each vital statistics file. Then execute `process_reduced.R`.
The birth certificate records that we're working with span decades, and have gone through numerous changes throughout that time. Some information that used to be available is no longer included in the public data sets (e.g. state of occurrence), some new information has been added (e.g. delivery method), and many of the fields have undergone transformations over time (e.g. place of delivery used to include "En route or born on arrival (BOA)", but this value was dropped from the records in 1988). None of this is terribly problematic when analysis is performed on only one or two years of records, but spanning the entire length of these public records requires more involved solutions.
Minor fluctuations, such as a change in "tape" position or a change in field label without a change in values, are addressed through `data-raw/dictionary.json` and `data-raw/process.py`. More complex changes (e.g. what do we do now that the code for Oregon changed from `43` to `OR`?) are handled via logic contained in `data-raw/process_reduced.R`.
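A minimal sketch of how a position dictionary can drive field extraction across years; the field names, year ranges, and tape positions below are invented for illustration (the real layout lives in `data-raw/dictionary.json`):

```python
import json  # the real dictionary would be loaded with json.load()

# Hypothetical structure mirroring data-raw/dictionary.json: for each field,
# the fixed-width "tape" slice (start, stop) for each range of years.
DICTIONARY = {
    "dob_yy": {"1968-2002": (0, 4), "2003-2014": (14, 18)},
    "sex":    {"1968-2002": (34, 35), "2003-2014": (435, 436)},
}

def positions_for(field, year):
    """Look up the (start, stop) slice for a field in a given year."""
    for span, pos in DICTIONARY[field].items():
        lo, hi = (int(y) for y in span.split("-"))
        if lo <= year <= hi:
            return pos
    raise KeyError(f"{field} is not defined for {year}")

def extract(line, year, fields):
    """Pull the requested fields out of one fixed-width record."""
    return {f: line[slice(*positions_for(f, year))] for f in fields}
```

With this layout, a change in tape position between file years is a one-line edit to the dictionary rather than a code change.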
The birth certificate data sets from 1968 to 2014 amount to over 5 GB when compressed. Simultaneous decompression of this data is problematic on the typical workstation, and even after aggressive pruning of columns, loading hundreds of millions of records directly into memory (a requirement for analysis in R) will overflow most workstations.
This issue is solved via a multi-step data processing pipeline that incrementally decompresses the raw birth record data, prunes columns, and then reduces rows after equivalent values are mapped across years. The result is a data set which can easily be shared, but still rich enough to perform meaningful analysis.
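The row-reduction step can be sketched as follows. The state codes and `STATE_MAP` here are illustrative assumptions, not the package's actual mapping: once equivalent values are harmonized, identical records collapse into counts, so millions of rows become a small frequency table.

```python
from collections import Counter

# Hypothetical harmonization: older files code states numerically ("43"),
# newer ones use postal abbreviations ("OR").
STATE_MAP = {"43": "OR", "OR": "OR", "32": "NY", "NY": "NY"}

def reduce_rows(records):
    """Map equivalent values to a common code, then collapse identical
    (state, sex) records into counts."""
    counts = Counter((STATE_MAP.get(state, state), sex) for state, sex in records)
    return [(state, sex, n) for (state, sex), n in sorted(counts.items())]
```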
The pipeline is not executed during the R package build; it must be run separately as a precursor step to generate the data sets that are used in the package. This can take many hours to complete, and relies on Python 3, along with several Linux-based utilities.
If you wish to make changes to the `births` data that are generated by this process and shipped with the package, please contact the author for support.
Due to proprietary compression formats, the native `zipfile` package in Python is not able to unzip all of the data sets provided by the CDC. Instead, the `subprocess` package is used to make an external call to the Linux `7z` utility. If your environment doesn't support this, you will need to decompress the files manually, or modify the scripts to use your local decompression tool. In addition to unzipping files from Python, a Linux call is made from R to decompress and read files directly into memory with the `zcat` utility. If you have trouble replicating this setup, ask the author for help.
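A sketch of such an external call, assuming the standard `7z` command-line interface (the archive and destination paths are placeholders):

```python
import shutil
import subprocess

def extract_with_7z(archive, dest):
    """Shell out to 7z, since the native zipfile module cannot read
    some of the CDC archives."""
    if shutil.which("7z") is None:
        raise RuntimeError("7z not found; install p7zip-full or use another extractor")
    # "x" extracts with full paths; -o sets the output dir; -y answers yes to prompts.
    subprocess.run(["7z", "x", archive, f"-o{dest}", "-y"], check=True)
```

`check=True` makes a failed extraction raise immediately instead of silently producing a partial file.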
The publishers of the vital statistics data have varied their compression scheme over the years, which has caused some challenges in automating the extraction of files. Presently, the most reliable solution seems to be using the 7-Zip extraction tool by dropping down to the Linux CLI. This requires installing 7-Zip in your Linux environment:
sudo apt install p7zip-full
Python performs four main functions in this process:

1. Obtain raw data sets from the CDC FTP servers
2. Unpack raw data sets (using external system utilities)
3. Read each raw data set line by line, and write only the fields that we are interested in keeping, as specified by the `data-raw/dictionary.json` file
4. Compress the much-reduced data set
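The last two functions, field extraction and recompression, can be sketched together as follows; the column positions here are illustrative, since the real ones come from `data-raw/dictionary.json`:

```python
import gzip

def process_year(raw_path, out_path, field_slices):
    """Read a decompressed raw file line by line, keep only the selected
    field slices, and write the reduced records gzip-compressed.
    Streaming one record at a time keeps memory use flat."""
    with open(raw_path, encoding="latin-1") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(",".join(line[s] for s in field_slices) + "\n")
```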
While all of the scripts necessary to perform these functions are included in this repository, they are not necessarily configured in a way that is easy to reproduce (e.g. load into an R package). To perform raw data processing on this project, it is recommended that you download the latest Python 3 Anaconda distribution, and install each of the packages imported by `process_raw.py` using the `conda` utility.
At this point, R is used to read, transform, and stitch the annual data sets together, and then further reduce the records. This includes extensive business logic in `process_reduced.R`.