This document details the SPEW development process. It is written for people adding code to the SPEW package. After reading, contributors should understand the organization of SPEW and how to contribute. The goal is for all SPEW software to be consistent, robust, and modular.

The process is by no means optimal, and we are interested in improvements. However, we need to guard against introducing error, since SPEW ecosystems are already publically available. Therefore, the SPEW process contains many checks and balances for ensuring new code is as correct as possible.

This document is organized as follows. We first describe SPEW's organization, followed by the R-code structures, and conclude with the documentation system.

SPEW Overview

SPEW serves two purposes:

  1. Generating and releasing synthetic ecosystems
  2. Providing a general-purpose synthetic ecosystem generator

To meet goal 1, we generate and host our synthetic ecosystems on the Olympus computing cluster, hosted by the Pittsburgh Supercomputing Center (PSC). To meet goal 2, we organize SPEW into an R-package, avaialble for users with custom data.

SPEW Organization

All SPEW code is on Github at https://github.com/leerichardson/spew. The olympus/ directory contains all code used for generating and hosting our ecosystems, doc/ includes our formal documentation, and everything else corresponds to the R-package.

Olympus

The olympus/ directory contains code for running SPEW on Olympus, generating reports, collecting and formatting data, analyzing logfiles, and generating the SPEW geographic hierarchy. All of our files are located on Olympus at:

/mnt/beegfs1/data/shared_group_data/syneco.

In this directory, we have the following sub-directories:

Our final ecosystems are available online at: http://data.olympus.psc.edu/syneco/. This website is just an online version of the Olympus directory:

/mnt/lustre0/machines/data.olympus.psc.edu/srv/apache/data/syneco

Meaning that anything we want online, we simply put into this directory. We post finalized versions of our ecosystems using "symlinks" between this directory and /mnt/beegfs1/data/shared_group_data/syneco to avoid copying.

We now briefly explain each subdirectory inside olympus/.

R-Package

The primary reference for our spew R-package is the book R Packages, available online at:

http://r-pkgs.had.co.nz/.

For new contributors, the two most important parts are the R/ folder, where all R-code lives, and tests/, for verifying that our R-code works. The key concept to remember is that all code should have a corresponding test, and all tests should pass before adding code.

We include internal data for verifying our code and tests work, organized as follows:

An example of how this works: For a new version of SPEW, I download the input data into inst/extdata. Then, I write an r-script in data-raw which saves the binary (.rda) file for easy use in data/.

R-Code

We (ideally) follow the style guide available from R-packages at: http://r-pkgs.had.co.nz/r.html. Unfortunately, some spew code was hastily written, so our package is not entirely consistent. Please strive to follow these conventions when adding code. Additionally, please make sure code is adequately commented. We are available to review any new code added to SPEW.

An example of re-factoring existing code is contained in the appendix.

Functions

The R/ directory has files for various groupings of functions. It is important to stress that all R-code in these files should be functions: not scripts. Ideally, functions should work generally, and not be tailored to specific file-formats or Olympus (with obvious exceptions: the R/read.R files only purpose is reading data on Olympus). For new data-sources, a good idea is to think about what a "general" input/output relationship would look like, and pre-process any data into that format before writing the functions.

The main functions are in the spew.R file. This contains the call_spew function for Olympus, spew which works more generally. spew loops over each region specified in the population table, and generates a synthetic ecosystem for each using the spew_place function. We include three different wrapper functions spew_seq, spew_sock, spew_mpi, and spew_mc which generate ecosystems with different parallel back-ends. We include these functions because some ecosystems (e.g. the United States) require more nodes (MPI), and others (e.g. IPUMS) require more memory (MC).

Unit Tests

Key to keeping our functions correct are unit-tests. Unit testing is covered in detail here http://r-pkgs.had.co.nz/tests.html. All functions added to spew should have a corresponding tests, and all test must pass before code is added. If it is the end of the day, and the code still doesn't work, it is better to wait than pushing it to Github without testing.

Integration with Travis CI

The spew R-package has been integrated with the Travis Continous Integration service. Since spew will go on CRAN, one requirement for this is that it must pass R CMD CHECK http://r-pkgs.had.co.nz/check.html without any errors, warnings, or notes.

Practically speaking, this means that everytime someone pushes new code to Github, Travis automatically runs R CMD CHECK on the new code-base. If any errors are introduced, I recieve an e-mail explaining what went wrong. Both Travis and Unit Testing are used to ensure the correctness of our code automatically, so that we can avoid introducing bugs.

Minor Details

my_fun <- function(a, b) {
  if (!requireNamespace("pkg", quietly = TRUE)) {
    stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
  }
}

Documentation

SPEW has two layers of documentation:

  1. R-functions
  2. Formal Documentation for Ecosystem Release.

R-Functions

We document R-functions following http://r-pkgs.had.co.nz/man.html. The documentation should be as concise as possible. Additionally, remember to move to a new-line when writing documentation, and try to keep less than 100 characters per line.

SPEW Ecosystems

Each release of SPEW ecosystems codes with a corresponding .pdf documentation, located at doc/ecosystems/VERSION. While the template and wording remains similar, we need to revise this carefully for each release, and make sure all relevant details are contained here. The following are components we need in each set of .pdf documentation.

New Data Sources

All data-sources should used in this release must be documented here. If you are adding new data, you must document it here. All code for downloading and pre-processing data should be made avaialable at: olympus/data.

New Methodology

All methodology for spew must be detailed here. If you are adding new methodology, it must be formally documented here.

Clarifications for this release

All details of the release must be documented. For example, we might need to record that "For Canada, we adapted the version of spatial sampling because there was no lake data".

Description of output directories

We have a complicated directory for each ecosystem, which should be layed out here.

Appendix

Extraneous details on the SPEW process are given here.

Re-factoring R Code

This section revises an existing spew R-function for compliance with the style guide. We hope this provides inspiration for how to get the code looking good, and eventually revise all code for consistency.

Here is an example function from the R/ipf.R file. This is the IPF-wrapper function, sequentially calling functions implementing smaller pieces of the IPF method. There are three steps involved in the procedure:

  1. Aligning the PUMS microdata with the Marginal tables
  2. Estimating the contingency table
  3. Sampling using the contingency table weights
sample_ipf <- function(n_house, pums_h, pums_p, marginals, alpha = 0, k = .001, 
                       puma_id = NULL, place_id = NULL, do_subset_pums = TRUE) {
                                        # Step 1: Align PUMS with Marginals

    if(do_subset_pums){
        pums <- subset_pums(pums_h = pums_h, pums_p = pums_p, marginals = marginals, puma_id = puma_id)
    } else {
        pums <- pums_h
    }
    pums <- align_pums(pums, marginals)


  # Step 2: Fill in the contingency table
    table <- fill_cont_table(pums = pums, marginals = marginals, place_id = place_id, n_house = n_house)

                                        # Write out the contingency table HERE.

  # Step 3: Sample with contingency table weights 
  households <- sample_with_cont(pums = pums, table = table, alpha = alpha, 
                                 k = k, marginals = marginals)

  return(households)
}

We immediately spot problems with the function:

  1. Some comments have been moved to the far-right corner of the line.
  2. Step 2 of the procedure and the corresponding comment are not aligned.
  3. The syntax surrounding the if statement doesn't follow the spacing section of the Style guide.

Fixing these issues:

sample_ipf <- function(n_house, pums_h, pums_p, marginals, alpha = 0, k = .001, 
                       puma_id = NULL, place_id = NULL, do_subset_pums = TRUE) {
  # Step 1: Align PUMS with Marginals
  pums <- subset_pums(pums_h = pums_h, pums_p = pums_p, marginals = marginals, puma_id = puma_id)
  pums <- align_pums(pums, marginals)

  # Step 2: Fill in the contingency table
  table <- fill_cont_table(pums = pums, marginals = marginals, place_id = place_id, n_house = n_house)
  # Write out the contingency table HERE.

  # Step 3: Sample with contingency table weights 
  households <- sample_with_cont(pums = pums, table = table, alpha = alpha, 
                                 k = k, marginals = marginals)

  return(households)
}

We can extract various principles out of this revision

Consistency

We want spew code following consistent syntax. Not to be pedantic or dogmatic, but rather for keeping consistency across contirbutors. It is ofter difficult to understand someone else's code following a different coding style or syntax. This is highly inefficient, since a lot work goes into parsing, testing, and understanding the code. Following a consistent syntax alleviates this issue.

If-statements

Generally speaking, we want to minimize ad-hoc if-statements in the SPEW code. However, in many situations if-statements are the only method for solving the problem. When if-statements are required, we really want it to be well documented, so anyone reading it knows why it should be there.

A strategy for if-statements:

  1. Avoid ad-hoc if statements if possible
  2. If the if-statement is necessary, include it at the appropriate level (not the top-level wrapper function)
  3. Clearly describe why, if the if-statement is nececssary the if-statement is necessary

Using defaults to add new-features

When adding new features to spew, add as an option in one place, and set the existing methodology to be the default. This way, we only need to update the code in one place, rather than having to propogate changes in multiple functions.

The SPEW algorithm

The core of our SPEW implementation is the function, spew. This section describes the function in detail.

Inputs

SPEW has three types of input data:

  1. Required Data
  2. Supplementary data
  3. Options

Required data refers to the three essential input data-sources

| Data Types | R-variable name | Required Variables | Format | |---|---|---| | Population Counts | pop_table | place_id, puma_id | .csv | | Shapefile | shapefile | place_id | .shp | | Indidivual Microdata | pums_p| puma_id | .csv | | Household Microdata | pums_h | puma_id | .csv |

All spew calls require these four inputs as data.

Supplementary data means any external data required outside of the original data. Right now, we have types:

  1. Schools
  2. Workplaces
  3. Roads
  4. Marginal tables for IPF
  5. Marginal tables for MM

The default is for each of these to be set to NULL. The spew function checks this, and will assign supplementary data only if it is avaialble.

Options are various parameters we tweak during runs of SPEW. For instance, one is "output_type", which determines whether we return the ecosystem either inside the R secction, or write the output.



leerichardson/spew documentation built on May 21, 2019, 1:39 a.m.