In leerichardson/spew: SPEW Framework for Generating Synthetic Ecosystems

This document details the SPEW development process. It is written for people adding code to the SPEW package. After reading, contributors should understand the organization of SPEW and how to contribute. The goal is for all SPEW software to be consistent, robust, and modular.

The process is by no means optimal, and we are interested in improvements. However, we need to guard against introducing error, since SPEW ecosystems are already publically available. Therefore, the SPEW process contains many checks and balances for ensuring new code is as correct as possible.

This document is organized as follows. We first describe SPEW's organization, followed by the R-code structures, and conclude with the documentation system.

SPEW Overview

SPEW serves two purposes:

Generating and releasing synthetic ecosystems
Providing a general-purpose synthetic ecosystem generator

To meet goal 1, we generate and host our synthetic ecosystems on the Olympus computing cluster, hosted by the Pittsburgh Supercomputing Center (PSC). To meet goal 2, we organize SPEW into an R-package, avaialble for users with custom data.

SPEW Organization

All SPEW code is on Github at https://github.com/leerichardson/spew. The olympus/ directory contains all code used for generating and hosting our ecosystems, doc/ includes our formal documentation, and everything else corresponds to the R-package.

Olympus

The olympus/ directory contains code for running SPEW on Olympus, generating reports, collecting and formatting data, analyzing logfiles, and generating the SPEW geographic hierarchy. All of our files are located on Olympus at:

/mnt/beegfs1/data/shared_group_data/syneco.

In this directory, we have the following sub-directories:

spew/: Copy of the current spew-software hosted on Github. With devtools, this copy allows us to quickly update SPEW code to run on Olympus, by simply using: devtools::load_all("/mnt/beegfs1/data/shared_group_data/syneco") (e.g. olympus/run_spew.R). I use the program (Sublime SFTP) to edit files on Olympus locally, which saves the hassle of SCP's files back and forth. A git repository is set up in this directory, so you can commit, push, etc. just as you can locally.
spew_NUMBER/: Releases of spew synthetic populations. For example, spew_1.2.0 is the release of version 1.2.0 of our synthetic ecosysystems.
spew_input/: Input data used in our synthetic ecosystems. Recently split into its own directory to avoid duplication.
old/: Old files we no longer use. The most useful part of this directory is old/spew_olympus/getting_data, which contains originally downloaded data-sources.

Our final ecosystems are available online at: http://data.olympus.psc.edu/syneco/. This website is just an online version of the Olympus directory:

/mnt/lustre0/machines/data.olympus.psc.edu/srv/apache/data/syneco

Meaning that anything we want online, we simply put into this directory. We post finalized versions of our ecosystems using "symlinks" between this directory and /mnt/beegfs1/data/shared_group_data/syneco to avoid copying.

We now briefly explain each subdirectory inside olympus/.

call_spew/: Scripts to call the SPEW ecosystems, with our directory structure, on Olympus.
data/: Ideally contains all code for downloading and pre-processing data. Should provide a "record" of how we can all data into spew_input/. We still want to merge files from old/spew_olympus
logfiles/:Extracting and analyzing logfiles from SPEW runs, used to generate figures in the paper.
reports/: Generating diagnostic reports
spew_hierarchy/: Creating our directory structure
misc/: Everything else, including for custom ecosystems (e.g. Canada)

R-Package

The primary reference for our spew R-package is the book R Packages, available online at:

http://r-pkgs.had.co.nz/.

For new contributors, the two most important parts are the R/ folder, where all R-code lives, and tests/, for verifying that our R-code works. The key concept to remember is that all code should have a corresponding test, and all tests should pass before adding code.

We include internal data for verifying our code and tests work, organized as follows:

data/: Formatted data we can use for our functions. Data in this directory can be loaded using data(tartanville).
data-raw/: R-scripts used to generate data located in the data/ folder.
inst/extdata: Data with same input structure as Olympus. Used to test our IO functions locally.

An example of how this works: For a new version of SPEW, I download the input data into inst/extdata. Then, I write an r-script in data-raw which saves the binary (.rda) file for easy use in data/.

R-Code

We (ideally) follow the style guide available from R-packages at: http://r-pkgs.had.co.nz/r.html. Unfortunately, some spew code was hastily written, so our package is not entirely consistent. Please strive to follow these conventions when adding code. Additionally, please make sure code is adequately commented. We are available to review any new code added to SPEW.

An example of re-factoring existing code is contained in the appendix.

Functions

The R/ directory has files for various groupings of functions. It is important to stress that all R-code in these files should be functions: not scripts. Ideally, functions should work generally, and not be tailored to specific file-formats or Olympus (with obvious exceptions: the R/read.R files only purpose is reading data on Olympus). For new data-sources, a good idea is to think about what a "general" input/output relationship would look like, and pre-process any data into that format before writing the functions.

The main functions are in the spew.R file. This contains the call_spew function for Olympus, spew which works more generally. spew loops over each region specified in the population table, and generates a synthetic ecosystem for each using the spew_place function. We include three different wrapper functions spew_seq, spew_sock, spew_mpi, and spew_mc which generate ecosystems with different parallel back-ends. We include these functions because some ecosystems (e.g. the United States) require more nodes (MPI), and others (e.g. IPUMS) require more memory (MC).

Unit Tests

Key to keeping our functions correct are unit-tests. Unit testing is covered in detail here http://r-pkgs.had.co.nz/tests.html. All functions added to spew should have a corresponding tests, and all test must pass before code is added. If it is the end of the day, and the code still doesn't work, it is better to wait than pushing it to Github without testing.

Integration with Travis CI

The spew R-package has been integrated with the Travis Continous Integration service. Since spew will go on CRAN, one requirement for this is that it must pass R CMD CHECK http://r-pkgs.had.co.nz/check.html without any errors, warnings, or notes.

Practically speaking, this means that everytime someone pushes new code to Github, Travis automatically runs R CMD CHECK on the new code-base. If any errors are introduced, I recieve an e-mail explaining what went wrong. Both Travis and Unit Testing are used to ensure the correctness of our code automatically, so that we can avoid introducing bugs.

Minor Details

If using a function from another package, use package::function. For example, if we are using the spsample function from package sp, write sp::spsample.
If using a "Suggests" package, use a requireNamespace conditional before calling (example from the R-packages book):

my_fun <- function(a, b) {
  if (!requireNamespace("pkg", quietly = TRUE)) {
    stop("Pkg needed for this function to work. Please install it.", call. = FALSE)
  }
}

Documentation

SPEW has two layers of documentation:

R-functions
Formal Documentation for Ecosystem Release.

R-Functions

We document R-functions following http://r-pkgs.had.co.nz/man.html. The documentation should be as concise as possible. Additionally, remember to move to a new-line when writing documentation, and try to keep less than 100 characters per line.

SPEW Ecosystems

Each release of SPEW ecosystems codes with a corresponding .pdf documentation, located at doc/ecosystems/VERSION. While the template and wording remains similar, we need to revise this carefully for each release, and make sure all relevant details are contained here. The following are components we need in each set of .pdf documentation.

New Data Sources

All data-sources should used in this release must be documented here. If you are adding new data, you must document it here. All code for downloading and pre-processing data should be made avaialable at: olympus/data.

New Methodology

All methodology for spew must be detailed here. If you are adding new methodology, it must be formally documented here.

Clarifications for this release

All details of the release must be documented. For example, we might need to record that "For Canada, we adapted the version of spatial sampling because there was no lake data".

Description of output directories

We have a complicated directory for each ecosystem, which should be layed out here.

Appendix

Extraneous details on the SPEW process are given here.

Re-factoring R Code

This section revises an existing spew R-function for compliance with the style guide. We hope this provides inspiration for how to get the code looking good, and eventually revise all code for consistency.

Here is an example function from the R/ipf.R file. This is the IPF-wrapper function, sequentially calling functions implementing smaller pieces of the IPF method. There are three steps involved in the procedure:

Aligning the PUMS microdata with the Marginal tables
Estimating the contingency table
Sampling using the contingency table weights

sample_ipf <- function(n_house, pums_h, pums_p, marginals, alpha = 0, k = .001, 
                       puma_id = NULL, place_id = NULL, do_subset_pums = TRUE) {
                                        # Step 1: Align PUMS with Marginals

    if(do_subset_pums){
        pums <- subset_pums(pums_h = pums_h, pums_p = pums_p, marginals = marginals, puma_id = puma_id)
    } else {
        pums <- pums_h
    }
    pums <- align_pums(pums, marginals)


  # Step 2: Fill in the contingency table
    table <- fill_cont_table(pums = pums, marginals = marginals, place_id = place_id, n_house = n_house)

                                        # Write out the contingency table HERE.

  # Step 3: Sample with contingency table weights 
  households <- sample_with_cont(pums = pums, table = table, alpha = alpha, 
                                 k = k, marginals = marginals)

  return(households)
}

We immediately spot problems with the function:

Some comments have been moved to the far-right corner of the line.
Step 2 of the procedure and the corresponding comment are not aligned.
The syntax surrounding the if statement doesn't follow the spacing section of the Style guide.

Fixing these issues:

sample_ipf <- function(n_house, pums_h, pums_p, marginals, alpha = 0, k = .001, 
                       puma_id = NULL, place_id = NULL, do_subset_pums = TRUE) {
  # Step 1: Align PUMS with Marginals
  pums <- subset_pums(pums_h = pums_h, pums_p = pums_p, marginals = marginals, puma_id = puma_id)
  pums <- align_pums(pums, marginals)

  # Step 2: Fill in the contingency table
  table <- fill_cont_table(pums = pums, marginals = marginals, place_id = place_id, n_house = n_house)
  # Write out the contingency table HERE.

  # Step 3: Sample with contingency table weights 
  households <- sample_with_cont(pums = pums, table = table, alpha = alpha, 
                                 k = k, marginals = marginals)

  return(households)
}

We can extract various principles out of this revision

Consistency

We want spew code following consistent syntax. Not to be pedantic or dogmatic, but rather for keeping consistency across contirbutors. It is ofter difficult to understand someone else's code following a different coding style or syntax. This is highly inefficient, since a lot work goes into parsing, testing, and understanding the code. Following a consistent syntax alleviates this issue.

If-statements

Generally speaking, we want to minimize ad-hoc if-statements in the SPEW code. However, in many situations if-statements are the only method for solving the problem. When if-statements are required, we really want it to be well documented, so anyone reading it knows why it should be there.

A strategy for if-statements:

Avoid ad-hoc if statements if possible
If the if-statement is necessary, include it at the appropriate level (not the top-level wrapper function)
Clearly describe why, if the if-statement is nececssary the if-statement is necessary

Using defaults to add new-features

When adding new features to spew, add as an option in one place, and set the existing methodology to be the default. This way, we only need to update the code in one place, rather than having to propogate changes in multiple functions.

The SPEW algorithm

The core of our SPEW implementation is the function, spew. This section describes the function in detail.

Inputs

SPEW has three types of input data:

Required Data
Supplementary data
Options

Required data refers to the three essential input data-sources

| Data Types | R-variable name | Required Variables | Format | |---|---|---| | Population Counts | pop_table | place_id, puma_id | .csv | | Shapefile | shapefile | place_id | .shp | | Indidivual Microdata | pums_p| puma_id | .csv | | Household Microdata | pums_h | puma_id | .csv |

All spew calls require these four inputs as data.

Supplementary data means any external data required outside of the original data. Right now, we have types:

Schools
Workplaces
Roads
Marginal tables for IPF
Marginal tables for MM

The default is for each of these to be set to NULL. The spew function checks this, and will assign supplementary data only if it is avaialble.

Options are various parameters we tweak during runs of SPEW. For instance, one is "output_type", which determines whether we return the ecosystem either inside the R secction, or write the output.

leerichardson/spew documentation built on May 21, 2019, 1:39 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

leerichardson/spew
SPEW Framework for Generating Synthetic Ecosystems

In leerichardson/spew: SPEW Framework for Generating Synthetic Ecosystems

SPEW Overview

SPEW Organization

Olympus

R-Package

R-Code

Functions

Unit Tests

Integration with Travis CI

Minor Details

Documentation

R-Functions

SPEW Ecosystems

New Data Sources

New Methodology

Clarifications for this release

Description of output directories

Appendix

Re-factoring R Code

Consistency

If-statements

Using defaults to add new-features

The SPEW algorithm

Inputs

R Package Documentation

Browse R Packages

We want your feedback!

leerichardson/spew SPEW Framework for Generating Synthetic Ecosystems

In leerichardson/spew: SPEW Framework for Generating Synthetic Ecosystems

SPEW Overview

SPEW Organization

Olympus

R-Package

R-Code

Functions

Unit Tests

Integration with Travis CI

Minor Details

Documentation

R-Functions

SPEW Ecosystems

New Data Sources

New Methodology

Clarifications for this release

Description of output directories

Appendix

Re-factoring R Code

Consistency

If-statements

Using defaults to add new-features

The SPEW algorithm

Inputs

R Package Documentation

Browse R Packages

We want your feedback!

leerichardson/spew
SPEW Framework for Generating Synthetic Ecosystems