Table of Contents

CARB grid and HPMS processing (in R)

Nota Bene

This code actually has very little to do with the HPMS data. The main purpose is to estimate the hourly fraction of AADT for each grid cell in california, and that computation is based solely on the grid values from the VDS and WIM data. To see how and where the HPMS data enters into the system, go look at ../grid_merge/lib/query_postgres.js

CARB grid data handling

The CARB grid is a 4km grid covering California. It seems to be the smallest tractable unit used by the California Air Resources Board (CARB) to estimate air quality and emissions. Therefore, it makes sense to use the grid as a basis for generating estimates of mobile source emissions.

This package contains code I use to manipulate the CARB grid data. Essentially, the job is to model hourly changes across the grid, based on observed changes in grids that have highway links (and hourly data).

Feel free to poach any code you want from here, but doing so is probably stupid unless your name is James Marca.

Running this.

Setup config.json

First, set up config.json properly. It should look like.

    "couchdb": {
        "host": "",
        "db": "vdsdata%2fskimmed",
                "password":"mysegreded passzwrd""
        "grid_detectors": "carb%2fgrid%2fstate4k",
        "grid_hpms": "carb%2Fgrid%2Fstate4k%2fhpms%2f2015"
    "postgresql": {
        "host": "",
        "db": "hpmstest",

Install dependencies

To install the required dependencies, run

npm install

in the package directory. There should not be any errors.

Install R dependencies

To install the R dependencies properly, run

Rscript ./RinstallDeps.R

from the package directory. This will install the required R packages local to this package. It will not load them into the global R library location, and so should not be run with root user privileges.

Clean up after old runs

If you are re-running the code after a previous run, it is important to delete any files that are left over. To do that, it is safest to look under "data" and "stash" directories, and delete the year that you are about to redo. For example, for 2015 the file trees look like

  -- California.hwy.2015.RData
  -- California.hpms.2015.RData
  -- 2014
    -- ..
  -- 2015
    -- ..

So to delete the 2015 files, you would run:

rm data/*2015.RData
rm -rf stash/2015

In addition, be sure to set the configuration file value "recheck" to be numerically greater than any previous value. Otherwise, the program will go to couchdb, load up the grid square, see that it already has a model version of (say) 1.0, and decide not to redo the grid square. Alternately, you can just delete the entire year's couchdb database (for example, "carb/grid/state4k/hpms/2015").

Run the precaching program

There is a precaching program that runs the first step of the modeling jobs---assembling and partitioning the data for each day. First it will query the PostgreSQL database specified in the config.json file to generate grid "maps" of HPMS and detector-based data. Then it will partition those grids into clusters according to the clustering rules specified in the code.

It isn't strictly necessary to run this step before the general modeling run. However, doing so will allow the modeling job to jump right into modeling. It is also possible to share the data and stash directories between multiple machines if the modeling job is being spread to different computers.

Run it as follows:

node lib/trigger_R_preload_calif.js --startmonth=1  --jobs 1 --year 2012  > calif_preload_2012.log 2>&1 &

In the above case, the main calling routine (the node.js job) will write its output to calif_preload_2012.log, and each of the spawned R jobs will write its output to a file in the log directory. These files will be in the form California_preload_YYYY_M_D.log, where YYYY is the four digit year, M is the month number (from 1 to 12), and D is the day number (from 1 to 31).

This can be run at the same time as the modeling program, but in that case one should wait until the preload program has completed as many days as there are jobs in the regular modeling run. For example, if there will be 4 parallel jobs on the modeling run, then make sure the output log file for California_preload_2012_1_4.log is written out and completed, and that the main log file shows the line

got exit: 10, for  California 2012 1 4

If the modeling job will have 20 simultaneous jobs, then obviously one would have to wait for January 20th to be completed in the preload program. The modeling programs very quickly get bogged down in modeling, while this program only loads data PostgreSQL and CouchDB for the modeling runs, so as long as the preload program is working on a day beyond the initial list of days, the modeling job will never catch up and duplicate effort.

The precaching program will fill up the directories data and stash underneath the root directory of this package. If you are short on space, you can create a symlink to a bigger drive prior to running this program.

For 2012, a month of data in the stash directory takes up about 22MB, so a year of data should take up less than 300MB, although that number can rise in future years if more grid cells are covered by highway detector data and/or HPMS data.

Run the modeling program

There are two options here. First, you can run a model for each airbasin. Second, you can run for all of California at once. At this point (mid-2016), it is strongly recommended to run the all-of-california version of the clustering and modeling run.

What both programs do is to cluster the grid cells in the region with valid freeway data for the given day, associate HPMS-data grid cells, and then run the spatial-temporal modeling and predicting steps. If you run it on an airbasin boundary, then airbasins without highway detectors will not have any modeling done in them, and you'll have to run the all-California model anyway.

If you start with the all-California modeling step, rather than the per-airbasin step, then you will effectively end up with bigger clusters of grid cells, which will slow down the spatial-temporal modeling and predicting, but in the end the result is the same as the per-airbasin way.

So, to run the all-of-California model/predict code, do the following (assuming that you have 6 CPUs on your machine and approximately 4GB available per CPU):

node lib/trigger_R_gridwork_calif.js --startmonth 1  --jobs 6 --year 2012  > calif2012.log 2>&1 &

Obviously, if the machine has sufficient RAM and CPU cores to run 20 jobs, the above command line would have jobs=20, not 6. However, it is best to err on the side of fewer parallel jobs, as the RAM requirements can lead to a crash and a system shut-down in some cases.

Also, note that you have to change the year from 2012 to whatever year you are processing.

Pre-run test runs

If you want to run a shorter test prior to running the full blown modeling step, then you can tweak the code a little bit to do so.

Open up ./lib/trigger_R_gridwork_calif.js in an editor, and scroll down to about line 129, to where you see the following lines of code:

// for debugging, do a few days only
// endymd = new Date(year, startmonth, 3, 0, 0, 0)
for( ymd = new Date(year, startmonth, 1, 0, 0, 0);

You should uncomment the // endymd = new Date... line. What this does is set up the job to stop after just 2 days of processing. It will only process startmonth day 1, and startmonth day 2. Then you can inspect the log files (look under the log/... directory) and make sure that the program is doing what you expect it to do.

When everything looks good, don't forget to put back in the comment on the above line.



This is where my node.js files live.


This program is a command-line program that will fire off one or more R jobs. It used to be the only real way to get parallelism from the R code, by passing a value greater than 1 to the -j option. (jobs=2, for example). However, now that the R code is using plyr et al, it will generally run parallel jobs that equal half the current number of processors. On some machines, this means you can use -j 2 as an option to occupy all of your cores. On others, the best option is -j 1 and just use half the cores, or manually fiddle with the registerDoMC() command in Rwork/data.model.R program to set it to the desired number of parallel jobs. Usually the problem is that too much RAM is allocated for multiple jobs, so you want to cut back so as to not crash.


This program will pull out grid files from the file system and then copy them into CouchDB. It also computes the AADT for each grid, and saves as part of the save.



Lame dead end, commented out.



Heads off to PostgreSQL to grab a particular grid cell's information and geometry. The results get stashed in the task object under task.grid, and then the callback gets called with the modified task object.


This module will connect to PostgreSQL, load grids as specified, and save to CouchDB as a topojson topology object.


Used in the program flow. Will read the file located at task.file, parse the file as JSON, and save it in the task object as, then call the callback with the modified task object.


This is where all the R scripts are kept. And where the HPMS processing is done


This is the program that is called from trigger_R_gridwork.js. If you want to debug or step through the R, the easiest way to do that is to open this file in an editor, scroll to the bottom, and comment out the line that reads


Then, open up an R console in the directory Rwork, and type


There are a number of environment variables that are expected to be set in order for this program to work properly. First there are three variables that set up the link to the PostgreSQL database. Unfortunately, I have not yet parameterized the database port, so it will default to 5432. I also have not parameterized the databases, as that is a little bit pointless. The code will create a connection to a database called spatialvds, and another to a database called osm. The code expects certain tables to exist in these databases, and that really isn't something I'm ready to parameterize yet.

The next set of environment variables controls connections to CouchDB. The idea behind these is that CouchDB can have a local connection as well as a remote connection. Typically, the remote connection is the master node to which all distributed processes are synchronizing. By connecting to this remote database directly, one can avoid race conditions that might result from the delay in replication from the local db to the remote db.

However, in practice for this code, it is often better to keep both the remote and the local the same, and to assign them both to the local machine. There is at least one place in the code that will connect to the remote machine, so make sure both links are set up to connect to the local address.

Next come variables that govern this run of the code. Specifically, the code will process one air basin and one year at a time.

The air basin variable corresponds to the air basin names from the database. These are:

see also the in Rwork directory


Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.

(And thanks to devtools project for making it easy to include the above statement.)

jmarca/grid_data documentation built on May 17, 2017, 11:29 p.m.