Jumping in!

The vignette "Vizlab Example" gives a good overview of the complete process of initializing and creating a visualization product with vizlab. However, much of the infrastructure in this package exists to improve the way we collaborate while creating a visualization.

GitHub

During our collaboration "sprints", we assume all contributors are not only comfortable with git, but also with the specific git workflow that our team uses. Much more detail is available on our Version Control training page. To summarize: we have a canonical "upstream" repository that each member forks. Team members are encouraged to create a branch off their fork for each distinct task. When a task is complete, they submit a pull request to the canonical repository. A different team member should review the pull request before merging it into the master branch. We also take heavy advantage of GitHub's Issues and project task management. Ideally, one pull request should correspond to one Issue, and that Issue can be closed when the pull request is merged.

Assign a Task

Early on in vizlab development, we made a data visualization of water use in the United States. The final product can be seen here:

https://owi.usgs.gov/vizlab/water-use/

The canonical GitHub repository for that work is here:

https://github.com/USGS-VIZLAB/water-use

Let's use this project to create a fictional task and walk through the process of contributing to the project. Our fictional task is to change the way we "clean" the national water use data. When we were first given the data, there were no consistent Industrial or Thermoelectric water use values before 1960. In our fictional example, however, let's say we have now been provided with a csv file of 1955 Thermoelectric water use data by state. Our fictional project lead has created two tasks and assigned them to us:

Task #1. Pull in the data file thermoelectric_1955.csv

Task #2. Merge that data into the existing cleaned national state data.

The following sections describe the workflow for submitting code to the water-use project to complete these tasks. It is assumed we have already forked the repository and cloned our fork locally. Assuming the project is already in full swing, you should be able to jump in after loading the vizlab package:

library(vizlab)

You can build the visualization using the vizmake function:

vizmake()

Task 1: Pull in new data

Our first task was to add new (fictional) data to the project. Because the data might be sensitive, we have created a space on ScienceBase (https://www.sciencebase.gov) and managed the permissions so that only our team has read/write access. Sensitive data means that authentication with ScienceBase is required. See the dssecrets package README for details (you may need to talk with a DS team member to get access). We established a service account for ScienceBase that should be used if at all possible; otherwise, AD login is supported.
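
If you do need to fall back on your own AD credentials, the sbtools package (which vizlab uses to talk to ScienceBase) provides an interactive login. This is only a sketch, and the username below is a placeholder; the dssecrets service-account route is still preferred:

library(sbtools)

# Log in to ScienceBase interactively with your own credentials.
# "your_username@usgs.gov" is a placeholder, not a real account.
authenticate_sb("your_username@usgs.gov")

# Confirm the session is authenticated before fetching protected items.
is_logged_in()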

Before adding or changing any code in the project, we create a branch on our fork dedicated to this task.

The whole vizlab process is coordinated by the viz.yaml file. Since we are "fetching" new data, we add a new job in the fetch section:

fetch:
  # ... bunch of stuff already there ...
  -
    id: old_thermo_data
    location: cache/fetch/thermoelectric_1955.csv
    fetcher: sciencebase
    remoteItemId: "XXX123"
    remoteFilename: "thermoelectric_1955.csv"
    mimetype: text/csv
    scripts: 

What is this? We've added a new fetch job, and given it some information:

id is the unique name that other jobs can use to refer to this one.
location is the local path where the fetched file will be cached.
fetcher: sciencebase tells vizlab to use its built-in ScienceBase fetcher.
remoteItemId is the ID of the ScienceBase item that holds the file.
remoteFilename is the name of the file attached to that item.
mimetype declares the type of file being fetched.
scripts is left empty because the built-in fetcher needs no custom script.

That's it for Task #1! We commit our changes to the viz.yaml, push them up to our fork, and submit a pull request. Another member of our team should review the pull request and merge if it looks good.
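
Before opening the pull request, it is worth confirming that the new job actually runs. A quick sketch of one way to check, assuming the fetch() generic accepts a job id the same way process() does later in this vignette:

# Run only our new fetch job (assumes fetch() dispatches on a job id,
# analogous to the process() call shown later in this vignette).
fetch("old_thermo_data")

# The file should now exist at the location given in the viz.yaml.
file.exists("cache/fetch/thermoelectric_1955.csv")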

Task 2: Incorporating the data

Since we are starting a new task, it's a good idea to start a new branch. This task is to merge the new data from our csv into the data that is already available, so we will be editing an existing processing job. Here is the original "process" entry in the viz.yaml:

process:
  -
    id: calc_national_data
    location: cache/process/national_clean.rds
    processor: national_clean
    scripts: scripts/process/national_clean.R
    reader: rds
    depends:
      state_data: calc_hist_water_data

We need to add a new dependency, and fiddle with the script itself. So, let's add the new data to depends like this:

process:
  -
    id: calc_national_data
    location: cache/process/national_clean.rds
    processor: national_clean
    scripts: scripts/process/national_clean.R
    reader: rds
    depends:
      state_data: calc_hist_water_data
      old_data: old_thermo_data

So in this processing job, we've added a new entry to the "depends" field: the name old_data, whose value is the unique id of the fetch job we completed in Task 1. That means this processing job now has two dependent data sources: state_data is the 1960-2015 state data, and old_data is our new 1955 addition.
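
The names on the left-hand side of depends are what the processing script sees. A minimal sketch of how that looks from inside a processor, using readDepends(), the vizlab function the existing script already calls:

viz <- as.viz("calc_national_data")

# readDepends() returns a named list, keyed by the left-hand names
# in the depends field of the viz.yaml.
viz.data <- readDepends(viz)
names(viz.data)
# [1] "state_data" "old_data"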

Now, let's look at the original script, scripts/process/national_clean.R:

process.national_clean <- function(viz = getContentInfo(viz.id = "calc_national_data")){
  library(tidyr)
  library(dplyr)

  viz.data <- readDepends(viz)
  state_data <- viz.data[['state_data']]

  national <- state_data %>%
    group_by(year, category) %>%
    summarise(value = sum(value)) %>%
    data.frame()

  national$value[national$year < 1960 & 
                   national$category %in% c("Industrial","Thermoelectric")] <- NA

  national$year[which(is.na(national$year))] = 2015

  saveRDS(national, file=viz[["location"]])
}

We need to add the data and merge:

process.national_clean <- function(viz = getContentInfo(viz.id = "calc_national_data")){
  library(tidyr)
  library(dplyr)

  viz.data <- readDepends(viz)
  state_data <- viz.data[['state_data']]
  old_data <- viz.data[['old_data']]

  # Merge the 1955 data into the state data before summarizing.
  # Note that `by` must be a character vector of column names.
  national <- state_data %>%
    left_join(old_data, by = "year") %>%
    group_by(year, category) %>%
    summarise(value = sum(value)) %>%
    data.frame()

  # With the 1955 Thermoelectric values now merged in, this cleanup step
  # may need revisiting so it does not blank out the new data.
  national$value[national$year < 1960 & 
                   national$category %in% c("Industrial","Thermoelectric")] <- NA

  national$year[which(is.na(national$year))] = 2015

  saveRDS(national, file=viz[["location"]])
}
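
One caveat on the join above: left_join(old_data, by = "year") assumes old_data carries its 1955 values in columns that state_data does not already have. If the csv instead shares state_data's layout (say, year, state, category, and value columns, which is an assumption about this fictional file), appending the rows is the cleaner merge. A sketch of that variant:

# Alternative merge, assuming old_data has the same columns as
# state_data: stack the rows, then summarize as before.
national <- bind_rows(state_data, old_data) %>%
  group_by(year, category) %>%
  summarise(value = sum(value)) %>%
  data.frame()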

Notice how depends and location work: readDepends(viz) loads each dependency into a named list using the names from depends, and the finished product is written to the file path given by location.

Seems about right, but we should probably test. What's the easiest way to jump into your custom process task? You can load the viz object like this:

viz <- as.viz("calc_national_data")

viz
$id
[1] "calc_national_data"

$location
[1] "cache/national_clean.rds"

$processor
[1] "national_clean"

$scripts
[1] "scripts/process/national_clean.R"

$reader
[1] "rds"

$depends
$depends$state_data
[1] "calc_hist_water_data"
$depends$old_data
[1] "old_thermo_data"

$block
[1] "process"

$export
[1] FALSE

and step through our function line-by-line.

For fewer keystrokes, you can run process.national_clean in debug mode.

debug(process.national_clean)
process("calc_national_data")

Once we've verified that our job works as expected and doesn't break anything in the vizmake() build, we can submit our next pull request on this branch and patiently wait for our project lead to assign us a new task.


