In degauss-org/dht: A Collection of Functions to Assist Building DeGAUSS Containers

The following guide is a step-by-step instruction manual for developing a DeGAUSS container using tools available in the dht R package.

Creating all files needed

Within an empty directory, use dht::use_degauss_container() to create all of the files needed for a DeGAUSS container. The name of this initially empty directory is used as the name of the geomarker in the documentation, code, and image repository, so use only lowercase letters and underscores (e.g., census_block_group) to comply with container naming conventions.
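As a quick illustration of the naming advice above, the following base R sketch checks a directory name before the container files are created. The helper is_valid_degauss_name() and the example path are hypothetical and not part of {dht}; the pattern encodes only the "lowercase letters and underscores" suggestion.

```r
# Hypothetical helper (not part of {dht}): TRUE when a name uses only
# lowercase letters and underscores, as suggested above.
is_valid_degauss_name <- function(name) {
  grepl("^[a-z_]+$", name)
}

# The directory name becomes the geomarker name, so check its basename.
# The path below is a made-up example.
container_dir <- "/home/user/census_block_group"
is_valid_degauss_name(basename(container_dir))

# A name with capitals or hyphens fails the check.
is_valid_degauss_name("Census-Block-Group")

# Once the name passes, dht::use_degauss_container() would be run from
# inside that (empty) directory.
```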

Most of the added files will usually not need to be edited. However, some files must be edited, as described in the sections that follow.

Editing files to add R code and documentation

Documentation

Fill out the "Using", "Geomarker Methods", and "Geomarker Data" sections in README.md, as applicable. Make sure that the version in the example call matches the version of the latest released container.

Within the Dockerfile, ENV instructions are used to define environment variables that capture metadata about the DeGAUSS container, including the name (degauss_name), version (degauss_version), and a short description (degauss_description). These environment variables will be available to R code running from inside the container and can be used for dht::greeting() as well as other {dht} functions that read and write geomarker data. Outside of a DeGAUSS container, these are also used by get_degauss_env_dockerfile(), get_degauss_env_online(), and get_degauss_core_lib_env().
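A minimal sketch of how R code could read this metadata at runtime with base R. The values set here with Sys.setenv() are placeholders that stand in for the ENV instructions; inside a real container the variables would already be defined, and the greeting text is illustrative rather than the output of dht::greeting().

```r
# Simulate the ENV instructions from the Dockerfile (placeholder values);
# inside a DeGAUSS container these variables are already set.
Sys.setenv(
  degauss_name = "roads",
  degauss_version = "0.1.0",
  degauss_description = "proximity and length of major roads"
)

# Read all three variables at once; the result is a named character vector.
degauss <- Sys.getenv(c("degauss_name", "degauss_version", "degauss_description"))

# Build an illustrative greeting from the metadata.
greeting <- paste0(
  "Running ", degauss[["degauss_name"]], " version ", degauss[["degauss_version"]],
  ". This container returns ", degauss[["degauss_description"]], "."
)
message(greeting)
```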

When creating a new DeGAUSS container, all of these variables except degauss_description are defined automatically. The value of degauss_description needs to be edited from

ENV degauss_description="insert short description here that finishes the sentence 'This container returns ...'"

to something specific and short (ideally less than 50 characters) that finishes the sentence "This container returns ...", like

ENV degauss_description="proximity and length of major roads"
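The length guideline can be checked with a few lines of base R. This sketch parses ENV lines from Dockerfile text held in a character vector; the name and version values are placeholders, and the parsing is a simplified illustration, not how get_degauss_env_dockerfile() is implemented.

```r
# Example Dockerfile lines; the name and version are placeholder values.
dockerfile_lines <- c(
  'ENV degauss_name="roads"',
  'ENV degauss_version="0.1.0"',
  'ENV degauss_description="proximity and length of major roads"'
)

# Keep only ENV instructions, then split each into a key and an unquoted value.
env_lines <- grep("^ENV ", dockerfile_lines, value = TRUE)
keys <- sub("^ENV ([^=]+)=.*$", "\\1", env_lines)
values <- gsub('^ENV [^=]+="?|"$', "", env_lines)
degauss_env <- setNames(values, keys)

# The description should be short (ideally fewer than 50 characters).
description_length <- nchar(degauss_env[["degauss_description"]])
description_length < 50
```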

When releasing a new version of the DeGAUSS container in the future, the environment variable degauss_version can be edited in the Dockerfile, and R code in the container can use it for greetings, writing output files, and other operations that depend on the current version (or name or description). This removes the need to manually change the version number in several different locations for each new release.
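To illustrate why a single source for the version is convenient, the sketch below derives an output file name from the container name and version. The helper degauss_output_name() and the "<input>_<name>_<version>" pattern are assumptions made for this example; the {dht} functions that write geomarker data may name files differently.

```r
# Hypothetical helper: append the container name and version to the input
# file name. By default both come from the environment variables.
degauss_output_name <- function(input,
                                name = Sys.getenv("degauss_name"),
                                version = Sys.getenv("degauss_version")) {
  paste0(
    tools::file_path_sans_ext(input), "_", name, "_", version,
    ".", tools::file_ext(input)
  )
}

# Changing only the version changes the output name; no other code is edited.
old_name <- degauss_output_name("my_address_file_geocoded.csv", "roads", "0.1.0")
new_name <- degauss_output_name("my_address_file_geocoded.csv", "roads", "0.2.0")
print(c(old_name, new_name))
```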

Using R code and data files inside the container

Edit entrypoint.R by replacing the example R code with R code that completes the specific task to be performed by the container.
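The overall shape of such a script is usually "read the input file, compute the geomarker, write the result". The base R sketch below shows that shape only. The function add_example_geomarker(), the has_coords column, and the assumption that the input has lat and lon columns are all illustrative and are not taken from the generated entrypoint.R.

```r
# Hypothetical stand-in for the container's real task: flag rows that have
# usable coordinates. Assumes the input has columns named lat and lon.
add_example_geomarker <- function(d) {
  missing_cols <- setdiff(c("lat", "lon"), names(d))
  if (length(missing_cols) > 0) {
    stop("input is missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  d$has_coords <- !is.na(d$lat) & !is.na(d$lon)
  d
}

# Create a small made-up input file so the sketch runs on its own.
input <- tempfile(fileext = ".csv")
writeLines(c("id,lat,lon", "1,39.1,-84.5", "2,NA,NA"), input)

# Read, compute, write.
d <- read.csv(input)
out <- add_example_geomarker(d)
output <- tempfile(fileext = ".csv")
write.csv(out, output, row.names = FALSE)
```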

When ready to build the container, run renv::init() to initialize the renv framework and create renv.lock. Before subsequent builds, renv.lock can be updated with renv::snapshot().
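The choice between the two calls depends only on whether renv.lock already exists. The sketch below makes that decision explicit without running renv itself; next_renv_call() is a hypothetical helper written for this guide.

```r
# Hypothetical helper: report which renv call applies to a project directory.
next_renv_call <- function(project = ".") {
  if (file.exists(file.path(project, "renv.lock"))) {
    "renv::snapshot()" # a lockfile exists, so update it
  } else {
    "renv::init()" # no lockfile yet, so set up renv and create one
  }
}

# Demonstrate with a temporary, initially empty project directory.
project <- tempfile("degauss_project_")
dir.create(project)
first_call <- next_renv_call(project)

# After a lockfile exists, the recommendation changes.
file.create(file.path(project, "renv.lock"))
later_call <- next_renv_call(project)
print(c(first_call, later_call))
```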

Adding required system dependencies

R packages, especially spatial packages, often depend on external system dependencies. These will need to be installed using RUN instructions in the Dockerfile. For example, the {sf} package for R requires gdal and other programs, each of which requires a different install process depending on the operating system. Since these R packages will always run inside a Docker container based on Ubuntu 20.04, we can use remotes::system_requirements() to get the specific install instructions required for {sf}:

remotes::system_requirements("ubuntu-20.04", package = "sf")
# "apt-get install -y libudunits2-dev" "apt-get install -y libssl-dev"
# "apt-get install -y libgdal-dev"     "apt-get install -y gdal-bin"
# "apt-get install -y libgeos-dev"     "apt-get install -y libproj-dev"

These system requirements could be translated to RUN instructions for the Dockerfile to make sure they are available before the R packages that require them are installed:

RUN apt-get update \
    && apt-get install -yqq --no-install-recommends \
    libudunits2-dev \
    libssl-dev \
    libgdal-dev \
    gdal-bin \
    libgeos-dev \
    libproj-dev \
    && apt-get clean
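The translation from the system requirements to a RUN instruction can also be sketched in base R. The vector reqs below is copied from the output shown earlier rather than obtained by calling remotes::system_requirements(), so the example runs without network access; the formatting simply mirrors the RUN instruction above.

```r
# Install commands as listed in the system requirements output above.
reqs <- c(
  "apt-get install -y libudunits2-dev",
  "apt-get install -y libssl-dev",
  "apt-get install -y libgdal-dev",
  "apt-get install -y gdal-bin",
  "apt-get install -y libgeos-dev",
  "apt-get install -y libproj-dev"
)

# Strip the command prefix to leave only the system package names.
pkgs <- sub("^apt-get install -y ", "", reqs)

# Assemble one RUN instruction with a line continuation after each package.
run_instruction <- paste0(
  "RUN apt-get update \\\n",
  "    && apt-get install -yqq --no-install-recommends \\\n",
  paste0("    ", pkgs, " \\\n", collapse = ""),
  "    && apt-get clean"
)
cat(run_instruction, "\n")
```

Installing everything in a single RUN instruction keeps the packages in one image layer, which is why the individual commands are combined rather than copied one per line.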

Including files inside the container

By default, the container will copy in entrypoint.R and renv.lock for use at runtime and ignore anything else in the working directory to automatically speed up build times and keep containers smaller in size. If the container requires any other files (e.g., .rds datafiles), edit Dockerfile and .dockerignore so that the files are copied to the container and not ignored by Docker. For example, if we want to use geomarker_data.rds, we would make the following changes in Dockerfile:

    COPY entrypoint.R .
    # copy .rds file from host to container when building
    COPY geomarker_data.rds .

and in .dockerignore:

    # ignore everything
    **

    # except what we need
    !/renv.lock
    !/entrypoint.R
    # make sure the .rds file is not ignored
    !/geomarker_data.rds
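Because the two files must stay in sync, a small consistency check can catch a file that is copied in the Dockerfile but still ignored. The base R sketch below works on example file contents held in character vectors and handles only simple single-source COPY lines and "!" exceptions; it is an illustration, not a full parser for either format.

```r
# Example contents; here the .rds exception was forgotten in .dockerignore.
dockerfile <- c(
  "COPY renv.lock .",
  "COPY entrypoint.R .",
  "COPY geomarker_data.rds ."
)
dockerignore <- c("**", "!/renv.lock", "!/entrypoint.R")

# Source files named in simple "COPY <source> <destination>" lines.
copy_lines <- grep("^COPY\\s", dockerfile, value = TRUE, perl = TRUE)
copied <- sub("^COPY\\s+(\\S+)\\s+\\S+\\s*$", "\\1", copy_lines, perl = TRUE)

# Files re-included by "!" exceptions, with the leading "!/" removed.
allowed <- sub("^!/?", "", grep("^!", dockerignore, value = TRUE))

# Anything copied but not re-included would be missing from the build context.
still_ignored <- setdiff(copied, allowed)
print(still_ignored)
```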

Tests

A test directory is added with an example geocoded address file (test/my_address_file_geocoded.csv) and is useful for interactive development and automated testing (see below).

Using `make` for interactive development

The Makefile defines several make targets that are useful when developing and testing locally:

make build will build the current DeGAUSS image and name it
make test will run the container on the included example geocoded CSV file
make shell will run a DeGAUSS command, but start an interactive shell inside the container for debugging
make clean is equivalent to docker system prune -f, which cleans up any stopped containers or dangling image layers