
# peskas.timor.data.pipeline


The goal of peskas.timor.data.pipeline is to implement, deploy, and execute the data and modelling pipelines that underpin Peskas-East Timor, the small-scale fisheries analytics platform for East Timor.

## The pipeline is an R package

peskas.timor.data.pipeline is structured as an R package because that makes it easier to write production-grade software. Specifically, structuring the code as an R package makes it straightforward to document, test, and check the code, and to manage its dependencies.

We make heavy use of tidyverse style conventions and the usethis package to automate tasks during project setup and deployment.

For more information about the rationale for structuring the pipeline as a package, see Chapter 3 of Engineering Production-Grade Shiny Apps. The book focuses on Shiny applications, but the rationale also applies to data pipelines and production-ready code in general. The best place to learn more about package development is probably the R Packages book by Hadley Wickham and Jenny Bryan.

## The pipeline runs on GitHub Actions

While each step in the pipeline is defined as a function in the package, these functions are deployed and integrated using GitHub Actions. This allows us to take advantage of best practices in continuous integration and deployment (CI/CD) and to automatically link the code to its execution. However, these workflow functions behave almost like scripts: they take no parameters and are called for their side effects.
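
As a rough illustration, a workflow function might look like the sketch below. The function name and the helpers it calls are assumptions used only to show the pattern; they are not the package's actual code.

``` r
# Hypothetical sketch of the workflow-function pattern described above: the
# function takes no arguments and is called purely for its side effects.
# download_landings_survey() and upload_cloud_file() are hypothetical helpers.
ingest_landings <- function() {
  pars <- config::get(
    file = system.file("conf.yml", package = "peskas.timor.data.pipeline")
  )
  landings <- download_landings_survey(pars) # hypothetical helper
  upload_cloud_file(landings, pars)          # hypothetical helper
  invisible(NULL)
}
```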

Each job in the pipeline is defined in the workflow file `.github/workflows/data-pipeline.yaml` and can be seen in the figure below. Note that additional workflows exist to test the package in multiple environments and to build the documentation website.

The figure above illustrates the jobs that are part of the pipeline workflow. Note that not all of them are implemented yet.

Generally, the artifacts produced by each job are stored in a cloud storage container and retrieved from there by the next job in the pipeline. When storing, a job's artifacts are versioned using the function add_version(), which generally includes a timestamp and the commit SHA in the artifact name. This approach allows us to trace each artifact to a unique run of the pipeline. When retrieving, jobs can call cloud_object_name() to obtain the latest or a specific version of an artifact.
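
Conceptually, the store/retrieve convention looks roughly like the sketch below; the argument names are assumptions for illustration and may differ from the documented signatures.

``` r
# Illustrative sketch only: argument names are assumptions, not the
# package's documented signatures.

# Storing: add_version() tags the artifact name with a timestamp and the
# commit sha, so each artifact can be traced to a unique pipeline run
versioned_csv <- add_version("landings", extension = "csv")

# Retrieving: cloud_object_name() resolves the latest (or a specific)
# version of an artifact stored in the cloud container
latest_csv <- cloud_object_name(prefix = "landings", version = "latest")
```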

## Environment parameters are specified in the config file

The parameters that determine how the pipeline is run are specified in `inst/conf.yml`. This file can be accessed using `system.file("conf.yml", package = "peskas.timor.data.pipeline")`. Using this file, as opposed to hard-coding parameters in the code, allows us to easily switch parameters depending on the environment. We use the config package to read the configuration file, and we use three different environments. To determine which environment to use, the config package checks the environment variable R_CONFIG_ACTIVE.
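
A minimal sketch of reading the configuration is shown below; the "production" environment name is an assumption used for illustration.

``` r
# config::get() reads the section named by R_CONFIG_ACTIVE
# (falling back to "default" when the variable is unset)
config_file <- system.file("conf.yml", package = "peskas.timor.data.pipeline")

Sys.setenv(R_CONFIG_ACTIVE = "production")
pars <- config::get(file = config_file)
```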

## We use Docker containers

We use Docker containers to make it easier to run and develop the pipeline code in a consistent environment.

## Logging

We use the logger package to log events in production.
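
For illustration only, logging a message with the logger package looks like the sketch below; the threshold and the message are examples, not lines taken from the pipeline code.

``` r
library(logger)

# Only record messages at INFO level or above
log_threshold(INFO)
log_info("Uploading versioned artifacts to the cloud storage container")
```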


