``` r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
A crucial first step in any data analysis pipeline is importing data. The
{rixpress} package provides a flexible set of functions, rxp_r_file,
rxp_py_file, and rxp_jl_file, to handle various data import scenarios in a
reproducible way. This vignette will guide you through the common use cases.
For more examples, check out the rixpress_demos repository.
The most straightforward case is reading a single data file from your local
project directory. You need to provide a name for the resulting R object, the
path to the file, and a read_function to process it.
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars,
    path = 'data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
```
In this example, rxp_r_file creates a derivation that:

- copies data/mtcars.csv into a sandboxed build environment;
- applies the read_function, \(x) (read.csv(file = x, sep = "|")), where x is the path to the copied file inside the sandbox;
- makes the result available as mtcars for subsequent steps in the pipeline.

You can also directly import a file from a URL. Simply provide the URL as the
path. {rixpress} handles the download and ensures reproducibility by caching
the file using its cryptographic hash.
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars,
    path = 'https://raw.githubusercontent.com/b-rodrigues/rixpress_demos/refs/heads/master/basic_r/data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
```
Behind the scenes, {rixpress} uses Nix to fetch the file, ensuring that the
exact same version of the file is used every time the pipeline is run. This is
the only case in which the build sandbox can access a remote file, and only
because the file is actually downloaded by Nix ahead of time. If you need to
access data in real time from an API, you'll need to download the data yourself
outside of the {rixpress} pipeline, and then import it into the pipeline using
rxp_r_file().
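To illustrate why hash-pinned fetching guarantees reproducibility, here is a conceptual sketch in plain Python (this is not how Nix is implemented; verify_fetched_file is a hypothetical helper): a fetched file is checked against a pinned cryptographic hash, so the build fails loudly if the remote content ever changes.

```python
import hashlib

def verify_fetched_file(data: bytes, pinned_sha256: str) -> bytes:
    # A cryptographic hash is pinned for every fetched file; if the
    # content served at the URL ever changes, the hash no longer
    # matches and the build fails instead of silently using new data.
    actual = hashlib.sha256(data).hexdigest()
    if actual != pinned_sha256:
        raise ValueError(
            f"hash mismatch: expected {pinned_sha256}, got {actual}"
        )
    return data
```

A hash mismatch aborts the build, which is exactly what makes the pipeline reproducible: a run can never quietly pick up a different version of the data.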
Often, you need to import and combine multiple files from a single directory. To
do this, set the path argument to the directory's path. Your read_function
will then receive the path to this directory inside the build environment and
must contain the logic to handle all the files within.
Here is an example in R that reads all files in the data directory:
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars_r,
    path = 'data',
    read_function = \(x) {
      (readr::read_delim(list.files(x, full.names = TRUE), delim = '|'))
    }
  )
) |>
  rxp_populate(project_path = ".")
```
And here's a similar example using Python, which calls a user-defined function
read_many_csvs from an external script:
``` r
library(rixpress)

list(
  rxp_py_file(
    name = mtcars_py,
    path = 'data',
    read_function = "read_many_csvs",
    user_functions = "functions.py"
  )
) |>
  rxp_populate(project_path = ".")
```
Here is what the Python function looks like:
``` python
import polars
from pathlib import Path

def read_many_csvs(dir_path):
    folder = Path(dir_path)
    csv_files = folder.glob("*.csv")
    return polars.concat([polars.read_csv(f) for f in csv_files])
```
In both cases, the entire data directory is copied into the build sandbox, and
the read_function is responsible for listing the files and reading them.
Some file formats, like the ESRI Shapefile, consist of multiple "sidecar" files
(e.g., .shp, .shx, .dbf) that must be present together for the data to be
read correctly. Even though you might only point the read function to the .shp
file, the other component files need to be in the same directory.
{rixpress} handles this by allowing you to specify a directory as the path.
This ensures all necessary files are copied into the build environment. However,
you must then provide the full path to the main file inside the build
environment within your read_function.
In a {rixpress} pipeline, local files and directories specified in path are
copied into a sub-directory called input_folder. Therefore, the path to your
data inside the Nix sandbox will be input_folder/YOUR_PATH.
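The mapping from the path argument to its location inside the sandbox can be sketched as follows (sandbox_path is a hypothetical helper for illustration, not a {rixpress} function):

```python
import posixpath

def sandbox_path(user_path):
    # Local files and directories passed as `path` are copied under
    # 'input_folder/' in the Nix sandbox, so 'data' becomes
    # 'input_folder/data'.
    return posixpath.join("input_folder", user_path)
```

For instance, sandbox_path('data/oceans.shp') yields 'input_folder/data/oceans.shp', which is the path the read_function has to use inside the build environment.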
The following example shows how to read a shapefile using Python and
geopandas:
``` r
library(rixpress)

list(
  rxp_py_file(
    name = gdf,
    # We provide the directory 'data' to ensure all shapefile components are copied.
    path = 'data',
    # The read_function must use the hardcoded path within the build environment.
    read_function = "lambda x: geopandas.read_file('input_folder/data/oceans.shp', driver='ESRI Shapefile')"
  ),
  rxp_py(
    name = sa,
    expr = "gdf.loc[gdf['Oceans'] == 'South Atlantic Ocean']['geometry'].loc[0]"
  )
) |>
  rxp_populate(project_path = ".")
```
Here's what happens:

- The path = 'data' argument tells {rixpress} to copy the entire data directory into the sandbox.
- Inside the sandbox, the shapefile therefore ends up at input_folder/data/oceans.shp.
- The read_function is a lambda function that explicitly calls geopandas.read_file with this hardcoded path, allowing it to find the .shp file and its necessary sidecar files.

A perhaps cleaner alternative is to write a function that takes the path to the
data folder as an input, and then have this function look in that folder for the
shapefile, and pass its path to geopandas.read_file. For example:
``` python
def read_shp(path_folder):
    # Look for files ending with .shp in the given folder
    candidates = glob.glob(os.path.join(path_folder, "*.shp"))
    if not candidates:
        raise FileNotFoundError(f"No .shp file found in {path_folder}")
    shapefile = candidates[0]
    return gpd.read_file(shapefile, driver="ESRI Shapefile")
```
We can then rewrite the derivation like so:
``` r
rxp_py_file(
  name = gdf,
  path = 'data',
  read_function = "read_shp",
  user_functions = "functions.py"
),
```
(assuming our function is defined in a script called functions.py).
Because our Python function also uses glob and os, we need to import these
modules using add_import(). We can add this just after calling
rxp_populate():
``` r
rxp_populate(
  project_path = ".",
  py_imports = c(geopandas = "import geopandas as gpd")
)

# This is needed for the function defined in functions.py
add_import("import os", "default.nix")
add_import("import glob", "default.nix")
```
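The reason these add_import() calls are necessary only surfaces at run time: Python resolves module names when a function is called, not when it is defined, so a function body that mentions glob fails unless the script it runs in has imported it. A standalone sketch of the failure mode:

```python
def find_shapefiles(path_folder):
    # 'glob' is looked up when the function runs; if the surrounding
    # script never imported it, the call raises NameError.
    return glob.glob(path_folder + "/*.shp")

try:
    find_shapefiles("data")
except NameError as e:
    print(e)
```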
The rxp_*_file functions in {rixpress} offer a powerful and consistent
interface for ingesting data into your reproducible pipelines, whether your data
lives locally, on the web, as a single file, or as a collection of files. By
understanding how to specify the path and tailor the read_function, you can
handle a wide variety of data import tasks.