``` r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
A crucial first step in any data analysis pipeline is importing data. The
{rixpress} package provides a flexible set of functions, rxp_r_file,
rxp_py_file, and rxp_jl_file, to handle various data import scenarios in a
reproducible way. This vignette will guide you through the common use cases.
For more examples, check out the rixpress_demos repository.
The most straightforward case is reading a single data file from your local
project directory. You need to provide a name for the resulting R object, the
path to the file, and a read_function to process it.
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars,
    path = 'data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
```
In this example, rxp_r_file creates a derivation that:

- copies data/mtcars.csv into a sandboxed build environment;
- applies the read_function, \(x) (read.csv(file = x, sep = "|")), where x is the path to the copied file inside the sandbox;
- makes the result available as mtcars for subsequent steps in the pipeline.

You can also directly import a file from a URL. Simply provide the URL as the
path. {rixpress} handles the download and ensures reproducibility by caching
the file using its cryptographic hash.
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars,
    path = 'https://raw.githubusercontent.com/b-rodrigues/rixpress_demos/refs/heads/master/basic_r/data/mtcars.csv',
    read_function = \(x) (read.csv(file = x, sep = "|"))
  ),
  ...
```
Behind the scenes, {rixpress} uses Nix to fetch the file, ensuring that the
exact same version of the file is used every time the pipeline is run. This is
the only case in which the build sandbox can access a remote file, and only
because the file is actually downloaded by Nix ahead of time. If you need to
access data in real time from an API, you'll need to download the data yourself
outside of the {rixpress} pipeline, and then import it into the pipeline using
rxp_r_file().
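To illustrate why hash-pinned fetching guarantees reproducibility, here is a conceptual sketch in plain Python (this is not how Nix is implemented; verify_fetched_file is a hypothetical helper): a fetched file is checked against a pinned cryptographic hash, so the build fails loudly if the remote content ever changes.

```python
import hashlib

def verify_fetched_file(data: bytes, pinned_sha256: str) -> bytes:
    # A cryptographic hash is pinned for every fetched file; if the
    # content served at the URL ever changes, the hash no longer
    # matches and the build fails instead of silently using new data.
    actual = hashlib.sha256(data).hexdigest()
    if actual != pinned_sha256:
        raise ValueError(
            f"hash mismatch: expected {pinned_sha256}, got {actual}"
        )
    return data
```

A hash mismatch aborts the build, which is exactly what makes the pipeline reproducible: a run can never quietly pick up a different version of the data.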
Often, you need to import and combine multiple files from a single directory. To
do this, set the path argument to the directory's path. Your read_function
will then receive the path to this directory inside the build environment and
must contain the logic to handle all the files within.
Here is an example in R that reads all files in the data directory:
``` r
library(rixpress)

list(
  rxp_r_file(
    name = mtcars_r,
    path = 'data',
    read_function = \(x) {
      (readr::read_delim(list.files(x, full.names = TRUE), delim = '|'))
    }
  )
) |>
  rxp_populate(project_path = ".")
```
And here's a similar example using Python, which calls a user-defined function
read_many_csvs from an external script:
``` r
library(rixpress)

list(
  rxp_py_file(
    name = mtcars_py,
    path = 'data',
    read_function = "read_many_csvs",
    user_functions = "functions.py"
  )
) |>
  rxp_populate(project_path = ".")
```
Here is what the Python function looks like:
``` python
import polars
from pathlib import Path

def read_many_csvs(dir_path):
    folder = Path(dir_path)
    csv_files = folder.glob("*.csv")
    return polars.concat([polars.read_csv(f) for f in csv_files])
```
In both cases, the entire data directory is copied into the build sandbox, and
the read_function is responsible for listing the files and reading them.
Some file formats, like the ESRI Shapefile, consist of multiple "sidecar" files
(e.g., .shp, .shx, .dbf) that must be present together for the data to be
read correctly. Even though you might only point the read function to the .shp
file, the other component files need to be in the same directory.
{rixpress} handles this by allowing you to specify a directory as the path.
This ensures all necessary files are copied into the build environment. However,
you must then provide the full path to the main file inside the build
environment within your read_function.
In a {rixpress} pipeline, local files and directories specified in path are
copied into a sub-directory called input_folder. Therefore, the path to your
data inside the Nix sandbox will be input_folder/YOUR_PATH.
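The mapping from the path argument to its location inside the sandbox can be sketched as follows (sandbox_path is a hypothetical helper for illustration, not a {rixpress} function):

```python
import posixpath

def sandbox_path(user_path):
    # Local files and directories passed as `path` are copied under
    # 'input_folder/' in the Nix sandbox, so 'data' becomes
    # 'input_folder/data'.
    return posixpath.join("input_folder", user_path)
```

For instance, sandbox_path('data/oceans.shp') yields 'input_folder/data/oceans.shp', which is the path the read_function has to use inside the build environment.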
The following example shows how to read a shapefile using Python and
geopandas:
``` r
library(rixpress)

list(
  rxp_py_file(
    name = gdf,
    # We provide the directory 'data' to ensure all shapefile components are copied.
    path = 'data',
    # The read_function must use the hardcoded path within the build environment.
    read_function = "lambda x: geopandas.read_file('input_folder/data/oceans.shp', driver='ESRI Shapefile')"
  ),
  rxp_py(
    name = sa,
    expr = "gdf.loc[gdf['Oceans'] == 'South Atlantic Ocean']['geometry'].loc[0]"
  )
) |>
  rxp_populate(project_path = ".")
```
Here's what happens:

- The path = 'data' argument tells {rixpress} to copy the entire data directory into the sandbox.
- Inside the sandbox, the shapefile therefore ends up at input_folder/data/oceans.shp.
- The read_function is a lambda function that explicitly calls geopandas.read_file with this hardcoded path, allowing it to find the .shp file and its necessary sidecar files.

A perhaps cleaner alternative is to write a function that takes the path to the
data folder as an input, and then have this function look in that folder for the
shapefile, and pass its path to geopandas.read_file. For example:
``` python
def read_shp(path_folder):
    # Look for files ending with .shp in the given folder
    candidates = glob.glob(os.path.join(path_folder, "*.shp"))
    if not candidates:
        raise FileNotFoundError(f"No .shp file found in {path_folder}")
    shapefile = candidates[0]
    return gpd.read_file(shapefile, driver="ESRI Shapefile")
```
We can then rewrite the derivation like so:
``` r
rxp_py_file(
  name = gdf,
  path = 'data',
  read_function = "read_shp",
  user_functions = "functions.py"
),
```
(assuming our function is defined in a script called functions.py).
Because our Python function also uses glob and os, we need to import these
modules using add_import(). We can add this just after calling
rxp_populate():
``` r
rxp_populate(
  project_path = ".",
  py_imports = c(geopandas = "import geopandas as gpd")
)

# This is needed for the function defined in functions.py
add_import("import os", "default.nix")
add_import("import glob", "default.nix")
```
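The reason these add_import() calls are necessary only surfaces at run time: Python resolves module names when a function is called, not when it is defined, so a function body that mentions glob fails unless the script it runs in has imported it. A standalone sketch of the failure mode:

```python
def find_shapefiles(path_folder):
    # 'glob' is looked up when the function runs; if the surrounding
    # script never imported it, the call raises NameError.
    return glob.glob(path_folder + "/*.shp")

try:
    find_shapefiles("data")
except NameError as e:
    print(e)
```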
The rxp_*_file functions in {rixpress} offer a powerful and consistent
interface for ingesting data into your reproducible pipelines, whether your data
lives locally, on the web, as a single file, or as a collection of files. By
understanding how to specify the path and tailor the read_function, you can
handle a wide variety of data import tasks.