# These settings make the vignette prettier
# knitr::opts_chunk$set(results="hold", collapse=FALSE, message=FALSE)

Introduction

Unitar is a simple R package that wraps the targets package. To use unitar, you will need to already be familiar with targets, because the functionality is just an extension of targets and the naming conventions all follow the targets package approach. Unitar adds new functionality to use targets that span projects, which is outside the scope of the base targets package. With unitar, you can very easily link two targets projects, loading in built targets from other projects so that you can share caches and computing across users and projects.

Installing unitar

Install unitar like this:

devtools::install_github("databio/unitar")

Three modes of use

There are three primary ways to use unitar, which differ in the way they treat external targets. External targets are targets that are not part of the current project, but belong to a different targets project. The 3 modes are:

  1. Basic loading of external targets: Load targets computed in other projects. Don't track these in the current targets project, just re-use them, and let all tracking happen in the external project. This is the basic case.

  2. Load and track external targets locally: Load external targets and also track them here. Duplicate external caches into local caches. If the external data changes, it will update the local caches, and all local files for the current project.

  3. Load and track external targets, but don't duplicate caches: Load external targets and track the original files here, but don't duplicate them into here. This can you do by just adding a "file" target with the external target file, and then writing functions that use this to produce whatever derived targets you need. This way, your targets will update if the external target changes, but you don't have to pay the cost of storing the cache twice.

For my use case, #3 is the most useful, but the others may also be useful depending on what you want.

Use case #1: Basic loading of external targets

In the target_projects, we have 2 subfolders. Each of these represents a separate project that uses targets; you can find a _targets.R file in each subfolder. These represent your typical, independent targets project folders.

Here's how you'd load these targets in using unitar_read:

library("unitar")
project_root = system.file("extdata", "target_projects", package="unitar")

project_subfolders = list.files(project_root)
project_folders = paste(project_root, project_subfolders, sep="/")

unitar::unitar_make(project_folders)

big_data_set = unitar::unitar_read("big_data_set", project_folders)
head(big_data_set)

unitar_read works like tar_read, but you give it a priority list of targets folders to search, and so it can search outside your current targets environment. This way you can share built targets across projects.

unitar_meta works like tar_meta, but runs across all the given project folders:

unitar_meta(project_folders)

I prefer the *_read approach because it's explicit, but if you prefer to use the common R idiom of loading data using function side effects, you can also use unitar_load, which mimics the tar_load functionality:

unitar_load("big_data_set", project_folders)
head(big_data_set)

Use case #2: Load and track external targets locally

If there's a target in an external project that you want to use in your current project, the basic way (above) just loaded the target in from that other project. Another way to do it is to actually track that external target as a target in the local project. The difference is that with this way, you'll duplicate the cache of the file into the local folder. This duplication could be either an advantage or a disadvantage. If you want your project to be self-contained, with all your targets in the same folder, then you may want to track the files like that.

To do this, we will need to add new targets to the our local _targets.R, so it will track the external targets. To do this, use unitar_read_xprj() like this:

# _targets.R

library("unitar")
tar_dirs="../refdata1"  # external targets projects you want to track

list(
  unitar_read_xprj("big_data_set", tar_dirs)
)

So, it's pretty simple, really. You can think of this as just registering any external targets into your local targets list. The unitar_read_xprj is a target factory that makes this super easy.

Use case #3: Tracking external targets but not duplicating

Finally, there's a third approach. What if you want to track the other external targets, so your stuff updates when they change, but you don't want to duplicate large caches into your local folder? This leads to option 3. Here, what you want to do is add the external file to your target list so it gets tracked, and use that as input, and then use unitar_read_from_path in a local function to process that data into the subset you want to keep locally.

Here's an example:

# _targets.R
library("targets")
library("unitar")

# Function that takes an external dataset (from another targets project),
# and returns a modified version for this project.
local_filter_big_reference_data = function(big_data_set_path) {
  big_data_set = unitar_read_from_path(big_data_set_path)
  big_data_set[big_data_set > 2]
}

list(
  tar_target(
    big_data_set_path,
    unitar_path("big_data_set", "../refdata1"),
    format = "file"
  ),
  tar_target(
    filtered_data_set,
    local_filter_big_reference_data(big_data_set_path)
  )
)

Now, if big_data_set (from an external project) changes, that will invalidate your filtered_data_set, which will be recomputed. But you don't actually duplicate big_data_set into your local targets cache, saving space.

Avoiding passing the tar_dirs

Regardless of which of these options we choose, we're going to end up making calls in our scripts like thiS:

big_data_set = unitar::unitar_read("big_data_set", project_folders)

But it's more work to set up and pass around a projects_folders variable like this, relative to the simpler targest::tar_read("big_data_set") you could do in a local targets project. To get around this, all relevant unitar functions provide the ability to use R options as global variables, so you can set them once and then not worry about passing the project folders with every call. It's pretty easy to set up:

options(tar_dirs=project_folders)

Then you just leave that argument off of the unitar_* calls:

big_data_set = unitar::unitar_read("big_data_set")

Now, it's just as easy as using targets natively, but you get the added power of grabbing targets from multiple projects.

Configuring external target folders with pepr

The one possible place where this may not be ideal is if you want to maintain separate lists of folders, and query them separately. Using the options(tar_dirs=project_folders) approach limits you to a single project_folders variable. Alternatively, you can provide a list of folders in a configuration file for your project. unitar uses a project configuration file in standard PEP format. You specify a list of target folders using the tprojects attribute, which may specify paths either absolute or relative to the configuration file.

For example, the project_config.yaml file for the above project_folders example might look like this:

pep_version: "2.0.0"

tprojects:
  - ../target_projects/refdata1/
  - ../target_projects/refdata2/
  - ../

To make this work, unitar provides a series of peptar_* functions that work like the unitar_* functions above, but operate on PEPs that you can configure with your list of project folders.

pep_config = system.file("extdata", "metadata/project_config.yaml", package="unitar")
p = pepr::Project(pep_config)

unitar::peptar_dirs(p)

unitar::peptar_path(p, "ref2")
unitar::peptar_meta(p, fields="name")
unitar::peptar_make(p)

big_data_set2 = peptar_read(p, "big_data_set")
head(big_data_set2)

Or, using peptar_load:

peptar_load(p, "big_data_set")
head(big_data_set)


databio/unitar documentation built on May 21, 2021, 11:25 p.m.