View source: R/tar_repository_cas.R
tar_repository_cas | R Documentation |
Define a custom storage repository that uses content-addressable storage (CAS).
tar_repository_cas(
  upload,
  download,
  exists,
  consistent = FALSE,
  substitute = list()
)
upload |
A function with arguments key and path.
See the "Repository functions" section for more details. |
download |
A function with arguments key and path. Please be careful to avoid deleting the object at key. See the "Repository functions" section for more details. |
exists |
A function with a single argument key. See the "Repository functions" section for more details. |
consistent |
Logical. Set to TRUE if the storage system has strong
read-after-write consistency; the default is FALSE.
A data storage system is said to have strong read-after-write
consistency if a new object is fully available for reading as soon as
the write operation finishes. Many modern cloud services like
Amazon S3 and Google Cloud Storage have strong read-after-write
consistency, meaning that if you upload an object with a PUT request,
then a GET request immediately afterwards will retrieve the precise
version of the object you just uploaded.
Some storage systems do not have strong read-after-write consistency.
One example is network file systems (NFS). On a computing cluster,
if one node creates a file on an NFS, then there is a delay before
other nodes can access the new file. |
substitute |
Named list of values to be inserted into the
body of each custom function in place of symbols in the body.
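For an example, see the "Repository functions" section.
Please do not include temporary or sensitive information
such as authentication credentials.
Instead, pass them as environment variables using
tar_resources_repository_cas(). |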
Normally, targets organizes output data based on target names.
For example, if a pipeline has a single target x with default settings,
then tar_make() saves the output data to the file _targets/objects/x.
When the output of x changes, tar_make() overwrites _targets/objects/x.
In other words, no matter how many changes happen to x,
the data store always looks like this:
_targets/
    meta/
        meta
    objects/
        x
By contrast, with content-addressable storage (CAS),
targets organizes outputs based on the hashes of their contents.
The name of each output file is its hash, and the
metadata maps these hashes to target names. For example, suppose
target x has repository = tar_repository_cas_local("my_cas").
When the output of x changes, tar_make() creates a new file
inside my_cas/ without overwriting or deleting any other files
in that folder. If you run tar_make() three different times
with three different values of x, then storage will look like this:
_targets/
    meta/
        meta
my_cas/
    1fffeb09ad36e84a
    68328d833e6361d3
    798af464fb2f6b30
The next call to tar_read(x) uses tar_meta(x)$data
to look up the current hash of x. If tar_meta(x)$data returns
"1fffeb09ad36e84a", then tar_read(x) returns the data from
my_cas/1fffeb09ad36e84a. Files my_cas/68328d833e6361d3 and
my_cas/798af464fb2f6b30 are left over from previous values of x.
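The layout described above can be imitated with a toy store in base R. This is only a sketch of the idea and not how targets implements it: the helper names cas_save() and cas_read(), the temporary folder, the in-memory metadata, and the use of MD5 as the hash are all assumptions made for illustration.

```r
# Toy content-addressable store in base R (illustrative sketch only).
cas_dir <- tempfile("toy_cas")   # hypothetical CAS folder
dir.create(cas_dir)
meta <- new.env()                # stand-in for metadata: target name -> hash

# Hypothetical helper: store a value under the hash of its serialized file.
cas_save <- function(name, value) {
  staging <- tempfile(tmpdir = cas_dir)
  saveRDS(value, staging)
  key <- unname(tools::md5sum(staging))          # MD5 is an assumption
  file.rename(staging, file.path(cas_dir, key))  # file name = content hash
  assign(name, key, envir = meta)                # metadata tracks the latest hash
  invisible(key)
}

# Hypothetical helper: look up the current hash, then read that file.
cas_read <- function(name) {
  readRDS(file.path(cas_dir, get(name, envir = meta)))
}

for (value in c(1, 2, 3)) cas_save("x", value)
length(list.files(cas_dir))  # one file per distinct value of x
cas_read("x")                # the most recent value
```

Older files stay in the folder untouched, which is what makes it possible to return to an earlier state by restoring earlier metadata.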
Because CAS accumulates historical data objects,
it is ideal for data versioning and collaboration.
If you commit the _targets/meta/meta
file to version control
alongside the source code,
then you can revert to a previous state of your pipeline with all your
targets up to date, and a colleague can leverage your hard-won
results using a fork of your code and metadata.
The downside of CAS is the cost of accumulating many data objects over time. Most pipelines that use CAS should have a garbage collection system or retention policy to remove data objects when they are no longer needed.
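One possible retention policy for a folder-based CAS is to delete every object whose hash is not in a list of hashes to keep. The sketch below assumes a local folder and a hypothetical helper cas_gc(); in a real pipeline the hashes to keep would presumably come from the metadata, and the demo files here are empty placeholders.

```r
# Hypothetical helper: remove objects that are not in the keep list.
cas_gc <- function(cas_dir, keep) {
  stale <- setdiff(list.files(cas_dir), keep)
  unlink(file.path(cas_dir, stale))
  invisible(stale)  # return the removed keys for logging
}

# Demo with placeholder files named after the hashes used on this page.
cas_dir <- tempfile("my_cas")
dir.create(cas_dir)
keys <- c("1fffeb09ad36e84a", "68328d833e6361d3", "798af464fb2f6b30")
file.create(file.path(cas_dir, keys))

removed <- cas_gc(cas_dir, keep = "1fffeb09ad36e84a")
list.files(cas_dir)  # only the object that is still referenced remains
```

Deleting old objects gives up the ability to revert to the pipeline states that used them, so the keep list should cover every version of the metadata you still care about.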
The tar_repository_cas() function lets you create your own CAS system
for targets. You can supply arbitrary custom methods to upload,
download, and check for the existence of data objects. Your custom
CAS system can exist locally on a shared file system or remotely
on the cloud (e.g. in an AWS S3 bucket).
See the "Repository functions" section and the documentation
of individual arguments for advice on how
to write your own methods.
The tar_repository_cas_local() function has an example
CAS system based on a local folder on disk.
It uses tar_cas_u() for uploads, tar_cas_d() for downloads, and
tar_cas_e() for existence.
Repository functions
In tar_repository_cas(), functions upload, download,
and exists must be completely pure and self-sufficient.
They must load or namespace all their own packages,
and they must not depend on any custom user-defined
functions or objects in the global environment of your pipeline.
targets converts each function to and from text,
so it must not rely on any data in the closure.
This disqualifies functions produced by Vectorize(), for example.
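To see why closure data is a problem, the round trip through text can be imitated with deparse() and parse(). This is an assumption made for illustration, since this page does not say how targets performs the conversion, and round_trip() is a hypothetical helper. A self-sufficient function survives the trip, while a function produced by Vectorize() loses the objects it kept in its closure.

```r
# Hypothetical helper: convert a function to text and back, then place it
# in a clean environment so nothing from the original closure survives.
round_trip <- function(fun) {
  text <- paste(deparse(fun), collapse = "\n")
  eval(parse(text = text), envir = new.env(parent = baseenv()))
}

# Self-sufficient: everything it needs is in its own body or in base R.
pure <- function(key) file.path("cas", key)
pure_copy <- round_trip(pure)
pure_copy("abc")  # "cas/abc"

# Vectorize() stores the wrapped function in the closure, not in the body.
vectorized <- Vectorize(function(key) file.path("cas", key))
broken <- round_trip(vectorized)
result <- tryCatch(broken("abc"), error = function(e) e)
inherits(result, "error")  # TRUE: the closure objects are gone
```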
upload and download can assume length(path) is 1, but they should
account for the possibility that path could be a directory.
To opt out of supporting directories, upload can call an assertion:
targets::tar_assert_not_dir(
  path,
  msg = "This CAS upload method does not support directories."
)
Otherwise, support for directories may require handling them as a
special case. For example, upload and download could copy
all the files in the given directory,
or they could manage the directory as a zip archive.
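The first option, copying all the files in the directory, might look like the sketch below. It assumes a local CAS folder named "cas" relative to the working directory, and the names cas_upload() and cas_download(), the key "abc123", and the demo folders are hypothetical. How targets calls these functions for directory outputs is not covered here, so treat this as a starting point only.

```r
# Sketch: upload and download that treat directories as a special case.
cas_upload <- function(key, path) {
  dest <- file.path("cas", key)
  if (dir.exists(path)) {
    dir.create(dest, recursive = TRUE)
    files <- list.files(path, full.names = TRUE, all.files = TRUE, no.. = TRUE)
    file.copy(files, dest, recursive = TRUE)  # copy the directory contents
  } else {
    dir.create("cas", showWarnings = FALSE, recursive = TRUE)
    file.copy(path, dest)
  }
  invisible(NULL)
}

cas_download <- function(key, path) {
  src <- file.path("cas", key)
  if (dir.exists(src)) {
    dir.create(path, showWarnings = FALSE, recursive = TRUE)
    files <- list.files(src, full.names = TRUE, all.files = TRUE, no.. = TRUE)
    file.copy(files, path, recursive = TRUE)  # copy, never move
  } else {
    file.copy(src, path)
  }
  invisible(NULL)
}

# Demo in a throwaway working directory.
root <- tempfile("cas_demo")
dir.create(root)
old <- setwd(root)
dir.create("out")
writeLines("a", file.path("out", "a.txt"))
cas_upload("abc123", "out")
cas_download("abc123", "restored")
list.files("restored")  # the copied file, a.txt
setwd(old)
```

Both functions copy rather than move, so the object stored in the CAS folder is still there after a download.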
Some functions may need to be adapted and configured based on other
inputs. For example, you may want to define
upload = \(key, path) file.rename(path, file.path(folder, key))
but do not want to hard-code a value of folder when you write the
underlying function. The substitute argument handles this situation.
For example, if substitute is list(folder = "my_folder"),
then upload will end up as
\(key, path) file.rename(path, file.path("my_folder", key)).
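The effect of this replacement can be reproduced with base R's substitute(). The sketch below only mimics the outcome described above; it is an assumption that targets does something comparable internally, and the variable name values is made up for the example.

```r
# Mimic symbol substitution in a function body with base R.
upload <- \(key, path) file.rename(path, file.path(folder, key))
values <- list(folder = "my_folder")  # plays the role of the substitute argument

# Replace each matching symbol in the body with its value.
body(upload) <- do.call(substitute, list(body(upload), values))

deparse(body(upload))
# "file.rename(path, file.path(\"my_folder\", key))"
```

After the replacement the function no longer refers to any object named folder, so it stays self-sufficient when converted to text.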
Temporary or sensitive information such as authentication credentials
should not be injected this way into the function body. Instead, pass
them as environment variables using tar_resources_repository_cas().
Other content-addressable storage:
tar_repository_cas_local(), tar_repository_cas_local_gc()
if (identical(Sys.getenv("TAR_EXAMPLES"), "true")) { # for CRAN
  tar_dir({ # tar_dir() runs code from a temp dir for CRAN.
    tar_script({
      library(targets)
      library(tarchetypes)
      repository <- tar_repository_cas(
        upload = function(key, path) {
          if (dir.exists(path)) {
            stop("This CAS repository does not support directory outputs.")
          }
          if (!file.exists("cas")) {
            dir.create("cas", recursive = TRUE)
          }
          file.rename(path, file.path("cas", key))
        },
        download = function(key, path) {
          file.copy(file.path("cas", key), path)
        },
        exists = function(key) {
          file.exists(file.path("cas", key))
        }
      )
      write_file <- function(object) {
        writeLines(as.character(object), "file.txt")
        "file.txt"
      }
      list(
        tar_target(x, c(2L, 4L), repository = repository),
        tar_target(
          y,
          x,
          pattern = map(x),
          format = "qs",
          repository = repository
        ),
        tar_target(z, write_file(y), format = "file", repository = repository)
      )
    })
    tar_make()
    tar_read(y)
    tar_read(z)
    list.files("cas")
    tar_meta(any_of(c("x", "z")), fields = any_of("data"))
  })
}