Nothing
##' Extract metadata from a group of packets. This is an
##' **experimental** high-level function for interacting with the
##' metadata in a way that we hope will be useful. We'll expand this a
##' bit as time goes on, based on feedback we get so let us know what
##' you think. See Details for how to use this.
##'
##' Extracting data from outpack metadata is challenging to do in a
##' way that works in data structures familiar to R users, because it
##' is naturally tree structured, and because not all metadata may be
##' present in all packets (e.g., a packet that does not depend on
##' another will not have a dependency section, and one that was run
##' in a context without git will not have git metadata). If you just
##' want the raw tree-structured data, you can always use
##' [orderly_metadata()] to load the full metadata for any
##' packet (even one that is not currently available on your computer,
##' just known about it) and the structure of the data will remain
##' fairly constant across orderly versions.
##'
##' However, sometimes we want to extract data in order to ask
##' specific questions like:
##'
##' * what parameter combinations are available across a range of packets?
##' * when were a particular set of packets used?
##' * what files did these packets produce?
##'
##' Later we'd like to ask even more complex questions like:
##'
##' * at what version did the file `graph.png` change?
##' * what inputs changed between these versions?
##'
##' ...but being able to answer these questions requires a similar
##' approach to interrogating metadata across a range of packets.
##'
##' The `orderly_metadata_extract` function aims to simplify the
##' process of pulling out bits of metadata and arranging it into a
##' `data.frame` (of sorts) for you. It has a little mini-language in
##' the `extract` argument for doing some simple rewriting of results,
##' but you can always do this yourself.
##'
##' In order to use function you need to know what metadata are
##' available; we will expand the vignette with more worked examples
##' here to make this easier to understand. The function works on
##' top-level keys, of which there are:
##'
##' * id: the packet id (this is always returned)
##' * name: the packet name
##' * parameters: a key-value pair of values, with string keys and
##' atomic values. There is no guarantee about presence of keys
##' between packets, or their types.
##' * time: a key-value pair of times, with string keys and time
##' values (see [DateTimeClasses]; these are stored as seconds since
##' 1970 in the actual metadata). At present `start` and `end` are
##' always present.
##' * files: files present in each packet. This is a `data.frame` (per
##' packet), each with columns `path` (relative), `size` (in bytes)
##' and `hash`.
##' * depends: dependencies used each packet. This is a `data.frame`
##' (per packet), each with columns `packet` (id), `query` (string,
##' used to find `packet`) and `files` (another `data.frame` with
##' columns `there` and `here` corresponding to filenames upstream
##' and in this packet, respectively)
##' * git: either metadata about the state of git or `null`. If given
##' then `sha` and `branch` are strings, while `url` is an array of
##' strings/character vector (can have zero, one or more elements).
##' * session: some information about the session that the packet was run in
##' (this is unstandardised, and even the orderly version may change)
##' * custom: additional metadata added by its respective engine. For
##' packets run by `orderly`, there will be an `orderly` field here,
##' which is itself a list:
##' * artefacts: A [data.frame] with artefact information, containing
##' columns `description` (a string) and `paths` (a list column of paths).
##' * shared: A [data.frame] of the copied shared resources with
##' their original name (`there`) and name as copied into the packet
##' (`here`).
##' * role: A [data.frame] of identified roles of files, with columns `path`
##' and `role`.
##' * description: A list of information from
##' [orderly_description()] with human-readable descriptions and
##' tags.
##' * session: A list of information about the session as run,
##' with a list `platform` containing information about the platform
##' (R version as `version`, operating system as `os` and system name
##' as `system`) and `packages` containing columns `package` ,
##' `version` and `attached`.
##'
##' The nesting here makes providing a universally useful data format
##' difficult; if considering files we have a `data.frame` with a
##' `files` column, which is a list of `data.frame`s; similar
##' nestedness applies to `depends` and the orderly custom
##' data. However, you should be able to fairly easily process the
##' data into the format you need it in.
##'
##' The simplest extraction uses names of top-level keys:
##'
##' ```
##' extract = c("name", "parameters", "files")
##' ```
##'
##' This creates a data.frame with columns corresponding to these
##' keys, one row per packet. Because `name` is always a string, it
##' will be a character vector, but because `parameters` and `files`
##' are more complex, these will be list columns.
##'
##' You must not provide `id`; it is always returned and always first
##' as a character vector column. If your extraction could possibly
##' return data from locations (i.e., you have `allow_remote = TRUE`
##' or have given a value for `location`) then we add a logical column
##' `local` which indicates if the packet is local to your archive,
##' meaning that you have all the files from it locally.
##'
##' You can rename the columns by providing a name to entries within
##' `extract`, for example:
##'
##' ```
##' extract = c("name", pars = "parameters", "files")
##' ```
##'
##' is the same as above, except that that the `parameters` column has
##' been renamed `pars`.
##'
##' More interestingly, we can index into a structure like
##' `parameters`; suppose we want the value of the parameter `x`, we
##' could write:
##'
##' ```
##' extract = c(x = "parameters.x")
##' ```
##'
##' which is allowed because for *each packet* the `parameters`
##' element is a list.
##'
##' However, we do not know what type `x` is (and it might vary
##' between packets). We can add that information ourselves though and write:
##'
##' ```
##' extract = c(x = "parameters.x is number")
##' ```
##'
##' to create an numeric column. If any packet has a value of `x` that
##' is non-integer, your call to `orderly_metadata_extract` will fail
##' with an error, and if a packet lacks a value of `x`, a missing
##' value of the appropriate type will be added.
##'
##' Note that this does not do any coercion to number, it will error
##' if a non-NULL non-numeric value is found. Valid types for use
##' with `is <type>` are `boolean`, `number` and `string` (note that
##' these differ slightly from R's names because we want to emphasise
##' that these are *scalar* quantities; also note that there is no
##' `integer` here as this may produce unexpected errors with
##' integer-like numeric values). You can also use `list` but this is
##' the default. Things in the schema that are known to be scalar
##' atomics (such as `name`) will be automatically simplified.
##'
##' You can index into the array-valued elements (`files` and
##' `depends`) in the same way as for the object-valued elements:
##'
##' ```
##' extract = c(file_path = "files.path", file_hash = "files.hash")
##' ```
##'
##' would get you a list column of file names per packet and another
##' of hashes, but this is probably less useful than the `data.frame`
##' you'd get from extracting just `files` because you no longer have
##' the hash information aligned.
##'
##' You can index fairly deeply; it should be possible to get the
##' orderly "display name" with:
##'
##' ```
##' extract = c(display = "custom.orderly.description.display is string")
##' ```
##'
##' If the path you need to extract has a dot in it (most likely a
##' package name for a plugin, such as `custom.orderly.db`) you need
##' to escape the dot with a backslash (so, `custom.orderly\.db`). You
##' will probably need two slashes or use a raw string (in recent
##' versions of R).
##'
##' @section Custom 'orderly' metadata:
##'
##' Within `custom.orderly`, additional fields can be extracted. The
##' format of this is subject to change, both in the stored metadata
##' and schema (in the short term) and in the way we deserialise it.
##' It is probably best not to rely on this right now, and we will
##' expand this section when you can.
##'
##' @title Extract metadata from orderly packets
##'
##' @param extract A character vector of columns to extract, possibly
##' named. See Details for the format.
##'
##' @inheritParams orderly_metadata
##' @inheritParams orderly_search
##' @inheritParams orderly_search_options
##'
##' @return A `data.frame`, the columns of which vary based on the
##' names of `extract`; see Details for more information.
##'
##' @export
##' @examples
##' path <- orderly_example()
##'
##' # Generate a bunch of packets:
##' suppressMessages({
##' orderly_run("data", echo = FALSE, root = path)
##' for (n in c(2, 4, 6, 8)) {
##' orderly_run("parameters", list(max_cyl = n), echo = FALSE, root = path)
##' }
##' })
##'
##' # Without a query, we get a summary over all packets; this will
##' # often be too much:
##' orderly_metadata_extract(root = path)
##'
##' # Pass in a query to limit things:
##' meta <- orderly_metadata_extract(quote(name == "parameters"), root = path)
##' meta
##'
##' # The parameters are present as a list column:
##' meta$parameters
##'
##' # You can also lift values from the parameters into columns of their own:
##' orderly_metadata_extract(
##' quote(name == "parameters"),
##' extract = c(max_cyl = "parameters.max_cyl is number"),
##' root = path)
orderly_metadata_extract <- function(expr = NULL, name = NULL, location = NULL,
allow_remote = NULL,
fetch_metadata = FALSE,
extract = NULL, options = NULL,
root = NULL) {
root <- root_open(root, require_orderly = FALSE)
compatibility_fix_options(options, "orderly_metadata_extract")
if (expr_is_literal_id(expr, name)) {
ids <- expr
} else {
ids <- orderly_search(expr,
name = name,
location = location,
allow_remote = allow_remote,
fetch_metadata = fetch_metadata,
root = root)
}
extract <- parse_extract(extract, environment())
is_core_metadata <-
vlapply(extract$from, function(el) el[[1]] %in% metadata_core_names)
if (all(is_core_metadata)) {
meta <- lapply(ids, outpack_metadata_core, root = root)
} else {
meta <- lapply(ids, orderly_metadata, root = root)
}
envir <- environment()
ret <- data_frame(id = ids)
if (isTRUE(allow_remote) || length(location) > 0) {
loc <- root$index$location(location)
loc <- loc[loc$packet %in% ids, ]
ret$local <- ids %in% root$index$unpacked()
ret$location <- I(unname(split(loc$location, loc$packet)[ids]))
}
for (i in seq_len(nrow(extract))) {
from_i <- extract$from[[i]]
is_i <- extract$is[[i]]
value_i <- lapply(meta, function(x) {
tryCatch(x[[from_i]], error = function(e) NULL)
})
ret[[extract$to[[i]]]] <- extract_convert(ids, value_i, from_i, is_i, envir)
}
ret
}
parse_extract <- function(extract, call = NULL) {
if (is.null(extract)) {
extract <- c("name", "parameters")
}
re_dot <- "(?<!\\\\)\\."
re_type <- "^(.+)\\s+is+\\s+(.+)$"
extract_as_nms <- sub(re_type, "\\1", gsub(re_dot, "_", extract, perl = TRUE))
extract_as_nms <- sub("\\.", ".", extract_as_nms, fixed = TRUE)
if (is.null(names(extract))) {
to <- extract_as_nms
} else {
to <- names(extract)
i <- !nzchar(to)
to[i] <- extract_as_nms[i]
}
is <- rep(NA_character_, length(extract))
i <- grepl(re_type, extract)
if (any(i)) {
valid <- c("string", "number", "boolean", "list")
is[i] <- unname(sub(re_type, "\\2", extract[i]))
extract[i] <- sub(re_type, "\\1", extract[i])
if (any(err <- !(is.na(is) | is %in% valid))) {
collapseq <- function(x) paste(squote(x), collapse = ", ")
invalid <- sprintf("Extraction of '%s' used type '%s'",
extract[err], is[err])
cli::cli_abort(
c(sprintf("Invalid conversion type %s requested in 'extract'",
collapseq(unique(is[err]))),
i = sprintf("'is' must be one of %s", collapseq(valid)),
set_names(invalid, rep("x", length(invalid)))),
call = call)
}
}
extract <- strsplit(extract, re_dot, perl = TRUE)
extract <- lapply(extract, function(x) gsub("\\.", ".", x, fixed = TRUE))
if ("id" %in% to) {
cli::cli_abort(
"Don't use 'id' as a column to extract; this column is always added",
call = call)
}
if (any(duplicated(to))) {
dups <- paste(squote(unique(to[duplicated(to)])), collapse = ", ")
cli::cli_abort(
c("All destination columns in 'extract' must be unique",
x = sprintf("Duplicated names: %s", dups)),
call = call)
}
data_frame(
from = I(as.list(unname(extract))),
to = to,
is = is)
}
extract_type <- function(nm, is) {
if (length(nm) == 1 && nm %in% c("id", "name")) {
"string"
} else if (length(nm) == 2 && nm[[1]] == "time") {
"time"
} else if (length(nm) == 2 && nm[[1]] == "git" && nm[[2]] != "url") {
"string"
} else if (!is.na(is)) {
is
} else {
"list"
}
}
extract_convert <- function(ids, value, from, is, call) {
examples <- NULL
type <- extract_type(from, is)
if (type == "list") {
I(value)
} else {
is_null <- vlapply(value, is.null)
len <- lengths(value)
mode <- vcapply(value, storage.mode)
## This can't be triggered until we add packages or sources back
## into the orderly custom schema
err <- len != 1 & !is_null
if (any(err)) {
msg <- sprintf(
"Expected all values of '%s' to evaluate to a scalar (if not NULL)",
paste(from, collapse = "."))
examples <- sprintf("Value for %s has length %d", ids[err], len[err])
cli::cli_abort(c(msg, error_examples(examples)), call = call)
}
## convert all integers into reals, just to forget about the
## differences between the two of them here as the user can deal
## with that in the very rare cases where it matters. jsonlite
## does bring `1` as an integer vector, which is rarely what we
## want, and it deparses poorly.
if (type == "number" && any(mode == "integer")) {
value[mode == "integer"] <- lapply(value[mode == "integer"], as.numeric)
mode[mode == "integer"] <- "double"
}
## We can use the typed missing value here to test against
## storage.mode, to fill in missing values and also to pass
## through to vapply to get a vector of the expected type out:
missing <- switch(type,
string = NA_character_,
number = NA_real_,
boolean = NA,
time = NA_real_)
err <- !is_null & mode != storage.mode(missing)
if (any(err)) {
msg <- sprintf(
"Expected all values of '%s' to be %ss (or NULL)",
paste(from, collapse = "."), type)
examples <- sprintf("Found `%s` (a %s) for packet '%s'",
vcapply(value[err], deparse1, control = "keepNA"),
vcapply(value[err], storage_mode_scalar),
ids[err])
cli::cli_abort(c(msg, error_examples(examples)), call = call)
}
value[is_null] <- missing
value <- vapply(value, identity, missing)
if (type == "time") {
value <- num_to_time(value)
}
value
}
}
## Reverses the conversion in extract_convert()
storage_mode_scalar <- function(x) {
if (inherits(x, "POSIXt")) {
"time"
} else {
mode <- storage.mode(x)
switch(mode,
character = "string",
double = "number",
integer = "number",
logical = "boolean",
mode)
}
}
error_examples <- function(str, max = 3) {
if (length(str) > max) {
str <- c(str[seq_len(max)], sprintf("(...and %d more)", length(str) - max))
}
set_names(str, rep("i", length(str)))
}
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.