av: TABLE, DATA, files, bucket, runtime, and disk elements

av    R Documentation

TABLE, DATA, files, bucket, runtime, and disk elements

Description

avtables() describes tables available in a workspace. Tables can be visualized under the DATA tab, TABLES item. avtable() returns an AnVIL table. avtable_paged() retrieves an AnVIL table by requesting the table in 'chunks', and may be appropriate for large tables. avtable_import() imports a data.frame to an AnVIL table. avtable_import_set() imports set membership (i.e., a subset of an existing table) information to an AnVIL table. avtable_delete_values() removes rows from an AnVIL table.

avtable_import_status() queries for the status of an 'asynchronous' table import.

avdata() returns key-value tables representing the information visualized under the DATA tab, 'REFERENCE DATA' and 'OTHER DATA' items. avdata_import() updates (modifies or creates new, but does not delete) rows in 'REFERENCE DATA' or 'OTHER DATA' tables.

avbucket() returns the workspace bucket, i.e., the google bucket associated with a workspace. Bucket content can be visualized under the 'DATA' tab, 'Files' item.

avfiles_ls() returns the paths of files in the workspace bucket. avfiles_backup() copies files from the compute node file system to the workspace bucket. avfiles_restore() copies files from the workspace bucket to the compute node file system. avfiles_rm() removes files or directories from the workspace bucket.

avruntimes() returns a tibble containing information about runtimes (notebooks or RStudio instances, for example) that the current user has access to.

avruntime() returns a tibble with the runtimes associated with a particular google project and account number; usually there is a single runtime satisfying these criteria, and it is the runtime active in AnVIL.

avdisks() returns a tibble containing information about persistent disks associated with the current user.

Usage

avtables(namespace = avworkspace_namespace(), name = avworkspace_name())

avtable(
  table,
  namespace = avworkspace_namespace(),
  name = avworkspace_name(),
  na = c("", "NA")
)

avtable_paged(
  table,
  n = Inf,
  page = 1L,
  pageSize = 1000L,
  sortField = "name",
  sortDirection = c("asc", "desc"),
  filterTerms = character(),
  filterOperator = c("and", "or"),
  namespace = avworkspace_namespace(),
  name = avworkspace_name(),
  na = c("", "NA")
)

avtable_import(
  .data,
  entity = names(.data)[[1]],
  namespace = avworkspace_namespace(),
  name = avworkspace_name(),
  delete_empty_values = FALSE,
  na = "NA",
  n = Inf,
  page = 1L,
  pageSize = NULL
)

avtable_import_set(
  .data,
  origin,
  set = names(.data)[[1]],
  member = names(.data)[[2]],
  namespace = avworkspace_namespace(),
  name = avworkspace_name(),
  delete_empty_values = FALSE,
  na = "NA",
  n = Inf,
  page = 1L,
  pageSize = NULL
)

avtable_import_status(
  job_status,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avtable_delete_values(
  table,
  values,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avdata(namespace = avworkspace_namespace(), name = avworkspace_name())

avdata_import(
  .data,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avbucket(
  namespace = avworkspace_namespace(),
  name = avworkspace_name(),
  as_path = TRUE
)

avfiles_ls(
  path = "",
  full_names = FALSE,
  recursive = FALSE,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avfiles_backup(
  source,
  destination = "",
  recursive = FALSE,
  parallel = TRUE,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avfiles_restore(
  source,
  destination = ".",
  recursive = FALSE,
  parallel = TRUE,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avfiles_rm(
  source,
  recursive = FALSE,
  parallel = TRUE,
  namespace = avworkspace_namespace(),
  name = avworkspace_name()
)

avruntimes()

avruntime(project = gcloud_project(), account = gcloud_account())

avdisks()

Arguments

namespace

character(1) AnVIL workspace namespace as returned by, e.g., avworkspace_namespace().

name

character(1) AnVIL workspace name as returned by, e.g., avworkspace_name().

table

character(1) table name as returned by, e.g., avtables().

na

In avtable() and avtable_paged(), character() of strings to be interpreted as missing values. In avtable_import(), character(1) value to use for representing NA_character_. See Details.

n

numeric(1) maximum number of rows to return

page

integer(1) first page of iteration

pageSize

integer(1) number of records per page. Generally, larger page sizes are more efficient.

sortField

character(1) field used to sort records when determining page order. Default is the entity field.

sortDirection

character(1) direction to sort entities ("asc"ending or "desc"ending) when paging.

filterTerms

character(1) string literal to select rows with an exact (substring) match in a column.

filterOperator

character(1) operator to use when filterTerms= contains multiple terms, either "and" (default) or "or".

.data

A tibble or data.frame for import as an AnVIL table.

entity

character(1) column name of .data to be used as the imported table name. When the table comes from R, this is usually a column name such as sample. The data will be imported into AnVIL as a table sample, with the sample column included with suffix _id, e.g., sample_id. A column in .data with suffix _id can also be used, e.g., entity = "sample_id", creating the table sample with column sample_id in AnVIL. Finally, a value of entity that is not a column in .data, e.g., entity = "unknown", creates a new table named entity, with entity values seq_len(nrow(.data)).

delete_empty_values

logical(1) when TRUE, remove entities not included in .data from the DATA table. Default: FALSE.

origin

character(1) name of the entity (table) used to create the set, e.g., "sample", "participant", etc.

set

character(1) column name of .data identifying the set(s) to be created.

member

character(1) column name of .data containing entity identifiers from the avtable identified by origin. The values may repeat if an ID is in more than one set.

job_status

tibble() of job identifiers, returned by avtable_import() and avtable_import_set().

values

vector of values in the entity (key) column of table to be deleted. A table sample has an associated entity column with suffix ⁠_id⁠, e.g., sample_id. Rows with entity column entries matching values are deleted.

as_path

logical(1) when TRUE (default) return bucket with prefix ⁠gs://⁠ (for avbucket()) or ⁠gs://<bucket-id>⁠ (for avfiles_ls()).

path

For avfiles_ls(), the character(1) file or directory path to list. For avfiles_rm(), the character() (perhaps with length greater than 1) of file or directory paths to be removed. The elements of path can contain glob-style patterns, e.g., vign*.

full_names

logical(1) return names relative to path (FALSE, default) or root of the workspace bucket?

recursive

logical(1) list files recursively?

source

character() file paths. For avfiles_backup(), source can include directory names when recursive = TRUE.

destination

character(1) a google bucket (⁠gs://<bucket-id>/...⁠) to write files. The default is the workspace bucket.

parallel

logical(1) backup files using parallel transfer? See ?gsutil_cp().

project

character(1) project (billing account) name, as returned by, e.g., gcloud_project() or avworkspace_namespace().

account

character(1) google account (email address associated with billing account), as returned by gcloud_account().

Details

Missing values in avtable(), avtable_paged() and avtable_import() are handled by the na parameter.

avtable() may sometimes result in a curl error 'Error in curl::curl_fetch_memory' or an 'Internal Server Error (HTTP 500)'. This may be due to a server time-out when trying to read a large (more than 50,000 rows?) table; using avtable_paged() may address this problem.
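One way to act on the advice above is to fall back to paged retrieval only when the plain read fails. The following is a minimal sketch: read_table_with_fallback() is a hypothetical helper, not part of AnVIL, and its reader and paged_reader arguments exist only so that the logic can be exercised without a workspace.

```r
## Hypothetical helper (not part of AnVIL): try avtable() first, and
## retry with avtable_paged() if the first attempt signals an error.
read_table_with_fallback <-
    function(table,
             reader = AnVIL::avtable,
             paged_reader = AnVIL::avtable_paged)
{
    tryCatch(
        reader(table),
        error = function(e) {
            message(
                "avtable() failed (", conditionMessage(e),
                "); retrying with avtable_paged()"
            )
            paged_reader(table)
        }
    )
}

## Stand-in readers simulating a server time-out, for illustration only
failing_reader <- function(table) stop("Internal Server Error (HTTP 500)")
stub_paged_reader <- function(table) data.frame(name = table)

result <- read_table_with_fallback(
    "sample", reader = failing_reader, paged_reader = stub_paged_reader
)
```

Within AnVIL, calling read_table_with_fallback("sample") with the default arguments would use avtable() and avtable_paged() themselves.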

For avtable() and avtable_paged(), the default na = c("", "NA") treats empty cells or cells containing "NA" in a Terra data table as NA_character_ in R. Use na = character() to indicate no missing values, na = "NA" to retain the distinction between "" and NA_character_.

For avtable_import(), the default na = "NA" records NA_character_ in R as the character string "NA" in an AnVIL data table.

The default setting (na = "NA" in avtable_import(), na = c("", "NA") in avtable()) is appropriate to 'round-trip' data from R to AnVIL and back when character vectors contain only NA_character_. Use na = "NA" in both functions to round-trip data containing both NA_character_ and "". Use a distinct string, e.g., na = "__MISSING_VALUE__", for both arguments if the data contains a string "NA" as well as NA_character_.
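The effect of the na argument on a round trip can be imitated in base R. In this sketch, encode_na() and decode_na() are hypothetical stand-ins for the documented behavior of avtable_import() and avtable(); they are not the package's implementation.

```r
## Hypothetical stand-ins for the documented 'na' handling
encode_na <- function(x, na = "NA") {
    x[is.na(x)] <- na              # as documented for avtable_import(na = )
    x
}
decode_na <- function(x, na = c("", "NA")) {
    x[x %in% na] <- NA_character_  # as documented for avtable(na = )
    x
}

x <- c("a", "NA", NA_character_)

## Default settings: the literal string "NA" is read back as missing
default_trip <- decode_na(encode_na(x))

## A distinct sentinel keeps "NA" and NA_character_ apart
sentinel <- "__MISSING_VALUE__"
sentinel_trip <- decode_na(encode_na(x, na = sentinel), na = sentinel)
```

With the defaults, the second element of x is lost as a string; with the sentinel, sentinel_trip is identical to x.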

avtable_import() tries to work around limitations in .data size in the AnVIL platform, using pageSize (number of rows) to import so that approximately 1500000 elements (rows x columns) are uploaded per chunk. For large .data, a progress bar summarizes progress on the import. Individual chunks may nonetheless fail to upload, with common reasons being an internal server error (HTTP error code 500) or transient authorization failure (HTTP 401). In these and other cases avtable_import() reports the failed page(s) as warnings. The user can attempt to import these individually using the page argument. If many pages fail to import, a strategy might be to provide an explicit pageSize less than the automatically determined size.
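The chunking arithmetic can be sketched in base R. The 1500000-element target comes from the paragraph above; the rounding rule and the variable names here are assumptions for illustration, and the calculation inside avtable_import() may differ.

```r
## Assumed target from the text: about 1500000 cells (rows x columns) per chunk
max_elements <- 1500000L

## Illustrative table dimensions
n_row <- 250000L
n_col <- 40L

## Rows per page so that page_size * n_col stays at or below the target
page_size <- max(1L, max_elements %/% n_col)

## First and last row of each page
from <- seq(1L, n_row, by = page_size)
to <- pmin(from + page_size - 1L, n_row)
pages <- data.frame(page = seq_along(from), from_row = from, to_row = to)
```

If many pages fail, passing a smaller explicit pageSize to avtable_import() corresponds to lowering page_size in this calculation.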

avtable_import_set() creates new rows in a table ⁠<origin>_set⁠. One row will be created for each distinct value in the column identified by set. Each row entry has a corresponding column ⁠<origin>⁠ linking to one or more rows in the ⁠<origin>⁠ table, as given in the member column. The operation is somewhat like split(member, set).
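The split(member, set) analogy can be illustrated with a small data.frame in base R. The column values below are made up; only the grouping behavior is the point.

```r
## Illustrative membership table: one row per (set, member) pair
membership <- data.frame(
    set = c("A", "A", "B", "B", "B"),
    member = c("s1", "s2", "s2", "s3", "s4"),
    stringsAsFactors = FALSE
)

## One list element per distinct set, holding that set's members;
## member "s2" appears in both sets
sets <- split(membership$member, membership$set)
```

Each element of sets corresponds to one row of the <origin>_set table, linking to the listed rows of the <origin> table.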

avfiles_backup() can be used to back-up individual files or entire directories, recursively. When recursive = FALSE, files are backed up to the bucket with names approximately paste0(destination, "/", basename(source)). When recursive = TRUE and source is a directory path/to/foo/, files are backed up to bucket names that include the directory name, approximately paste0(destination, "/", dir(basename(source), full.names = TRUE)). Naming conventions are described in detail in gsutil_help("cp").
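The approximate naming rules can be tried out locally without touching a bucket. In this sketch the destination is a placeholder string, the file names are made up, and a temporary directory stands in for path/to/foo; actual names are determined by gsutil, as noted above.

```r
destination <- "gs://<bucket-id>"   # placeholder, not a real bucket

## recursive = FALSE: only the file name is kept
src <- c("path/to/a.txt", "other/b.csv")
flat_names <- paste0(destination, "/", basename(src))

## recursive = TRUE: the directory name is part of the bucket name.
## Build a throw-away directory 'foo' so that dir() has something to list.
root <- tempfile("backup-demo-")
dir.create(file.path(root, "foo"), recursive = TRUE)
invisible(file.create(file.path(root, "foo", c("x.txt", "y.txt"))))

old <- setwd(root)                  # dir() below is relative to 'root'
recursive_names <- paste0(
    destination, "/", dir(basename("path/to/foo"), full.names = TRUE)
)
setwd(old)
```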

avfiles_restore() behaves in a manner analogous to avfiles_backup(), copying files from the workspace bucket to the compute node file system.

Value

avtables(): A tibble with columns identifying the table, the number of records, and the column names.

avtable(): a tibble of data corresponding to the AnVIL table table in the specified workspace.

avtable_paged(): a tibble of data corresponding to the AnVIL table table in the specified workspace.

avtable_import() returns a tibble() containing the page number, 'from' and 'to' rows included in the page, job identifier, initial status of the uploaded 'chunks', and any (error) messages generated during status check. Use avtable_import_status() to query current status.

avtable_import_set() returns the character(1) name of the imported AnVIL table.

avtable_delete_values() returns a tibble representing deleted entities, invisibly.

avdata() returns a tibble with five columns: "type" represents the origin of the data from the 'REFERENCE' or 'OTHER' data menus; "table" is the table name in the REFERENCE menu, or 'workspace' for the table in the 'OTHER' menu; the remaining columns are the key used to access the data element, the value label associated with the data element, and the value (e.g., google bucket) of the element.

avdata_import() returns, invisibly, the subset of the input table used to update the AnVIL tables.

avbucket() returns a character(1) bucket identifier, prefixed with ⁠gs://⁠ if as_path = TRUE.

avfiles_ls() returns a character vector of files in the workspace bucket.

avfiles_backup() returns, invisibly, the status code of the gsutil_cp() command used to back up the files.

avfiles_rm() returns, on success, a list of the return codes of gsutil_rm(), invisibly.

avruntimes() returns a tibble with columns

  • id: integer() runtime identifier.

  • googleProject: character() billing account.

  • tool: character() e.g., "Jupyter", "RStudio".

  • status: character() e.g., "Stopped", "Running".

  • creator: character() AnVIL account, typically "user@gmail.com".

  • createdDate: character() creation date.

  • destroyedDate: character() destruction date, or NA.

  • dateAccessed: character() date of (first?) access.

  • runtimeName: character().

  • clusterServiceAccount: character() service ('pet') account for this runtime.

  • masterMachineType: character() (it is unclear which 'tool' populates which of the machineType columns).

  • workerMachineType: character().

  • machineType: character().

  • persistentDiskId: integer() identifier of persistent disk (see avdisks()), or NA.

avruntime() returns a tibble with the same structure as the return value of avruntimes().

avdisks() returns a tibble with columns

  • id: character() disk identifier.

  • googleProject: character() billing account.

  • status: e.g., "Ready".

  • size: integer() in GB.

  • diskType: character().

  • blockSize: integer().

  • creator: character() AnVIL account, typically "user@gmail.com".

  • createdDate: character() creation date.

  • destroyedDate: character() destruction date, or NA.

  • dateAccessed: character() date of (first?) access.

  • zone: character() e.g., "us-central1-a".

  • name: character().

Examples

## Not run: 
library(dplyr)                               # provides %>% and mutate()
## editable copy of '1000G-high-coverage-2019' workspace
avworkspace("bioconductor-rpci-anvil/1000G-high-coverage-2019")
sample <-
    avtable("sample") %>%                               # existing table
    mutate(set = sample(head(LETTERS), nrow(.), TRUE))  # arbitrary groups
sample %>%                                   # new 'participant_set' table
    avtable_import_set("participant", "set", "participant")
sample %>%                                   # new 'sample_set' table
    avtable_import_set("sample", "set", "name")

## End(Not run)

if (gcloud_exists() && nzchar(avworkspace_name())) {
    ## from within AnVIL
    data <- avdata()
    data
}

## Not run: 
avdata_import(data)

## End(Not run)

if (gcloud_exists() && nzchar(avworkspace_name()))
    ## From within AnVIL...
    bucket <- avbucket()                        # discover bucket

## Not run: 
path <- file.path(bucket, "mtcars.tab")
gsutil_ls(dirname(path))                    # no 'mtcars.tab'...
write.table(mtcars, gsutil_pipe(path, "w")) # write to bucket
gsutil_stat(path)                           # yep, there!
read.table(gsutil_pipe(path, "r"))          # read from bucket

## End(Not run)
if (gcloud_exists() && nzchar(avworkspace_name()))
    avfiles_ls()

## Not run: 
## backup all files in the current directory
## default buckets are gs://<bucket-id>/<file-names>
avfiles_backup(dir())
## backup working directory, recursively
## default buckets are gs://<bucket-id>/<basename(getwd())>/...
avfiles_backup(getwd(), recursive = TRUE)

## End(Not run)

if (gcloud_exists())
    ## from within AnVIL
    avruntimes()

if (gcloud_exists())
    ## from within AnVIL
    avdisks()


Bioconductor/AnVIL documentation built on Oct. 28, 2023, 10:17 a.m.