eval <- tryCatch({
    ## pre-requisites for computation of this vignette
    Gen3::authenticate()
    AnVIL::avworkspace_namespace('bioconductor-rpci-anvil')
    AnVIL::avworkspace_name('Bioconductor-Gen3-demo')
    TRUE
}, error = function(...) {
    FALSE
})
knitr::opts_chunk$set(eval = eval)
options(width = getOption("width") - 4L)

Introduction

The goal of this vignette is to illustrate how Gen3 can be queried for inforamtion about sequencing-related files associated with samples, and how this information can be used to copy files to workspace buckets (e.g., for ready access in workspaces that do not otherwise require Gen3 access) or runtime instances (e.g., for analysis of the file itself).

We assume familiarity with the 'Introduction to Gen3 in AnVIL' vignette included in this package.

Setup

## Ensure latest software versions
pkgs <- c("Bioconductor/Gen3", "Bioconductor/AnVIL")
BiocManager::install(pkgs)

Load the Gen3 and dplyr libraries.

library(Gen3)
library(dplyr)

Obtain Gen3 files

Authenticate and navigate projects

Start by authenticating, in the AnVIL environment or using the gcloud command-line API (internally, authenticate() uses the access token belonging to the active account returned by the command line gcloud auth list, so one should arrange for the active account to match the account registered with AnVIL / Terra).

authenticate()

Discover Gen3 projects available to your AnVIL account.

projects()

Discover available sequencing-related files

Query all projects for information about sequencing-related files. The first line represents fields of general relevance, the second line includes the object_id (essential for finding the location of associated files) as well as more human-friedly information about the files.

v <- values(
    "sequencing", "id", "project_id", "data_category",
    "object_id", "file_name", "file_size", "file_state",
    .n = 0
)
print(v)

For demonstration purposes we'll find the smallest file belonging to the open access 1000 Genomes project

smallest <-
    v %>%
    filter(project_id == "open_access-1000Genomes") %>%
    arrange(file_size) %>%
    select(object_id, file_name, file_size) %>%
    head(1)
smallest

More relevant might be the VCF files (and VCF file indexes) in the project.

vcf <-
    v %>%
    filter(endsWith(file_name, "vcf.gz") | endsWith(file_name, "vcf.gz.tbi"))
vcf

Again we could identify the smallest vcf file

smallest_vcf <-
    vcf %>%
    filter(endsWith(file_name, "vcf.gz")) %>%
    arrange(file_size) %>%
    select(object_id, file_name, file_size) %>%
    head(1)
smallest_vcf %>%
    t() %>%
    print()

Obtaining information about files associated with objects

N.B.: there is little value in copying files for no purpose; only copy files when the entire file must be located within the workspace, typically on the runtime compute instance.

The key information relating the metadata about the file with the file itself is the object_id. For the smallest file, the object_id is

object_id <-
    smallest %>%
    pull(object_id)
print(object_id)

Use the object_id to learn about the file, e.g., it's location, creation time, length (size) and type of content.

download_stat(object_id) %>%
    print()

For the smallest VCF file, we have

download_stat(smallest_vcf %>% pull(object_id))

Download files to local disk or to other Google buckets

Use download_object_id() to download objects to local disk by providing the name of a file or directory for the download. Files cannot already exist at that location.

tmp <- tempfile(); dir.create(tmp) # create a temporary direction
fl <- download_object_id(object_id, tmp)
print(fl) # file inside tmp

file.info(fl) %>%
    as_tibble(rownames = "file_name") %>%
    mutate(object_id = object_id, file_name = basename(file_name)) %>%
    select(object_id, everything())

Existing files cannot be over-written.

download_object_id(object_id, tmp)

The destination does not need to be a local file; it could instead be another google bucket, e.g., the bucket associated with the workspace

download_object_id(object_id, AnVIL::avbucket())
AnVIL::avfiles_ls()

Creating AnVIL entity tables summarizing Gen3 data

FIXME -- make vcf 'permanent' by creating a table of vcf 'entities' in the Terra Workspace GUI.

Session information

sessionInfo()


Bioconductor/Gen3 documentation built on Aug. 13, 2020, 4:13 p.m.