BiocFileCache: Managing File Resources Across Sessions

Overview

Organizing files on a local machine can be cumbersome, especially for local copies of remote resources that must periodically be downloaded again to stay current. BiocFileCache helps manage both local files and local copies of remote resources. It provides a convenient location to organize files, and once a resource has been added to the cache, the package provides functions to determine whether the remote source is out of date and requires a new download.

Installation and Loading

BiocFileCache is a Bioconductor package and can be installed through BiocManager::install().

if (!"BiocManager" %in% rownames(installed.packages()))
     install.packages("BiocManager")
BiocManager::install("BiocFileCache", dependencies=TRUE)

After the package is installed, it can be loaded into the R session with

library(BiocFileCache)

Creating / Loading the Cache

The initial step to utilizing BiocFileCache in managing files is to create a cache object specifying a location. We will create a temporary directory for use with examples in this vignette. If a path is not specified upon creation, the default location is a directory ~/.BiocFileCache in the typical user cache directory as defined by rappdirs::user_cache_dir().

path <- tempfile()
bfc <- BiocFileCache(path, ask = FALSE)

If the path location exists and has previously been used to store files, the previous object will be loaded along with any files saved to the cache. If the path location does not exist, the user will be prompted to create the new directory. If the session is not interactive, so the user cannot be prompted, or the user decides not to create the directory, a temporary directory will be used.
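As a minimal sketch of the reloading behaviour described above, the following creates a cache, adds one entry, and then constructs a second object on the same path. The names demo_path, bfc_a, and bfc_b are hypothetical and used only for illustration.

```r
library(BiocFileCache)

# Hypothetical location used only for this sketch
demo_path <- tempfile()

# The first call creates the cache directory (ask = FALSE avoids the prompt)
bfc_a <- BiocFileCache(demo_path, ask = FALSE)
invisible(bfcnew(bfc_a, "example-entry"))

# A second call on the same path loads the existing cache and its entries
bfc_b <- BiocFileCache(demo_path, ask = FALSE)
length(bfc_b)
```

Both objects point at the same on-disk cache, so the entry added through bfc_a is visible through bfc_b.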

Some utility functions to examine the cache are:

bfccache() will show the cache path. NOTE: Because we are using temporary directories, your path location will be different than shown.

bfccache(bfc)
length(bfc)

length() on a BiocFileCache will show the number of files currently being tracked by the BiocFileCache. For more detailed information on what is stored in the BiocFileCache object, there is a show method which will display the object, object class, cache path, and number of items currently being tracked.

bfc

bfcinfo() will list a table of BiocFileCache resource files being tracked in the cache. It returns a dplyr object of class tbl_sqlite.

bfcinfo(bfc)

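The table of resource files includes the following information: rid (the unique resource id), rname (the user-supplied name), create_time, access_time, rpath (the local path), rtype (the resource type), fpath (the original source path or URL), last_modified_time, etag, and expires.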

Now that we have created the cache object and location, let's explore adding files that the cache will manage!

Adding / Tracking Resources

Now that a BiocFileCache object and cache location have been created, files can be added to the cache for tracking. There are two functions to add a resource to the cache: bfcnew() and bfcadd().

The difference between the options: bfcnew() creates an entry for a resource and returns a filepath to save to. As there are many types of data that can be saved in many different ways, bfcnew() allows you to save any R data object in the appropriate manner and still be able to track the saved file. bfcadd() should be utilized when a file already exists or a remote resource is being accessed.

bfcnew() takes the BiocFileCache object and a user-specified rname and returns a path location to save data to. Optionally, you can add the file extension if you know the type of file that will be saved:

savepath <- bfcnew(bfc, "NewResource", ext=".RData")
savepath

## now we can use that path in any save function
m = matrix(1:12, nrow=3)
save(m, file=savepath)

## and that file will be tracked in the cache
bfcinfo(bfc)

bfcadd() is for existing files or remote resources. The user still specifies an rname of their choosing but must also specify the path to a local file or web resource as fpath. If no fpath is given, the rname is assumed to also be the path location.

If the fpath is a local file, the action argument determines how it is handled: copy copies the existing file into the cache directory, move moves the existing file into the cache directory, and asis leaves the file wherever it is on the local system while still tracking it through the cache object. copy and move rename the file to the generated cache file path.

If the fpath is a remote source, bfcadd() will attempt to download it. If the download succeeds, the file is saved in the cache location and tracked in the cache object, and the original source is recorded in the cache information as fpath. If the user does not want the remote resource downloaded initially, download=FALSE delays the download but still adds the resource to the cache.

Relative path locations may also be used, specified with rtype = "relative". This stores a location for the file relative to the cache; only the actions copy and move are available for relative paths.

First let's use local files:

fl1 <- tempfile(); file.create(fl1)
add2 <- bfcadd(bfc, "Test_addCopy", fl1)                 # copy
# returns filepath being tracked in cache
add2
# the name is the unique rid in the cache
rid2 <- names(add2)

fl2 <- tempfile(); file.create(fl2)
add3 <- bfcadd(bfc, "Test2_addMove", fl2, action="move") # move
rid3 <- names(add3)

fl3 <- tempfile(); file.create(fl3)
add4 <- bfcadd(bfc, "Test3_addAsis", fl3, rtype="local",
           action="asis") # reference
rid4 <- names(add4)

file.exists(fl1)    # TRUE - copied from original location
file.exists(fl2)    # FALSE - moved from original location
file.exists(fl3)    # TRUE - left asis, original location tracked
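The relative option mentioned above is not exercised by the examples so far. The following is a minimal sketch, assuming a separate throwaway cache (bfc_rel) and a small text file (src), both hypothetical names, so that the rids used elsewhere in this vignette are not affected.

```r
library(BiocFileCache)

# Separate throwaway cache so the main example cache is untouched
bfc_rel <- BiocFileCache(tempfile(), ask = FALSE)

src <- tempfile(fileext = ".txt")
writeLines("example", src)

# Track the file with a path stored relative to the cache directory;
# only action = "copy" or "move" is allowed for relative paths
rel <- bfcadd(bfc_rel, "Test_addRelative", src,
              rtype = "relative", action = "copy")

rel                       # path of the copy inside the cache
bfcinfo(bfc_rel)$rtype    # how the resource type is recorded
```

Because the location is stored relative to the cache, this form is the natural choice when the cache directory may later be moved or exported.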

Now let's add some examples with remote sources:

url <- "http://httpbin.org/get"
add5 <- bfcadd(bfc, "TestWeb", fpath=url)
rid5 <- names(add5)

url2 <- "https://en.wikipedia.org/wiki/Bioconductor"
add6 <- bfcadd(bfc, "TestWeb", fpath=url2)
rid6 <- names(add6)

# add a remote resource but don't initially download
add7 <- bfcadd(bfc, "TestNoDweb", fpath=url2, download=FALSE)
rid7 <- names(add7)
# let's look at our BiocFileCache object now
bfc
bfcinfo(bfc)

Now that we are tracking resources, let's explore accessing their information!

Investigating / Accessing Resources

Before exploring individual resources, we introduce a helper function. Most of the functions provided require the unique rid[s] assigned to a resource. bfcadd() and bfcnew() return the path as a named character vector whose name is the rid. However, you may want to access a resource that you added some time ago.

bfcquery() will take in a key word and search across the rname, rpath, and fpath for any matching entries. The columns that are searched can be controlled with the argument field.

bfcquery(bfc, "Web")

bfcquery(bfc, "copy")

q1 <- bfcquery(bfc, "wiki")
q1
class(q1)

As you can see above, bfcquery() returns an object of class tbl_sql, which can be investigated further using methods for this class, such as those in the dplyr package. The rid, shown in the first column of the table, can be used in other functions. To get a quick count of how many objects in the cache matched the query, use bfccount().

bfccount(q1)
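As a small illustration of the field argument, the sketch below restricts a search to the rname column. It uses its own throwaway cache (bfc_q) with two placeholder entries; these names are hypothetical and chosen only for this example.

```r
library(BiocFileCache)

# Throwaway cache with two placeholder entries
bfc_q <- BiocFileCache(tempfile(), ask = FALSE)
invisible(bfcnew(bfc_q, "alpha-web-notes"))
invisible(bfcnew(bfc_q, "beta-results"))

# Search only the rname column, ignoring rpath and fpath
hits <- bfcquery(bfc_q, "web", field = "rname")
hits$rname
bfccount(hits)
```

Restricting the field avoids accidental matches against directory names in rpath or against URLs in fpath.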

[ allows for subsetting of the BiocFileCache object. The output will be a BiocFileSubCache object. Users will still be able to query, remove (from the subset object only), and access resources of the subset, however the resources cannot be updated.

bfcsubWeb = bfc[paste0("BFC", 5:6)]
bfcsubWeb
bfcinfo(bfcsubWeb)

There are three methods for retrieving the BiocFileCache resource path location: [[, bfcpath(), and bfcrpath().

[[ accesses the rpath saved in the BiocFileCache. It returns the path to the local version of the resource, which the user can then pass to whichever load/read method is most appropriate for the resource. bfcpath() and bfcrpath() both return a named character vector that also gives the local file path for retrieval. bfcpath() requires rids, while bfcrpath() can use rids or rnames (but not both). bfcrpath() can also add a resource to the cache when rnames are specified: if an element in rnames is not found, it will try to add it to the cache with bfcadd().

bfc[["BFC2"]]
bfcpath(bfc, "BFC2")
bfcpath(bfc, "BFC5")
bfcrpath(bfc, rids="BFC5")
bfcrpath(bfc)
bfcrpath(bfc, c("http://httpbin.org/get","Test3_addAsis"))

Managing remote resources locally involves knowing when to update the local copy of the data.

bfcneedsupdate() compares the etag and last_modified time of the local copy with the etag and last_modified time of the remote resource, and also checks an expires time. The cache saves this information when the web resource is initially added. The expires time is checked against the current Sys.time() to see whether the local resource has expired; if so, the resource is deemed to need an update. If the expires time is unavailable or has not passed, the etag and last_modified time are checked. The etag is used definitively if it is available; otherwise the last_modified time is checked. If the resource has neither an etag nor a last_modified time, the result is undetermined (NA). If the resource has not been downloaded yet, the result is TRUE.

Note: This function does not automatically download the remote source if it is out of date. Please see bfcdownload().

bfcneedsupdate(bfc, "BFC5")
bfcneedsupdate(bfc, "BFC6")
bfcneedsupdate(bfc)
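The decision order described for bfcneedsupdate() can be sketched in plain R. This is an illustration of the described logic only, not the package's implementation: the function needs_update_sketch() and the list fields it reads (downloaded, expires, etag, last_modified) are hypothetical.

```r
# Hypothetical helper mirroring the decision order described above.
# `local` and `remote` are lists with optional fields etag and
# last_modified (POSIXct); `local` may also carry expires (POSIXct)
# and downloaded (logical).
needs_update_sketch <- function(local, remote, now = Sys.time()) {
    if (!isTRUE(local$downloaded))
        return(TRUE)                       # never downloaded
    if (!is.null(local$expires) && !is.na(local$expires) &&
        local$expires < now)
        return(TRUE)                       # expired
    if (!is.null(local$etag) && !is.null(remote$etag))
        return(!identical(local$etag, remote$etag))  # etag is decisive
    if (!is.null(local$last_modified) && !is.null(remote$last_modified))
        return(remote$last_modified > local$last_modified)
    NA                                     # cannot be determined
}

same_etag <- needs_update_sketch(list(downloaded = TRUE, etag = "abc"),
                                 list(etag = "abc"))
new_etag  <- needs_update_sketch(list(downloaded = TRUE, etag = "abc"),
                                 list(etag = "xyz"))
unknown   <- needs_update_sketch(list(downloaded = TRUE), list())
pending   <- needs_update_sketch(list(downloaded = FALSE), list())
```

The three possible outcomes (TRUE, FALSE, NA) are why callers typically test with !isFALSE() or handle NA explicitly before deciding to download.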

Updating Resource Entries or Local Copy of Remote Data

Just as you could access the rpath with [[, the local resource path can be set with [[<-.

The file must exist in order to be replaced in the BiocFileCache. If the user wishes to rename, they must make a copy (or touch) the file first.

fileBeingReplaced <- bfc[[rid3]]
fileBeingReplaced

# fl3 was created when we were adding resources
fl3

bfc[[rid3]] <- fl3
bfc[[rid3]]

The user may also wish to change the rname or fpath associated with a resource in addition to the rpath. This can be done with bfcupdate().

Again, if changing the rpath the file must exist. If an fpath is being updated, the data will be downloaded and the user will be prompted to overwrite the current file specified in rpath. If the user does not want to be prompted about overwriting files, ask=FALSE may be used.

bfcinfo(bfc, "BFC1")
bfcupdate(bfc, "BFC1", rname="FirstEntry")
bfcinfo(bfc, "BFC1")

Now let's update a web resource

suppressPackageStartupMessages({
    library(dplyr)
})
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)
bfcupdate(bfc, "BFC6", fpath=url, rname="Duplicate", ask=FALSE)
bfcinfo(bfc, "BFC6") %>% select(rid, rpath, fpath)

Lastly, remote resources may require an update if the data is out of date (see bfcneedsupdate()). The bfcdownload() function will attempt to download from the original resource saved in the cache as fpath and overwrite the out-of-date file at rpath.

The following confirms that the resource needs updating and then performs the update:

rid <- "BFC5"
test <- !identical(bfcneedsupdate(bfc, rid), FALSE) # 'TRUE' or 'NA'
if (test)
    bfcdownload(bfc, rid, ask=FALSE)

Adding MetaData

The following functions are provided for metadata: bfcmeta()<-, bfcmetalist(), bfcmeta(), and bfcmetaremove().

Additional metadata can be added as data.frames that become tables in the sql database. The data.frame must contain a column rid that matches the rid column in the cache. Any metadata added will then be displayed when accessing the cache. Metadata is added with bfcmeta()<-. A table name must be provided as an argument. Users can add multiple metadata tables as long as the names are unique. Tables may be appended or overwritten using additional arguments append=TRUE or overwrite=TRUE.

names(bfcinfo(bfc))
meta <- as.data.frame(list(rid=bfcrid(bfc)[1:3], idx=1:3))
bfcmeta(bfc, name="resourceData") <- meta
names(bfcinfo(bfc))

The metadata tables that exist can be listed with bfcmetalist() and can be retrieved with bfcmeta().

bfcmetalist(bfc)
bfcmeta(bfc, name="resourceData")
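The overwrite option mentioned above can be sketched as follows. The example assumes its own throwaway cache (bfc_m) and a hypothetical table name batchInfo; the entries are placeholders created with bfcnew().

```r
library(BiocFileCache)

# Throwaway cache with three placeholder entries
bfc_m <- BiocFileCache(tempfile(), ask = FALSE)
for (nm in c("a", "b", "c")) invisible(bfcnew(bfc_m, nm))

meta1 <- data.frame(rid = bfcrid(bfc_m), batch = 1:3,
                    stringsAsFactors = FALSE)
bfcmeta(bfc_m, name = "batchInfo") <- meta1

# Replace the existing table; reusing a table name requires either
# overwrite = TRUE or append = TRUE
meta2 <- data.frame(rid = bfcrid(bfc_m), batch = c(10L, 20L, 30L),
                    stringsAsFactors = FALSE)
bfcmeta(bfc_m, name = "batchInfo", overwrite = TRUE) <- meta2

batch_now <- bfcmeta(bfc_m, name = "batchInfo")
batch_now
```

Using append=TRUE instead would add the new rows to the existing table rather than replacing it.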

Lastly, metadata can be removed with bfcmetaremove().

bfcmetaremove(bfc, name="resourceData")

Note:

Quick implementations of all the functions exist: if you do not specify a BiocFileCache object, they operate on BiocFileCache(). This option is not available for bfcmeta()<-. That function must always be given a BiocFileCache object, by first defining a variable and then passing that variable to the function.

Example of ERROR:

bfcmeta(name="resourceData") <- meta
Error in bfcmeta(name = "resourceData") <- meta :
  target of assignment expands to non-language object

Correct implementation:

bfc <- BiocFileCache()
bfcmeta(bfc, name="resourceData") <- meta

All other functions have a default: if the BiocFileCache object is missing, they operate on the default cache BiocFileCache().

Removing Resources

Now that we have added resources, it is also possible to remove a resource.

When you remove a resource from the cache, it will also delete the local file but only if it is stored in the cache directory as given by bfccache(bfc). If it is a path to a file somewhere else on the user system, it will only be removed from the BiocFileCache object but the file not deleted.

# let's remind ourselves of our object
bfc

bfcremove(bfc, "BFC6")
bfcremove(bfc, "BFC1")

# let's look at our BiocFileCache object now
bfc
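The difference in how removal treats files inside and outside the cache directory can be checked directly. This is a minimal sketch on a separate throwaway cache (bfc_r); the file and variable names are hypothetical.

```r
library(BiocFileCache)

bfc_r <- BiocFileCache(tempfile(), ask = FALSE)

f_copy <- tempfile(); file.create(f_copy)
f_asis <- tempfile(); file.create(f_asis)

# One resource copied into the cache, one tracked where it already lives
in_cache <- bfcadd(bfc_r, "copied", f_copy)
outside  <- bfcadd(bfc_r, "referenced", f_asis,
                   rtype = "local", action = "asis")

bfcremove(bfc_r, names(in_cache))
bfcremove(bfc_r, names(outside))

file.exists(in_cache)   # cached copy: expected to be deleted
file.exists(f_asis)     # external file: expected to remain
length(bfc_r)           # both entries are gone from the cache object
```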

There is another helper function that may be of use: bfcsync().

This function checks two things:

  1. Whether any rpath cannot be found (this would occur if bfcnew() was used and the returned path was never used to save an object)
  2. Whether there are files in the cache directory (bfccache(bfc)) that are not being tracked by the BiocFileCache object

# create a new entry that hasn't been used
path <- bfcnew(bfc, "UseMe")
rmMe <- names(path)
# We also have a file not being tracked because we updated rpath

bfcsync(bfc)

# you can suppress the messages and just have a TRUE/FALSE
bfcsync(bfc, FALSE)

#
# Let's do some cleaning to have a synced object
#
bfcremove(bfc, rmMe)
unlink(fileBeingReplaced)

bfcsync(bfc)

Exporting and Importing Cache

There is a helper function to export a BiocFileCache and associated files as a tar or zip archive as well as the appropriate import function.

The exportbfc() function takes a BiocFileCache object or subsetted object and creates a tar or zip archive that can then be shared with collaborators on other computer systems. The user can choose where the archive is created with outputFile; by default the archive is named BiocFileCacheExport.tar and written to the current working directory. By default a tar archive is created, but the user can create a zip archive instead using the argument outputMethod="zip". Any additional arguments to utils::zip or utils::tar may also be used.

The following are some example calls:

# export entire biocfilecache
exportbfc(bfc)

# export the first 4 entries of biocfilecache
# as a compressed tar
exportbfc(bfc, rids=paste0("BFC", 1:4),
      outputFile="BiocFileCacheExport.tar.gz", compression="gzip")

# export the subsetted object of web resources as zip
sub1 <- bfc[bfcrid(bfcquery(bfc, "web", field='rtype'))]
exportbfc(sub1, outputFile = "BiocFileCacheExportWeb.zip",
      outputMethod="zip")

Once inflated on a user's system, the archive provides a fully functional copy of the sent cache. The archive can be extracted manually and the path used in the constructor BiocFileCache(), or for convenience the function importbfc() may be used. importbfc() takes a path to the appropriate tar or zip file, the argument archiveMethod indicating whether untar or unzip should be used (the default is untar), a path to where the archive should be extracted as exdir, and any additional arguments to utils::untar or utils::unzip. The function will extract the files and load the associated BiocFileCache object into the R session.

The following are example calls to load the above example exported objects:

bfc <- importbfc("BiocFileCacheExport.tar")

bfc2 <- importbfc("BiocFileCacheExport.tar.gz", compression="gzip")

bfc3 <- importbfc("BiocFileCacheExportWeb.zip", archiveMethod="unzip")

Creating a Cache from Existing Data

Helper functions are provided to convert existing data to a BiocFileCache, including makeBiocFileCacheFromDataFrame().

These functions may take a while to run if there are many resources; however, if the BiocFileCache is stored in a permanent location they will only need to be run once.

Create a BiocFileCache from an Existing data.frame

makeBiocFileCacheFromDataFrame() takes an existing data.frame and creates a BiocFileCache object. The cache location can be specified with the cache argument. The cache must not already exist, and the user will be prompted to create the location. If the user answers 'N', the cache will be created in a temporary directory and the function will have to be run again in a new R session.

The original data.frame must contain the required BiocFileCache columns rtype, rpath, and fpath, as described in the section "Creating / Loading the Cache". The optional columns rname, last_modified_time, etag, and expires may also be specified; they are not required and will be populated with defaults if missing.

For resources with rtype="local", actionLocal controls whether the local copy of the file is copied or moved to the cache location, or left asis on the local system; a local copy of the file must exist if the resource is identified as rtype="local". For resources with rtype="web", actionWeb controls whether the local copy of the remote file is copied or moved to the cache location. BiocFileCache requires that all remote resources keep their local copy in the cache location. A local copy of the file does not have to exist and can be downloaded into the cache at a later time.

Any columns of the original data.frame beyond the required or optional BiocFileCache columns are separated and added to the BiocFileCache as a metadata table with the name given as metadataName. See the section "Adding MetaData".

The following is an example data.frame with minimal columns 'rtype', 'rpath', and 'fpath' and one additional column that will become metadata 'keywords'. The 'rpath' can be NA as these are remote resources (rtype='web') that have not been downloaded yet.

tbl <- data.frame(rtype=c("web","web"),
              rpath=c(NA_character_,NA_character_),
          fpath=c("http://httpbin.org/get",
              "https://en.wikipedia.org/wiki/Bioconductor"),
              keywords = c("httpbin", "wiki"), stringsAsFactors=FALSE)
tbl
newbfc <- makeBiocFileCacheFromDataFrame(tbl,
                     cache=file.path(tempdir(),"BFC"),
                     actionWeb="copy",
                     actionLocal="copy",
                     metadataName="resourceMetadata")

Cleaning or Removing Cache

Finally, there are two functions involved with cleaning or deleting the cache: cleanbfc() and removebfc().

cleanbfc() will evaluate the resources in the BiocFileCache object and determine which, if any, have not been created, redownloaded, or updated in a specified number of days. If ask=TRUE, the user is asked, for each entry above that threshold, whether it should be removed from the cache object and the file deleted (the file is only deleted if it is in the bfccache(bfc) location). If ask=FALSE, entries are removed and files deleted automatically without asking. The default number of days is 120. If a resource has not needed any updates, this function could give a false positive. It also does not take into account how many times the resource was loaded by retrieving the path (i.e., via [[, bfcpath(), or bfcrpath()), so it may not be an accurate indication of how often the resource is used. Please use this function with caution.

cleanbfc(bfc)
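A non-interactive call with an explicit threshold might look like the following sketch. It runs against a separate throwaway cache (bfc_c, a hypothetical name) so that nothing in the example cache is removed; the 30-day threshold is an arbitrary choice for illustration.

```r
library(BiocFileCache)

bfc_c <- BiocFileCache(tempfile(), ask = FALSE)
invisible(bfcnew(bfc_c, "recent-entry"))

# Remove, without prompting, anything older than the 30-day threshold
cleanbfc(bfc_c, days = 30, ask = FALSE)

length(bfc_c)   # the entry created moments ago should survive
```

With ask=FALSE there is no confirmation step, so it is worth checking the threshold carefully before running this on a cache that matters.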

removebfc() will remove the BiocFileCache completely from the system. Any files saved in the bfccache(bfc) directory will also be deleted.

removebfc(bfc)

Note: Use with caution!

Use Cases

Local cache of an internet resource

One use for BiocFileCache is to save local copies of remote resources. The benefits of this approach include reproducibility, faster access, and access (once cached) without need for an internet connection. An example is an Ensembl GTF file (also available via AnnotationHub):

## paste to avoid long line in vignette
url <- paste(
    "ftp://ftp.ensembl.org/pub/release-71/gtf",
    "homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz",
    sep="/")

For a system-wide cache, simply load the BiocFileCache package and ask for the local resource path (rpath) of the resource.

library(BiocFileCache)
bfc <- BiocFileCache()
path <- bfcrpath(bfc, url)

Use the path returned by bfcrpath() as usual, e.g.,

gtf <- rtracklayer::import.gff(path)

A more compact form, which works the first time and every time after, is

gtf <- rtracklayer::import.gff(bfcrpath(BiocFileCache(), url))

Ensembl releases do not change with time, so there is no need to check whether the cached resource needs to be updated.

Cache of experimental computations

One might use BiocFileCache to cache results from experimental analysis. The rname field provides an opportunity to provide descriptive metadata to help manage collections of resources, without relying on cryptic file naming conventions.

Here we create or use a local file cache in the directory in which we are doing our analysis.

library(BiocFileCache)
bfc <- BiocFileCache("~/my-experiment/results")

We perform our analysis...

suppressPackageStartupMessages({
    library(DESeq2)
    library(airway)
})
data(airway)
dds <- DESeqDataSet(airway, design = ~ cell + dex)
result <- DESeq(dds)

...and then save our result in a location provided by BiocFileCache.

saveRDS(result, bfcnew(bfc, "airway / DESeq standard analysis"))

Retrieve the result at a later date

result <- readRDS(bfcrpath(bfc, "airway / DESeq standard analysis"))
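The save-then-retrieve pattern above can be wrapped in a small compute-or-load helper. The following is a sketch under stated assumptions: cached_result() is a hypothetical function, not part of BiocFileCache, and it treats the key as a plain string that is also safe to use as a search pattern in bfcquery().

```r
library(BiocFileCache)

# Hypothetical helper: load the result tracked under `key` if there is
# one, otherwise compute it, save it, and start tracking it.
cached_result <- function(bfc, key, compute) {
    hits <- bfcquery(bfc, key, field = "rname")
    rid <- hits$rid[hits$rname == key]     # keep exact name matches only
    if (length(rid) == 0L) {
        path <- bfcnew(bfc, key, ext = ".rds")
        saveRDS(compute(), path)
    } else {
        path <- bfcrpath(bfc, rids = rid[1])
    }
    readRDS(path)
}

bfc_x <- BiocFileCache(tempfile(), ask = FALSE)

# First call computes and stores; the second call only reads from disk
first  <- cached_result(bfc_x, "toy-analysis", function() sum(1:10))
second <- cached_result(bfc_x, "toy-analysis", function() stop("not re-run"))
```

The query is used instead of calling bfcrpath() with the name directly because, as noted earlier, bfcrpath() tries to add a resource when a name is not found.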

One might imagine the following workflow:

suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
    library(dplyr)    # provides %>%, filter() and collect() used below
})

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path)

# the web resource of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

# check if url is being tracked
res <- bfcquery(bfc, url)

if (bfccount(res) == 0L) {

    # if it is not in cache, add
    ans <- bfcadd(bfc, rname="ensembl, homo sapien", fpath=url)

} else {

  # if it is in cache, get path to load
  rid = res %>% filter(fpath == url) %>% collect(Inf) %>% `[[`("rid")
  ans <- bfcrpath(bfc, rid)

  # check to see if the resource needs to be updated
  check <- bfcneedsupdate(bfc, rid)
  # check can be NA if it cannot be determined, choose how to handle
  if (is.na(check)) check <- TRUE
  if (check){
    ans <- bfcdownload(bfc, rid)
  }
}

# ans is the path of the file to load
ans

# we know because we search for the url that the file is a .gtf.gz,
# if we searched on other terms we can use 'bfcpath' to see the
# original fpath to know the appropriate load/read/import method
bfcpath(bfc, names(ans))

temp = GTFFile(ans)
info = import(temp)
#
# A simpler way to test whether something is in the cache,
# and to start tracking it if not, is to use `bfcrpath`
#

suppressPackageStartupMessages({
    library(BiocFileCache)
    library(rtracklayer)
})

# load the cache
path <- file.path(tempdir(), "tempCacheDir")
bfc <- BiocFileCache(path, ask=FALSE)

# the web resources of interest
url <- "ftp://ftp.ensembl.org/pub/release-71/gtf/homo_sapiens/Homo_sapiens.GRCh37.71.gtf.gz"

url2 <- "ftp://ftp.ensembl.org/pub/release-71/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_5.0.71.gtf.gz"

# if not in cache will download and create new entry
pathsToLoad <- bfcrpath(bfc, c(url, url2))

pathsToLoad

# now load files as you see fit
info = import(GTFFile(pathsToLoad[1]))
class(info)
summary(info)
#
# One could also imagine the following:
#

library(BiocFileCache)

# load the cache
bfc <- BiocFileCache()

#
# Do some work!
#

# add a location in the cache
filepath <- bfcnew(bfc, "R workspace")

save(list = ls(), file=filepath)

# now the R workspace is being tracked in the cache

Cache to manage package data

A package may desire to use BiocFileCache to manage remote data. The following is example code providing some best practice guidelines.

  1. Creating the cache

The cache may be needed in a variety of places within the package code, examples, and vignette, so it is desirable to have a wrapper around the BiocFileCache constructor. The following is a suggested example for a package called MyNewPackage:

.get_cache <-
    function()
{
    cache <- rappdirs::user_cache_dir(appname="MyNewPackage")
    BiocFileCache::BiocFileCache(cache)
}

Essentially this will create a unique cache for the package. If run interactively, the user will have the option to permanently create the package cache, else a temporary directory will be used.

  2. Resources in the cache

Managing remote resources then involves a function that queries whether the resource has been added. If it has not, the function adds it to the cache; if it has, the function checks whether the file needs to be updated.

download_data_file <-
    function( verbose = FALSE )
{
    fileURL <- "http://a_path_to/someremotefile.tsv.gz"

    bfc <- .get_cache()
    rid <- bfcquery(bfc, "geneFileV2", "rname")$rid
    if (!length(rid)) {
        if (verbose)
            message("Downloading GENE file")
        rid <- names(bfcadd(bfc, "geneFileV2", fileURL))
    }
    if (!isFALSE(bfcneedsupdate(bfc, rid)))
        bfcdownload(bfc, rid)

    bfcrpath(bfc, rids = rid)
}

Processing web resources before caching

In some cases it may be desirable to process a web-based resource before saving it in the cache. This can be done through specific options of the bfcadd() and bfcdownload() functions.

  1. Add the resource with bfcadd() using the download=FALSE argument.
  2. Download the resource with bfcdownload() using the FUN argument.

The FUN argument is the name of a function to be applied before saving the downloaded file into the cache. The default is file.rename, which simply moves the downloaded file into the cache. A user-supplied function must take ONLY two arguments. When invoked, the arguments will be:

  1. character(1) A temporary file containing the resource as retrieved from the web.
  2. character(1) The BiocFileCache location where the processed file should be saved.

The function should return TRUE on success or a character(1) description of the failure on error. As an example:

url <- "http://bioconductor.org/packages/stats/bioc/BiocFileCache/BiocFileCache_stats.tab"

headFile <-                         # how to process file before caching
    function(from, to)
{
    dat <- readLines(from)
    writeLines(head(dat), to)
    TRUE
}

rid <- bfcquery(bfc, url, "fpath")$rid
if (!length(rid))                   # not in cache, add but do not download
    rid <- names(bfcadd(bfc, url, download = FALSE))

update <- bfcneedsupdate(bfc, rid)  # TRUE if newly added or stale
if (!isFALSE(update))               # download & process
    bfcdownload(bfc, rid, ask = FALSE, FUN = headFile)

rpath <- bfcrpath(bfc, rids=rid)    # path to processed result
readLines(rpath)                    # read processed result
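A processing function can also report failure through its return value, following the contract described above. The sketch below is a hypothetical FUN, gunzipText(), that stores a gzip-compressed text download in uncompressed form; it relies on base R file connections reading gzip input transparently, and it is exercised here on local temporary files rather than a real download.

```r
# Hypothetical FUN: store a gzip-compressed text file uncompressed.
# Returns TRUE on success or a character(1) message on failure.
gunzipText <- function(from, to) {
    tryCatch({
        lines <- readLines(from)   # gzip input is decompressed on read
        writeLines(lines, to)
        TRUE
    }, error = function(e) conditionMessage(e))
}

# Local stand-in for a downloaded file
src <- tempfile(fileext = ".gz")
con <- gzfile(src, "w")
writeLines(c("first line", "second line"), con)
close(con)

dest <- tempfile()
ok <- gunzipText(src, dest)
readLines(dest)

# A missing input yields a message instead of an error
bad <- suppressWarnings(gunzipText(tempfile(), tempfile()))
```

Such a function would be supplied in the same way as headFile above, for example bfcdownload(bfc, rid, ask = FALSE, FUN = gunzipText).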

Note: By default bfcadd() uses the name of the web file as the name of the saved local file. If the processing step saves the data in a different format, use the bfcadd() argument ext to assign an extension identifying the type of file that was saved. For example:

url <- "http://httpbin.org/get"
bfcadd(bfc, "myfile", url, download=FALSE)
# would save a file `<uniqueid>_get` in the cache
bfcadd(bfc, "myfile", url, download=FALSE, ext=".Rdata")
# would save a file `<uniqueid>_get.Rdata` in the cache

Summary

It is our hope that this package allows for easier management of local and remote resources.



BiocFileCache documentation built on Nov. 8, 2020, 5:06 p.m.