library(async)
library(desc)
library(jsonlite)

Find the fastest CRAN mirror in the US

Get the URLs of all US CRAN mirrors; we'll use these later.

mirrors <- getCRANmirrors()
mirror_urls <- mirrors$URL[mirrors$CountryCode == "us"]
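
To get a sense of the data, we can peek at the first few URLs. (The exact mirrors, and their order, may differ on your machine.)

head(mirror_urls, 3)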

response_time() will be an async function that returns the total response time of a HEAD request to a single URL.

http_stop_for_status() is similar to httr::stop_for_status(): it throws an error if the status code is 400 or higher. Should this (or any other) error happen, we just return Inf for that URL.

We also add the URL to the result, as a name.

response_time <- async(function(url) {
  http_head(url)$
    then(http_stop_for_status)$
    then(function(x) setNames(x[["times"]][["total"]], url))$
    catch(error = function(e) setNames(Inf, url))
})

Let's try it with a real CRAN URL, and with a test URL that returns an HTTP 404 status code.

synchronise(response_time(mirror_urls[[1]]))
synchronise(response_time("https://httpbin.org/status/404"))

fastest_urls() will run response_time() on all supplied URLs, and stop as soon as 10 of them respond. Then we sort the results, so the fastest response is first.

fastest_urls <- async(function(urls) {
  reqs <- lapply(urls, response_time)
  when_some(10, .list = reqs)$
    then(function(x) sort(unlist(x)))
})

Let's run it on the real data.

synchronise(fastest_urls(mirror_urls))
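
As an aside, if we wanted to wait for all mirrors instead of the first ten, we could swap when_some() for when_all(). This is just a sketch for illustration; all_times() is our own name, not part of the package. Failed mirrors still show up, with Inf response times, because response_time() catches its own errors.

all_times <- async(function(urls) {
  reqs <- lapply(urls, response_time)
  when_all(.list = reqs)$
    then(function(x) sort(unlist(x)))
})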

Resolve packages from GitHub and URLs

By resolving we mean (1) getting the package's DESCRIPTION file, to be able to look up its dependencies and other metadata, and (2) getting a download link, in case we want to download the package later.

For this example we'll implement two package types: (1) a package on GitHub, and (2) a package file at a URL.

We start with GitHub packages, as these are more complicated. First, we need to set the Accept HTTP header when we talk to the GitHub API, so we define a helper function for that. We also send a GitHub token to avoid GitHub rate limits; this should be in the GITHUB_PAT environment variable.

default_gh_headers <- function() {
  headers <- c("Accept" = "application/vnd.github.v3+json")
  pat <- Sys.getenv("GITHUB_PAT", "")
  if (pat != "") {
    headers <- c(headers, Authorization = paste("token", pat))
  }
  headers
}
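
A quick sanity check. We only print the header names here, to avoid echoing the token itself; the Authorization entry is present only if GITHUB_PAT is set.

names(default_gh_headers())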

Getting the DESCRIPTION file is easy: we just need to hit the right URL, and then extract the response. desc::desc() creates a nice desc::description object that can be queried.

We use http_stop_for_status() to error out if any HTTP error happens, e.g. the repo or the file does not exist, we cannot access it, etc. We don't catch this (or any other) error here, so get_gh_description() will throw the error to its $then() child (or to when_any() or when_all()), or to synchronise().

get_gh_description <- async(function(user, repo)  {
  desc_url <- paste0(
    "https://raw.githubusercontent.com/", user, "/", repo,
    "/HEAD/DESCRIPTION")
  http_get(desc_url, headers = default_gh_headers())$
    then(http_stop_for_status)$
    then(function(resp) rawToChar(resp$content))$
    then(function(txt) desc::desc(text = txt))
})

Test:

synchronise(get_gh_description("jeroen", "curl"))

Test a nonexistent repo, and other errors:

synchronise(get_gh_description("jeroen", "curlqwerty"))
synchronise(get_gh_description("git", "git"))

The next thing we need is the SHA of the commit we want to download. For simplicity, we always use the last commit of the default branch here. Otherwise this function is very similar to the previous one.

get_gh_sha <- async(function(user, repo) {
  commit_url <- paste0(
    "https://api.github.com/repos/", user, "/", repo, "/git/trees/HEAD")
  http_get(commit_url, headers = default_gh_headers())$
    then(http_stop_for_status)$
    then(function(resp) {
      cdata <- jsonlite::fromJSON(rawToChar(resp$content),
                                  simplifyVector = FALSE)
      cdata$sha
    })
})

Test:

synchronise(get_gh_sha("jeroen", "curl"))

A small helper function to create a GitHub download URL from the user name, repo name, and SHA hash:

get_gh_download_url <- function(user, repo, sha) {
  paste0("https://api.github.com/repos/", user, "/", repo,
         "/zipball/", sha)
}
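
For example, with a placeholder hash (not a real commit):

get_gh_download_url("jeroen", "curl", "0123456789abcdef")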

We are ready to write the GitHub resolver now. It takes a slug, i.e. the user/repo form, and then queries both the DESCRIPTION file and the SHA hash. when_all() resolves once both of them are done, and resolves to a named list with both pieces of data. Then we just create the download URL and return it together with the already parsed DESCRIPTION.

resolve_gh <- async(function(slug) {
  slug <- strsplit(slug, "/")[[1]]
  user <- slug[[1]]
  repo <- slug[[2]]
  desc <- get_gh_description(user, repo)
  sha  <- get_gh_sha(user, repo)
  when_all(desc = desc, sha = sha)$
    then(function(x) {
      list(
        url = get_gh_download_url(user, repo, x$sha),
        description = x$desc)
    })
})

Test:

synchronise(resolve_gh("jeroen/curl"))

Resolving a package at an HTTP URL is much simpler: it needs a single HTTP request, because we simply download the package to get the DESCRIPTION file. We place the downloaded package in a temporary file (with the same file name), and return a file:// URL to this file.

resolve_url <- async(function(url) {
  # download into a temporary directory, keeping the original file name
  dir.create(tmpdir <- tempfile())
  dest <- file.path(tmpdir, basename(url))
  http_get(url, file = dest)$
    then(http_stop_for_status)$
    then(function(resp) {
      dest <- normalizePath(dest)
      list(
        url = paste0("file://", dest),
        description = desc::desc(dest))
    })
})

Test:

curl20_url <- "https://cloud.r-project.org/src/contrib/Archive/curl/curl_2.0.tar.gz"
synchronise(resolve_url(curl20_url))

Now combine the two functions:

res <- synchronise(when_all(
  resolve_gh("jeroen/curl"),
  resolve_gh("ropensci/magick"),
  resolve_url(curl20_url)
))
length(res)
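
To check what we resolved, we can query the parsed DESCRIPTION objects, e.g. for the package names:

vapply(res, function(x) x$description$get("Package"), character(1))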

Errors bubble up to the top level: the other downloads are cancelled automatically (unless they have already completed, of course), and an error is thrown from synchronise().

tryCatch(
  res <- synchronise(when_all(
    resolve_gh("jeroen/curl"),
    resolve_gh("ropensci/magick"),
    resolve_url(curl20_url),
    resolve_url("https://httpbin.org/delay/2")
  )),
  error = function(e) e
)
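
Instead of tryCatch() at the synchronise() boundary, we could also attach $catch() to the when_all() deferred itself; then synchronise() returns the catch handler's value instead of throwing. A minimal sketch:

res <- synchronise(
  when_all(
    resolve_gh("jeroen/curl"),
    resolve_url("https://httpbin.org/delay/2")
  )$catch(error = function(e) NULL)
)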

But we can also handle errors locally, at any point we wish. For example, we can simply return NULL for the failed packages.

res <- synchronise(when_all(
  resolve_gh("jeroen/curl"),
  resolve_gh("ropensci/magickfoooooobar")$catch(error = function(e) NULL),
  resolve_url(curl20_url),
  resolve_url("https://httpbin.org/status/401")$catch(error = function(e) NULL)
))
res[[1]]$description$get("Package")
res[[2]]
res[[3]]$description$get("Version")
res[[4]]

