multi_download: Advanced download interface

View source: R/multi_download.R

multi_downloadR Documentation

Advanced download interface

Description

Download multiple files concurrently, with support for resuming large files. This function is based on multi_run() and hence does not error in case any of the individual requests fail; you should inspect the return value to find out which of the downloads were completed successfully.

Usage

multi_download(
  urls,
  destfiles = NULL,
  resume = FALSE,
  progress = TRUE,
  timeout = Inf,
  multiplex = FALSE,
  ...
)

Arguments

urls

vector with files to download

destfiles

vector (of equal length as urls) with paths of output files, or NULL to use basename of urls.

resume

if the file already exists, resume the download. Note that this may change server responses, see details.

progress

print download progress information

timeout

in seconds, passed to multi_run

multiplex

passed to new_pool

...

extra handle options passed to each request new_handle

Details

Upon completion of all requests, this function returns a data frame with results. The success column indicates if a request was successfully completed (regardless of the HTTP status code). If it failed, e.g. due to a networking issue, the error message is in the error column. A success value NA indicates that the request was still in progress when the function was interrupted or reached the elapsed timeout and perhaps the download can be resumed if the server supports it.

It is also important to inspect the status_code column to see if any of the requests were successful but had a non-success HTTP code, and hence the downloaded file probably contains an error page instead of the requested content.

Note that when you set resume = TRUE you should expect HTTP-206 or HTTP-416 responses. The latter could indicate that the file was already complete, hence there was no content left to resume from the server. If you try to resume a file download but the server does not support this, success if FALSE and the file will not be touched. In fact, if we request to a download to be resumed and the server responds ⁠HTTP 200⁠ instead of ⁠HTTP 206⁠, libcurl will error and not download anything, because this probably means the server did not respect our range request and is sending us the full file.

About HTTP/2

Availability of HTTP/2 can increase the performance when making many parallel requests to a server, because HTTP/2 can multiplex many requests over a single TCP connection. Support for HTTP/2 depends on the version of libcurl that your system has, and the TLS back-end that is in use, check curl_version. For clients or servers without HTTP/2, curl makes at most 6 connections per host over which it distributes the queued downloads.

On Windows and MacOS you can switch the active TLS backend by setting an environment variable CURL_SSL_BACKEND in your ⁠~/.Renviron⁠ file. On Windows you can switch between SecureChannel (default) and OpenSSL where only the latter supports HTTP/2. On MacOS you can use either SecureTransport or LibreSSL, the default varies by MacOS version.

Value

The function returns a data frame with one row for each downloaded file and the following columns:

  • success if the HTTP request was successfully performed, regardless of the response status code. This is FALSE in case of a network error, or in case you tried to resume from a server that did not support this. A value of NA means the download was interrupted while in progress.

  • status_code the HTTP status code from the request. A successful download is usually 200 for full requests or 206 for resumed requests. Anything else could indicate that the downloaded file contains an error page instead of the requested content.

  • resumefrom the file size before the request, in case a download was resumed.

  • url final url (after redirects) of the request.

  • destfile downloaded file on disk.

  • error if success == FALSE this column contains an error message.

  • type the Content-Type response header value.

  • modified the Last-Modified response header value.

  • time total elapsed download time for this file in seconds.

  • headers vector with http response headers for the request.

Examples

## Not run: 
# Example: some large files
urls <- sprintf(
  "https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-%02d.parquet", 1:12)
res <- multi_download(urls, resume = TRUE) # You can interrupt (ESC) and resume

# Example: revdep checker
# Download all reverse dependencies for the 'curl' package from CRAN:
pkg <- 'curl'
mirror <- 'https://cloud.r-project.org'
db <- available.packages(repos = mirror)
packages <- c(pkg, tools::package_dependencies(pkg, db = db, reverse = TRUE)[[pkg]])
versions <- db[packages,'Version']
urls <- sprintf("%s/src/contrib/%s_%s.tar.gz", mirror, packages,  versions)
res <- multi_download(urls)
all.equal(unname(tools::md5sum(res$destfile)), unname(db[packages, 'MD5sum']))
# And then you could use e.g.: tools:::check_packages_in_dir()

# Example: URL checker
pkg_url_checker <- function(dir){
  db <- tools:::url_db_from_package_sources(dir)
  res <- multi_download(db$URL, rep('/dev/null', nrow(db)), nobody=TRUE)
  db$OK <- res$status_code == 200
  db
}

# Use a local package source directory
pkg_url_checker(".")


## End(Not run)

curl documentation built on Sept. 30, 2024, 9:35 a.m.