```r
knitr::opts_chunk$set(eval = FALSE)
```

```r
library(piggyback)
library(magrittr)
```
`piggyback` grew out of the needs of students both in my classroom and in my research group, who frequently need to work with data files somewhat larger than one can conveniently manage by committing directly to GitHub. Because we frequently want to share and run code that depends on >50 MB data files on each of our own machines, on continuous integration, and on larger computational servers, data sharing quickly becomes a bottleneck.
GitHub allows repositories to attach files of up to 2 GB each to releases as a way to distribute large files associated with the project source code. There is no limit on the number of files or bandwidth to deliver them.
No authentication is required to download data from public GitHub repositories using `piggyback`. Nevertheless, we recommend setting a token when possible to avoid rate limits. To upload data to any repository, or to download data from private repositories, you will need to authenticate first.

`piggyback` uses the same GitHub Personal Access Token (PAT) that devtools, usethis, and friends use (`gh::gh_token()`). The current best practice for managing your GitHub credentials is detailed in this usethis vignette.
You can also add the token as an environment variable, which may be useful in situations where you use `piggyback` non-interactively (e.g. in automated scripts). Here are the relevant steps:
- `usethis::use_git_ignore(".Renviron")` to update your .gitignore; this prevents accidentally committing your token to GitHub.
- `usethis::edit_r_environ("project")` to open the Renviron file, and then add your token, e.g. `GITHUB_PAT=ghp_a1b2c3d4e5f6g7`.
- `Sys.setenv(GITHUB_PAT = "ghp_a1b2c3d4e5f6g7")` in your console for ad hoc usage. Avoid adding this line to your R scripts; remember, the goal here is to avoid writing your private token in any file that might be shared, even privately.

Download a file from a release:
```r
pb_download(
  file = "iris2.tsv.gz",
  dest = tempdir(),
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1"
)
#> ℹ Downloading "iris2.tsv.gz"...
#> |======================================================| 100%

fs::dir_tree(tempdir())
#> /tmp/RtmpWxJSZj
#> └── iris2.tsv.gz
```
Some default behaviors to know about:
- The `repo` argument in most piggyback functions defaults to detecting the relevant GitHub repo based on your current working directory's git config, so in many cases you can omit it.
- The `tag` argument in most functions defaults to "latest", which typically refers to the most recently created release of the repository, unless there is a release specifically named "latest" or you have marked a different release as "latest" via the GitHub UI.
- The `dest` argument defaults to your current working directory (`"."`). We use `tempdir()` in these examples to comply with CRAN policies.
- The `file` argument in `pb_download()` defaults to NULL, which downloads all files attached to the given release:

```r
pb_download(
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1",
  dest = tempdir()
)
#> ℹ Downloading "diamonds.tsv.gz"...
#> |======================================================| 100%
#> ℹ Downloading "iris.tsv.gz"...
#> |======================================================| 100%
#> ℹ Downloading "iris.tsv.xz"...
#> |======================================================| 100%

fs::dir_tree(tempdir())
#> /tmp/RtmpWxJSZj
#> ├── diamonds.tsv.gz
#> ├── iris.tsv.gz
#> ├── iris.tsv.xz
#> └── iris2.tsv.gz
```
- The `use_timestamps` argument defaults to TRUE; notice that above, `iris2.tsv.gz` was not downloaded again. If `use_timestamps` is TRUE, `pb_download()` will compare the local file timestamp against the GitHub file timestamp, and only download the file if it has changed.

`pb_download()` also includes arguments to control the progress bar and to exclude particular files from the download, as shown in the sketch below.
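A minimal sketch of both, assuming the `ignore` and `show_progress` argument names as documented in current versions of `pb_download()`:

```r
## download everything in the release except diamonds.tsv.gz, without a progress bar
pb_download(
  repo = "cboettig/piggyback-tests",
  tag = "v0.0.1",
  dest = tempdir(),
  ignore = "diamonds.tsv.gz", # skip this asset
  show_progress = FALSE       # suppress the progress bar
)
```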
Sometimes it is preferable to have a URL from which the data can be read directly. These URLs can then be passed to another R function, which can be more elegant and performant than first downloading the files locally. Enter `pb_download_url()`:
```r
pb_download_url(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
#> [1] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/diamonds.tsv.gz"
#> [2] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.gz"
#> [3] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris.tsv.xz"
#> [4] "https://github.com/cboettig/piggyback-tests/releases/download/v0.0.1/iris2.tsv.gz"
```
By default, this function returns the same download URL that you would get by visiting the release page, right-clicking on the file, and copying the link (aka the "browser_download_url"). This URL is served by GitHub's web servers rather than its API servers, and is therefore less aggressively rate-limited.
However, this URL is not accessible for private repositories, since the auth tokens are handled by the GitHub API. You can retrieve the API download URL for private repositories by passing `"api"` to the `url_type` argument:
```r
pb_download_url(repo = "cboettig/piggyback-tests", tag = "v0.0.1", url_type = "api")
#> [1] "https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/44261315"
#> [2] "https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/41841778"
#> [3] "https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/18538636"
#> [4] "https://api.github.com/repos/cboettig/piggyback-tests/releases/assets/8990141"
```
`pb_download_url()` otherwise shares similar default behaviors with `pb_download()` for the `file`, `repo`, and `tag` arguments.
`piggyback` supports several general patterns for reading data into R, with increasing degrees of performance/efficiency (and complexity):

- `pb_download()` files to disk and then read them with a function that reads from disk into memory, or
- `pb_download_url()` a set of URLs and then pass those URLs to a function that retrieves them directly into memory.

We recommend the latter approach in cases where performance and efficiency matter, and have some vignettes with examples:

- cloud native workflows
- disk native workflows
`pb_read()` is a wrapper on the first pattern: it downloads the file to a temporary file, reads it into memory, and then deletes the temporary file. It works for both public and private repositories, handling authentication under the hood:
pb_read("mtcars.rds", repo = "tanho63/piggyback-private") #> # A data.frame: 32 × 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> # ℹ 27 more rows #> # ℹ 1 more variable: carb <dbl> pb_read("mtcars.parquet", repo = "tanho63/piggyback-private") #> # A data.frame: 32 × 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> # ℹ 27 more rows #> # ℹ 1 more variable: carb <dbl>
By default, `pb_read()` uses the following `read_function` for the corresponding file extensions:

- ".csv", ".csv.gz", ".csv.xz" are read with `utils::read.csv()`
- ".tsv", ".tsv.gz", ".tsv.xz" are read with `utils::read.delim()`
- ".rds" is read with `readRDS()`
- ".json" is read with `jsonlite::fromJSON()`
- ".parquet" is read with `arrow::read_parquet()`
- ".txt" is read with `readLines()`
If a file extension is not on this list, `pb_read()` will raise an error and ask you to provide a `read_function`; you can also use this parameter to override the default `read_function` yourself:
pb_read( file = "play_by_play_2023.qs", repo = "nflverse/nflverse-data", tag = "pbp", read_function = qs::qread ) #> # A tibble: 42,251 × 372 #> play_id game_id old_game_id home_team away_team season_type week posteam #> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr> #> 1 1 2023_01_ARI_W… 2023091007 WAS ARI REG 1 NA #> 2 39 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 3 55 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 4 77 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 5 102 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> # ℹ 42,246 more rows #> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>, #> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>, #> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>, game_half <chr>, #> # quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>, #> # goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, #> # desc <chr>, play_type <chr>, yards_gained <dbl>, shotgun <dbl>, …
Any `read_function` can be provided so long as it accepts the file name as its first argument, and you can pass any additional parameters via `...`:
pb_read( file = "play_by_play_2023.csv", n_max = 10, repo = "nflverse/nflverse-data", tag = "pbp", read_function = readr::read_csv ) #> # A tibble: 10 × 372 #> play_id game_id old_game_id home_team away_team season_type week posteam #> <dbl> <chr> <chr> <chr> <chr> <chr> <int> <chr> #> 1 1 2023_01_ARI_W… 2023091007 WAS ARI REG 1 NA #> 2 39 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 3 55 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 4 77 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> 5 102 2023_01_ARI_W… 2023091007 WAS ARI REG 1 WAS #> # ℹ 5 more rows #> # ℹ 364 more variables: posteam_type <chr>, defteam <chr>, side_of_field <chr>, #> # yardline_100 <dbl>, game_date <chr>, quarter_seconds_remaining <dbl>, #> # half_seconds_remaining <dbl>, game_seconds_remaining <dbl>, game_half <chr>, #> # quarter_end <dbl>, drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>, #> # goal_to_go <dbl>, time <chr>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, #> # desc <chr>, play_type <chr>, yards_gained <dbl>, shotgun <dbl>, …
More efficiently, many read functions accept URLs, including `read.csv()`, `arrow::read_parquet()`, `readr::read_csv()`, `data.table::fread()`, and `jsonlite::fromJSON()`, so reading in one file can be done by passing along the output of `pb_download_url()`:
pb_download_url("mtcars.csv", repo = "tanho63/piggyback-tests", tag = "v0.0.2") %>% read.csv() #> # A data.frame: 32 × 12 #> X mpg cyl disp hp drat wt qsec vs am gear #> <chr> <dbl> <int> <dbl> <int> <dbl> <dbl> <dbl> <int> <int> <int> #> 1 Mazda… 21 6 160 110 3.9 2.62 16.5 0 1 4 #> 2 Mazda… 21 6 160 110 3.9 2.88 17.0 0 1 4 #> 3 Datsu… 22.8 4 108 93 3.85 2.32 18.6 1 1 4 #> 4 Horne… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 #> 5 Horne… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 #> # ℹ 27 more rows #> # ℹ 1 more variable: carb <int> #> # ℹ Use `print(n = ...)` to see more rows
Some functions also accept URLs when the URL is converted into a connection by wrapping it in `url()`, e.g. for `readRDS()`:
pb_url <- pb_download_url("mtcars.rds", repo = "tanho63/piggyback-tests", tag = "v0.0.2") %>% url() readRDS(pb_url) #> # A data.frame: 32 × 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> # ℹ 27 more rows #> # ℹ Use `print(n = ...)` to see more rows close(pb_url)
Note that using `url()` requires that we close the connection after reading from it, or else we will receive warnings about leaving connections open.
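If the manual `close()` is easy to forget, one option is to wrap the read in a small helper that closes the connection on exit. A minimal sketch; `read_rds_url()` here is a hypothetical helper, not part of `piggyback`:

```r
## hypothetical helper: read an .rds URL and always close the connection
read_rds_url <- function(u) {
  con <- url(u)
  on.exit(close(con)) # runs even if readRDS() errors
  readRDS(con)
}

pb_download_url("mtcars.rds", repo = "tanho63/piggyback-tests", tag = "v0.0.2") %>%
  read_rds_url()
```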
This `url()` approach also allows us to pass along authentication for private repos, e.g.
pb_url <- pb_download_url("mtcars.rds", repo = "tanho63/piggyback-private", url_type = "api") %>% url( headers = c( "Accept" = "application/octet-stream", "Authorization" = paste("Bearer", gh::gh_token()) ) ) readRDS(pb_url) #> # A tibble: 32 × 11 #> mpg cyl disp hp drat wt qsec vs am gear carb #> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> #> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 #> 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 #> 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 #> 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 #> 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 #> # ℹ 27 more rows #> # ℹ Use `print(n = ...)` to see more rows close(pb_url)
Note that `arrow` does not accept a `url()` connection at this time, so you should default to `pb_read()` if using private repositories.
`piggyback` uploads data to GitHub releases. If your repository doesn't have a release yet, `piggyback` will prompt you to create one; you can create a release with:
```r
pb_release_create(repo = "cboettig/piggyback-tests", tag = "v0.0.2")
#> ✔ Created new release "v0.0.2".
```
Create new releases to manage multiple versions of a given data file, or to organize sets of files under a common topic. While you can create releases as often as you like, making a new release is not necessary each time you upload a file. If maintaining old versions of the data is not useful, you can stick with a single release and upload all of your data there.
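For example, a sketch of cutting a new release to snapshot a dataset; the `name` and `body` arguments reflect the GitHub release metadata that `pb_release_create()` exposes, but treat the exact argument names as assumptions if your version differs:

```r
## create a fresh release to hold a new snapshot of the data
pb_release_create(
  repo = "cboettig/piggyback-tests",
  tag  = "v0.0.3",                 # hypothetical next tag
  name = "Data snapshot 2024-01",  # display name on GitHub
  body = "Quarterly refresh of the example datasets."
)
```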
Once we have at least one release available, we are ready to upload files. By default, `pb_upload()` will attach data to the latest release.
```r
## We'll need some example data first.
## Pro tip: compress your tabular data to save space & speed upload/downloads
readr::write_tsv(mtcars, "mtcars.tsv.gz")

pb_upload("mtcars.tsv.gz", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.tsv.gz ...
#> |===================================================| 100%
```
Like `pb_download()`, `pb_upload()` will by default overwrite any file of the same name already attached to the release, unless the timestamp of the previously uploaded version is more recent. You can toggle these settings with the `overwrite` parameter.
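For instance, a minimal sketch of keeping the existing asset no matter what (assuming `overwrite` accepts `FALSE`):

```r
## never replace an asset that already exists on the release
pb_upload(
  "mtcars.tsv.gz",
  repo = "cboettig/piggyback-tests",
  overwrite = FALSE
)
```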
`pb_upload()` also accepts a vector of multiple files to upload:
```r
library(magrittr)

## upload a folder of data
## (full.names = TRUE returns paths that are valid from the working directory)
list.files("data", full.names = TRUE) %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")

## upload certain file extensions
## (list.files() takes a single regular expression, not a vector of globs)
list.files(pattern = "\\.(tsv\\.gz|tif|zip)$") %>%
  pb_upload(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
```
`pb_write()` wraps the above process, essentially allowing you to upload directly to a release by providing an object, a file name, and a repo/tag:
```r
pb_write(mtcars, "mtcars.rds", repo = "cboettig/piggyback-tests")
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.rds ...
#> |===================================================| 100%
```
Similar to `pb_read()`, `pb_write()` has some pre-programmed `write_function` defaults for the following file extensions:
- ".csv", ".csv.gz", ".csv.xz" are written with utils::write.csv()
- ".tsv", ".tsv.gz", ".tsv.xz" are written with utils::write.csv(x, filename, sep = '\t')
- ".rds" is written with saveRDS()
- ".json" is written with jsonlite::write_json()
- ".parquet" is written with arrow::write_parquet()
- ".txt" is written with writeLines()
and you can pass custom functions with the `write_function` parameter:
```r
pb_write(
  x = mtcars,
  file = "mtcars.csv.gz",
  repo = "cboettig/piggyback-tests",
  write_function = data.table::fwrite
)
#> ℹ Uploading to latest release: "v0.0.2".
#> ℹ Uploading mtcars.csv.gz ...
#> |===================================================| 100%
```
Delete a file from a release:
```r
pb_delete(file = "mtcars.tsv.gz", repo = "cboettig/piggyback-tests", tag = "v0.0.1")
#> ℹ Deleted "mtcars.tsv.gz" from "v0.0.1" release on "cboettig/piggyback-tests"
```
Note that this is irreversible unless you have a copy of the data elsewhere.
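`pb_delete()` can also remove several assets in one call; a sketch, assuming it vectorizes over `file` the same way `pb_upload()` does:

```r
## remove several assets from the same release at once
pb_delete(
  file = c("mtcars.tsv.gz", "mtcars.rds"),
  repo = "cboettig/piggyback-tests",
  tag  = "v0.0.1"
)
```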
List all files currently piggybacking on a given release. Omit `tag` to see files on all releases.
```r
pb_list(repo = "cboettig/piggyback-tests", tag = "v0.0.1")
#>         file_name   size           timestamp    tag    owner            repo
#> 1 diamonds.tsv.gz 571664 2021-09-07 23:38:31 v0.0.1 cboettig piggyback-tests
#> 2     iris.tsv.gz    846 2021-08-05 20:00:09 v0.0.1 cboettig piggyback-tests
#> 3     iris.tsv.xz    848 2020-03-07 06:18:32 v0.0.1 cboettig piggyback-tests
#> 4    iris2.tsv.gz    846 2018-10-05 17:04:33 v0.0.1 cboettig piggyback-tests
```
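To inventory assets across every release, simply drop the `tag` argument:

```r
## list files attached to all releases of the repository
pb_list(repo = "cboettig/piggyback-tests")
```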
To reduce GitHub API calls, piggyback caches the results of `pb_releases()` and `pb_list()` with a timeout of 10 minutes by default. This avoids repeating identical requests to update its internal record of the repository data (releases, assets, timestamps, etc.) during programmatic use. You can increase or decrease this delay by setting the `piggyback_cache_duration` environment variable in seconds, e.g. `Sys.setenv("piggyback_cache_duration" = 3600)` for a longer cache or `Sys.setenv("piggyback_cache_duration" = 0)` to disable caching, and then restarting R.
GitHub assets attached to a release do not support file paths, and will convert most special characters (`#`, `%`, etc.) to `.` or throw an error (e.g. for file names containing `$`, `@`, or `/`). `piggyback` will default to using the `basename()` of the file only (i.e. it will use only `"mtcars.csv"` if provided a file path like `"data/mtcars.csv"`).
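For example, a sketch of what this means in practice:

```r
## the asset is stored under the base name "mtcars.csv", not "data/mtcars.csv"
pb_upload("data/mtcars.csv", repo = "cboettig/piggyback-tests")
```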
`piggyback` is not intended as a data archiving solution. Importantly, bear in mind that there is nothing special about multiple "versions" in releases, as far as data assets uploaded by `piggyback` are concerned. The data files `piggyback` attaches to a release can be deleted or modified at any time; creating a new release to store data assets is the functional equivalent of creating new directories `v0.1`, `v0.2` to store your data. (GitHub releases are always pinned to a particular `git` tag, so the code/git-managed contents associated with the repo are more immutable, but remember that our data assets merely piggyback on top of the repo.)
Permanent, published data should always be archived in a proper data repository with a DOI, such as zenodo.org. Zenodo can freely archive public research data files up to 50 GB in size, and data is strictly versioned (once released, a DOI always refers to the same version of the data; new releases are given new DOIs). `piggyback` is meant only to lower the friction of working with data during the research process, e.g. by providing data access to collaborators or continuous integration systems, including for private repositories.
GitHub documentation at the time of writing endorses the use of attachments to releases as a solution for distributing large files as part of your project.