The goal of heapsofpapers
is to make it easy to respectfully get,
well, heaps of papers (and CSVs, and websites, and similar). For
instance, you may want to understand the state of open code and open
data across a bunch of different pre-print repositories, e.g. Collins
and Alexander, 2021, and in that case you need a way to quickly download
thousands of PDFs.
Essentially, the main function in the package, heapsofpapers::get_and_save(), is a wrapper around a for loop and utils::download.file(), but there are a bunch of small things that make it handy to use instead of rolling your own each time. For instance, the package automatically slows down your requests, lets you know where it is up to, and adjusts for papers that you’ve already downloaded.
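That core idea can be sketched in base R as follows. This is an illustrative sketch only, not the package's actual implementation, and the function name download_heap is made up for this example:

```r
# Illustrative sketch: loop over links, download each file, skip ones
# that already exist, and pause between requests.
# (Not the package's actual implementation; `download_heap` is hypothetical.)
download_heap <- function(links, save_names, dir = "heaps_of", sleep = 5) {
  if (!dir.exists(dir)) dir.create(dir, recursive = TRUE)
  for (i in seq_along(links)) {
    destination <- file.path(dir, save_names[i])
    if (file.exists(destination)) next  # adjust for already-downloaded files
    utils::download.file(links[i], destfile = destination, quiet = TRUE)
    message("Downloaded ", i, " of ", length(links))
    Sys.sleep(sleep)  # slow down requests to be respectful to the server
  }
}
```

The pause between requests is the key difference from a bare loop: it keeps the load on the server low, which is what "respectfully" means here.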
You can install the released version of heapsofpapers from CRAN:
install.packages("heapsofpapers")
or the development version from GitHub with:
devtools::install_github("RohanAlexander/heapsofpapers")
Here is an example of getting two papers from SocArXiv, using the main function heapsofpapers::get_and_save():
library(heapsofpapers)
two_pdfs <-
  tibble::tibble(
    locations_are = c(
      "https://osf.io/preprints/socarxiv/z4qg9/download",
      "https://osf.io/preprints/socarxiv/a29h8/download"
    ),
    save_here = c(
      "competing_effects_on_the_average_age_of_infant_death.pdf",
      "cesr_an_r_package_for_the_canadian_election_study.pdf"
    )
  )
heapsofpapers::get_and_save(
data = two_pdfs,
links = "locations_are",
save_names = "save_here"
)
By default, the papers are downloaded into a folder called ‘heaps_of’. You could also specify the directory, for instance, if you would prefer a folder called ‘inputs’. Regardless, if the folder doesn’t exist then you’ll be asked whether you want to create it.
heapsofpapers::get_and_save(
data = two_pdfs,
links = "locations_are",
save_names = "save_here",
dir = "inputs"
)
Let’s say that you had already downloaded some PDFs, but weren’t sure and didn’t want to download them again. You could use heapsofpapers::check_for_existence() to check.
heapsofpapers::check_for_existence(
  data = two_pdfs,
  save_names = "save_here"
)
If you already have some of the files, then heapsofpapers::get_and_save() allows you to skip them, rather than download them again, by specifying dupe_strategy = "ignore".
heapsofpapers::get_and_save(
data = two_pdfs,
links = "locations_are",
save_names = "save_here",
dupe_strategy = "ignore"
)
There are many packages that are designed for scraping websites, for instance polite and rvest. Those packages are more general and useful in a wider range of scenarios than ours. Ours is focused on the specific use case where you have a large list of items that you need to download.
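If you first need to assemble that list, one approach is to gather the links with rvest and then hand them to heapsofpapers::get_and_save(). This is a sketch: the listing URL "https://example.com/papers" and the CSS selector "a.pdf-link" are hypothetical placeholders, not a real index page.

```r
library(rvest)

# Hypothetical listing page and selector -- replace with your own.
page <- read_html("https://example.com/papers")
links <- html_attr(html_elements(page, "a.pdf-link"), "href")

# Build the tibble that get_and_save() expects: one column of links,
# one column of file names to save them under.
papers <- tibble::tibble(
  locations_are = links,
  save_here = basename(links)
)

heapsofpapers::get_and_save(
  data = papers,
  links = "locations_are",
  save_names = "save_here"
)
```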
Please cite the package if you use it: Alexander, Rohan, and A Mahfouz, 2021, ‘heapsofpapers: Easily get heaps of papers’, 24 April, https://github.com/RohanAlexander/heapsofpapers.
We thank Alex Luscombe, Amy Farrow, Edward Morgan, Gregor Seyer, Monica Alexander, Paul A. Hodgetts, Sharla Gelfand, Thomas William Rosenthal, and Tom Cardoso for their help.