library(learnr)
library(learn2scrape)
knitr::opts_chunk$set(
  # code chunk options
  echo = TRUE
  , eval = TRUE
  , warning = FALSE
  , message = FALSE
  , cache = FALSE
  , exercise = TRUE
  # , exercise.eval = FALSE
  , exercise.completion = TRUE
  , fig.align = "center"
  , fig.height = 4
  , fig.width = 5.5
)

Introduction

Downloading files in R is pretty easy. In fact, it is one of the tasks for which we do not even need any external packages: it can be accomplished entirely with base R code.

Downloading a file in R

The central function for us is download.file(). If you have never used it, check its documentation (e.g., by running ?download.file in your R console).

Arguments

To use download.file(), you need to specify:

  • the URL of the file you want to download (argument url)
  • the destination, i.e., the path on your local system where the downloaded file should be written (argument destfile)

The download method works a bit differently on each operating system, and it's easiest to just use method "auto". If this does not work for you, check the documentation for alternatives. The same holds for the download mode.
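For illustration, here is a minimal sketch of a fully explicit call (the URL is a placeholder, not a real file). Note that mode "wb" (binary write) is generally a safe choice for PDFs, especially on Windows:

# hypothetical example: download a PDF with explicit method and mode
download.file(
  url = "https://example.com/report.pdf",  # placeholder URL
  destfile = "report.pdf",                 # written to the current working directory
  method = "auto",                         # let R pick a suitable download method
  mode = "wb"                              # binary mode, safe for PDFs (esp. on Windows)
)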

An example

To practice, we download the APSA Diversity and Inclusion Report.

Let's first lay out the steps we need to complete to download this PDF to your local system:

  1. specify the URL of the file we want to download
  2. specify the path on your local system where the downloaded file should be written
  3. execute the file download

Try it yourself! Complete the following code to download the PDF.

# step 1
url <- "https://www.apsanet.org/Portals/54/diversity%20and%20inclusion%20prgms/DIV%20reports/Diversity%20Report%20Executive%20-%20Final%20Draft%20-%20Web%20version.pdf"

# step 2 
# ToDo: define the file path/name where to download the PDF to
# (make sure to end it on '.pdf')
file_path <- ...

# step 3
download.file(url, file_path)

# verify
file.exists(file_path)

Hint: In step 2, you can extract the original PDF file name from the source URL using the basename() function.
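For example (with a placeholder URL), basename() keeps only the part of a path after the last '/':

basename("https://example.com/files/report.pdf")
#> [1] "report.pdf"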

Note: Because this tutorial runs in a temporary working directory and we simply passed a file name as the destination, the PDF is downloaded into that temporary directory. Hence, you won't see it in, e.g., your Desktop folder. You can use file.path() to construct a full file path instead of passing just a file name as the download destination.
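If you want to check where the file actually ended up, you can inspect the tutorial's working directory:

# print the current working directory and list any PDFs in it
getwd()
list.files(pattern = "\\.pdf$")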

Solution

In step 2, we first extract the PDF file name from the source URL. To do so, we use the function basename(), which parses the last part (the file or directory name) from a file path. (Since a URL is structured like a path, this works just fine.)

In addition, we use the function URLdecode() to "clean" the URL-encoded file name parsed from the URL. This makes it human-readable. (Hint: if you want to avoid whitespace in your file names, you can use the gsub() function, as sketched below.)
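A minimal sketch of this optional cleaning step (assuming file_name holds the decoded file name):

# replace spaces with underscores to avoid whitespace in the file name
file_name <- gsub(" ", "_", file_name, fixed = TRUE)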

Finally, we construct the file path that determines where the file will be downloaded to on your local system. The function file.path() does this in a way that is consistent and reproducible across operating systems. In our example, we write to the 'Desktop' folder of the current user's home directory (see Sys.info()["user"]). ('~' is a shortcut for the current user's home directory.)

# 1. specify the URL where the file is located
source_url <- "https://www.apsanet.org/Portals/54/diversity%20and%20inclusion%20prgms/DIV%20reports/Diversity%20Report%20Executive%20-%20Final%20Draft%20-%20Web%20version.pdf"

# 2.a) extract PDF file name 
file_name <- basename(source_url) 
file_name <- URLdecode(file_name)

# 2.b) specify the file path (`fp`) where to download the file to
fp <- file.path("~", "Desktop", file_name)

# 3. download
download.file(url = source_url, destfile = fp)

# check:
file.exists(fp)

# clean up
file.remove(fp)

Downloading multiple files

But what if you want to download lots of files? Say you want to download the Congressional Record of the ongoing session in its beautiful original layout as a PDF.

Steps to complete

Let's first think about the individual steps we need to complete to achieve this:

  1. identify the CSS selector/xpath of web elements providing links to the PDFs (e.g., using SelectorGadget)
  2. collect all these links (using rvest functions)
  3. for each PDF
    • create a file name from the source URL
    • specify a target file path
  4. loop over PDF URLs to download them

Hands on!

Try it yourself! Implement steps 1-4 below.

Hint: In step 3, you could define a custom function that accepts a PDF URL as its single parameter. You could then use this function in step 4 to iterate over the URLs, e.g., along the lines of the skeleton below.
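A possible skeleton (the function name is a placeholder; the body is left for you to fill in):

# hypothetical skeleton of a custom download function
download_pdf <- function(pdf_url) {
  # ToDo: derive a file name from pdf_url (e.g., with basename()),
  #       construct a destination path, and call download.file()
}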

Caution: Don't download all files. We won't use them. Just cut the vector of URLs to the first 5 or 6 elements.


Example Solution

url <- "https://www.congress.gov/congressional-record/116th-congress/browse-by-date"
page <- read_html(url)

# 1. collect PDF URLs: select all links (<a> tags) in table cells whose
#    href contains '.pdf', extract the (relative) hrefs, and prepend the domain
urls <- page %>% 
    html_elements(xpath = "//td/a[contains(@href, '.pdf')]") %>% 
    html_attr("href") %>% 
    paste0("https://www.congress.gov", .)

# 2. define function that downloads and saves PDF
#' @param url character specifying URL of PDF to be downloaded
#' @param .dir character specifying path of directory on local system to download PDF to
download_congress_record <- function(url, .dir) {
    fn <- basename(url)
    fp <- file.path(.dir, fn)
    download.file(url = url, destfile = fp, quiet = TRUE)
}

# 3. iterate over URLs to download each PDF

# create temporary directory
target_dir <- tempdir(check = TRUE)
for (url in urls[1:3]) {
  download_congress_record(url, .dir = target_dir)
}

# check 
(pdfs <- list.files(target_dir, pattern = "CREC"))

# clean up (remove all downloaded PDFs)
lapply(file.path(target_dir, pdfs), file.remove)


