knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

One of the first issues to appear when starting a text mining or web scraping project is how to manage files and folders. castarter defaults to an opinionated folder structure that should work for most projects. It also facilitates downloading files (skipping files that have already been downloaded) and ensures a consistent and unique match between each downloaded html file, its source url, and the data extracted from it. Finally, it facilitates archiving and backing up downloaded files and scripts.

Getting started

In this vignette, I will outline some of the basic steps that can be used to extract and process the contents of press releases from a number of institutions of the European Union. When text mining or scraping, it is common to quickly gather many thousands of files, and keeping them in good order is fundamental, particularly in the long term.

A preliminary suggestion: depending on how you usually work and keep your files backed up, it may make sense to keep your scripts in a folder that is live-synced (e.g. with services such as Dropbox, Nextcloud, or Google Drive). It rarely makes sense, however, to live-sync tens or hundreds of thousands of files as you proceed with your scraping. You may want to keep this in mind as you set the base_folder with cas_set_options(). For the sake of simplicity, I will use a temporary folder here, but there are no drawbacks to keeping scripts and downloaded files in different locations.

castarter stores details about the download process in a database. By default, this is a local DuckDB database kept in the same folder as the website files, but it can be stored in a different folder, and alternative database backends such as RSQLite or MySQL can also be used.
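Since the default backend is a plain DuckDB file, it can also be inspected directly with standard database tools. What follows is a minimal sketch, not part of castarter's own interface; the file path is a placeholder, as castarter manages the actual location and name of the database:

con <- DBI::dbConnect(
  duckdb::duckdb(),
  # placeholder path: the actual file is created and managed by castarter
  dbdir = "path/to/website_folder/castarter.duckdb"
)
DBI::dbListTables(con) # tables tracking index urls, downloads, and contents
DBI::dbDisconnect(con, shutdown = TRUE)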

Assuming that my project on the European Union involves text mining the websites of the European Parliament, the European Council, and the European External Action Service (EEAS), the folder structure may look something like this:

library("castarter")
cas_set_options(
  base_folder = fs::path(
    fs::path_temp(),
    "castarter_data"
  ),
  project = "European Union",
  website = "EEAS"
)
fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "European Parliament"
))

fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "European Council"
))

fs::dir_create(path = fs::path(
  cas_get_options()$base_folder,
  cas_get_options()$project,
  "EEAS"
))

fs::dir_tree(cas_get_options()$base_folder)

In brief, castarter_data is the base folder where I can store all of my text mining projects. European Union is the name of the project, while the others are the names of the specific websites I will source. Folders will be created automatically as needed when you start downloading files.
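Based on the folders created above, the tree printed by fs::dir_tree() should look along these lines:

castarter_data
└── European Union
    ├── EEAS
    ├── European Council
    └── European Parliament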

Downloading index files

In text mining, a common scenario involves first downloading web pages that contain lists of urls to the actual posts we are interested in. In the case of the European Commission, these would probably be the pages in the "news" section. By clicking on the numbers at the bottom of the page, we get direct links to the subsequent pages listing all posts.

These URLs look something like this:

tibble::tibble(index_urls = paste0("https://ec.europa.eu/commission/presscorner/home/en?pagenumber=", 1:10)) %>%
  knitr::kable()
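Rather than pasting such urls together manually, the same sequence can be generated with cas_build_urls(), introduced below for EEAS. A sketch, with an arbitrary index_group label:

cas_build_urls(
  url = "https://ec.europa.eu/commission/presscorner/home/en?pagenumber=",
  start_page = 1,
  end_page = 10,
  index_group = "news"
)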

Sometimes such urls can be derived from the archive section, as is the case, for example, for EEAS:

index_df <- cas_build_urls(
  url = "https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=",
  start_page = 0,
  end_page = 3,
  index_group = "Statements"
)

index_df %>%
  knitr::kable()

All information about the download process is typically stored in a local database, for consistency and future reference.

cas_write_db_index(urls = index_df)

cas_read_db_index()
cas_get_base_path(create_folder_if_missing = TRUE)
# cas_get_files_to_download(index = TRUE)

Download files

cas_download() downloads all files recorded in the local database that have not yet been downloaded, skipping those already on disk, and logs the outcome of each download for future reference. Setting index = TRUE downloads the index pages defined above.

cas_download(index = TRUE, create_folder_if_missing = TRUE)

Check how the download is going

# read the download log for index pages from the local database
download_status_df <- cas_read_db_download(index = TRUE)

download_status_df
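For a quick overview, the returned data frame can be summarised as usual. A hedged sketch: the status column name is an assumption about the download log, not a documented guarantee:

# count downloads by http response status (`status` is an assumed column name)
download_status_df %>%
  dplyr::count(status)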

Extract links from index files

# extract links to individual posts from the downloaded index pages;
# on the EEAS website, each link is found in an h5 element of class "card-title"
cas_extract_links(
  container = "h5",
  container_class = "card-title",
  domain = "https://www.eeas.europa.eu",
  write_to_db = TRUE
)

# check the urls that have been identified and those still to be downloaded
cas_read_db_contents_id()
cas_get_files_to_download()

# download a sample of five pages first, to test the extractors
# before committing to a full download
cas_download(sample = 5)
cas_read_db_download()
# define one extractor function for each piece of information to be collected:
# each function receives the parsed html document and returns the extracted value
extractors_l <- list(
  title = \(x) cas_extract_html(
    html_document = x,
    container = "h1",
    container_class = "node__title"
  ) %>%
    stringr::str_remove_all(pattern = stringr::fixed("\n")) %>%
    stringr::str_squish(),
  date = \(x) cas_extract_html(
    html_document = x,
    container = "div",
    container_class = "node__meta"
  ) %>%
    stringr::str_extract("[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+") %>%
    lubridate::dmy()
)

# apply the extractors to all downloaded pages and store the results in the database
cas_extract(extractors = extractors_l)
cas_read_db_contents_data()
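At this point, the extracted data can be read back for analysis. A hedged usage sketch: the date and title columns below follow from the extractor names defined above, and the year filter is arbitrary:

corpus_df <- cas_read_db_contents_data() %>%
  dplyr::collect() # collect into a local tibble, in case a lazy table is returned

corpus_df %>%
  dplyr::filter(lubridate::year(date) >= 2023) %>%
  dplyr::select(date, title)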

