cas_update: Update corpus
In giocomai/castarter: Content Analysis Starter Toolkit

cas_update

R Documentation

Update corpus

Description

Currently, updating is supported only when re-downloading index URLs is expected to surface new articles. The function starts from the first URL of each index group and keeps downloading further index pages as long as each page contains new links. When a page contains no new links, it stops downloading and moves on to the next index group.
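The stopping rule described above can be illustrated with a minimal, self-contained sketch. This is not the package's implementation: `simulated_index`, `get_links_from_page()` and `known_links` are hypothetical stand-ins for downloading an index page, extracting its links, and looking up links already stored locally.

```r
# Hypothetical stand-in for one index group: three index pages, newest first.
simulated_index <- list(
  c("/news/6", "/news/5"),
  c("/news/4", "/news/3"),
  c("/news/2", "/news/1")
)

# Hypothetical helper: pretend to download an index page and extract its links.
get_links_from_page <- function(page_number) {
  simulated_index[[page_number]]
}

# Assumption: links already stored locally from a previous run.
known_links <- c("/news/1", "/news/2", "/news/3", "/news/4")

new_links <- character(0)
for (page_number in seq_along(simulated_index)) {
  page_links <- get_links_from_page(page_number)
  fresh_links <- setdiff(page_links, c(known_links, new_links))
  if (length(fresh_links) == 0) {
    # No new link on this page: stop here and move to the next index group.
    break
  }
  new_links <- c(new_links, fresh_links)
}

new_links
```

In this sketch the loop stops at the second page, since both of its links are already known, and the third page is never requested.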

Usage

cas_update(
  extract_links_partial,
  extractors,
  post_processing = NULL,
  wait = 3,
  user_agent = NULL,
  download_method = c("default", "chromote"),
  ...
)

Arguments

`extract_links_partial`	A partial function, typically created with `purrr::partial(.f = cas_extract_links)`, followed by the parameters originally used by `cas_extract_links()`. See examples.
`extractors`	A named list of functions. See examples for details.
`post_processing`	Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns.
`wait`	Defaults to 3. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or set to 0 when this is not an issue.
`user_agent`	Defaults to NULL. If given, passed to download method.
`download_method`	Defines how the download is performed. Currently, only "default" and "chromote" are supported. Chromote is more resource-intensive but, because it processes JavaScript, it may help when downloading from websites where other methods fail.
`...`	Passed to `cas_get_db_file()`.
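
As an illustration of the `post_processing` argument, the following is a minimal sketch of a function that takes a one-row data frame and returns it with modified and additional columns. The function name and the column names `title` and `text` are assumptions made for the example; the actual columns depend on the names used in `extractors`.

```r
# Hypothetical post-processing function; the column names are assumptions.
post_processing_example <- function(df) {
  df$title <- trimws(df$title)        # modify an existing column
  df$text_length <- nchar(df$text)    # add a new column
  df
}

# A made-up single row, standing in for one row of the dataset.
example_row <- data.frame(
  title = "  Example title ",
  text = "Some extracted text.",
  stringsAsFactors = FALSE
)

processed_row <- post_processing_example(example_row)
processed_row
```

A function of this shape could then be passed as `post_processing = post_processing_example`.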

Examples

# Example of extract_links_partial: pre-fill the arguments
# originally passed to cas_extract_links()
library("castarter")

extract_links_partial <- purrr::partial(
  .f = cas_extract_links,
  reverse_order = TRUE,
  container = "div",
  container_class = "hentry h-entry hentry_event",
  exclude_when = c("/photos", "/videos"),
  domain = "http://en.kremlin.ru/"
)

giocomai/castarter documentation built on June 12, 2025, 8:49 p.m.