cas_extract_links: Extract direct links to individual content pages from index...
In giocomai/castarter: Content Analysis Starter Toolkit

cas_extract_links

R Documentation

Extract direct links to individual content pages from index pages

Description

Extract direct links to individual content pages from index pages

Usage

cas_extract_links(
  id = NULL,
  batch = "latest",
  domain = NULL,
  index = TRUE,
  index_group = NULL,
  output_index = FALSE,
  output_index_group = NULL,
  include_when = NULL,
  exclude_when = NULL,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  custom_xpath = NULL,
  custom_css = NULL,
  match = NULL,
  min_length = NULL,
  max_length = NULL,
  attribute_type = "href",
  append_string = NULL,
  remove_string = NULL,
  write_to_db = FALSE,
  file_format = "html",
  keep_only_within_domain = TRUE,
  sample = FALSE,
  check_previous = TRUE,
  check_again = FALSE,
  encoding = "UTF-8",
  reverse_order = FALSE,
  db_connection = NULL,
  disconnect_db = TRUE,
  ...
)

Arguments

`id`	Defaults to NULL. If provided, it should be a vector of integers. Only html files corresponding to the given ids will be processed.
`domain`	Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this argument can be ignored.
`output_index`	Defaults to FALSE. If FALSE, new links are added to the contents table. If TRUE, the links extracted will be stored again as index, using `output_index_group` as `index_group`.
`output_index_group`	Defaults to NULL. Relevant only when `output_index` is set to TRUE. Used to store new index urls in the database with reference to the appropriate group.
`include_when`	Part of the URL found only in links to individual articles to be downloaded. If more than one string is provided, all links that contain any of the given strings are included.
`exclude_when`	If a URL includes this string, it is excluded from the output. One or more strings may be provided.
`container`	Defaults to NULL. Type of html container from which links are to be extracted, such as "div", "ul", and others. Either `container_class` or `container_id` must also be provided.
`container_class`	Defaults to NULL. If provided, `container` must also be given (and `container_id` must be NULL). Only links found inside the provided combination of container/class will be extracted.
`container_id`	Defaults to NULL. If provided, `container` must also be given (and `container_class` must be NULL). Only links found inside the provided combination of container/id will be extracted.
`custom_xpath`	Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
`match`	Defaults to NULL. Used when extracting from json files. Name of the property from which the URL is to be extracted. N.B. Only partly implemented; please report issues along with the specific example where they emerged.
`min_length`	If a link is shorter than the number of characters given in min_length, it is excluded from the output.
`max_length`	If a link is longer than the number of characters given in max_length, it is excluded from the output.
`attribute_type`	Defaults to "href". Type of attribute to extract from links.
`append_string`	If provided, appends the given string to each extracted link. Typically used to create links for print or mobile versions of the extracted page.
`remove_string`	If provided, removes the given string (or strings) from links.
`write_to_db`	Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source for each link.
`keep_only_within_domain`	Logical, defaults to TRUE. If TRUE, and domain given, links to external websites are dropped.
`check_previous`	Defaults to TRUE. If TRUE, checks whether newly found links are already stored in the database and, if they are, discards them. If FALSE, and `write_to_db` is also set to FALSE, it does not check for previously stored links.
`check_again`	Defaults to FALSE. If FALSE, files from which at least one link has been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database.
`reverse_order`	Logical, defaults to FALSE. If TRUE, index files are processed in reverse order of `id` and `batch`, which may give a more meaningful order to content ids. The difference is ultimately cosmetic, and has no substantive impact either way.
`db_connection`	Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
`disconnect_db`	Defaults to TRUE. If FALSE, leaves the connection to database open.
`...`	Passed to `cas_get_db_file()`.
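
The way `include_when`, `exclude_when`, `min_length` and `max_length` interact, as described above, can be sketched in base R. This is an illustration only, not castarter's internal code: the helper names `matches_any()` and `filter_links()` are hypothetical, and the sketch assumes literal substring matching, whereas the package may well treat these strings as regular expressions.

```r
# Illustrative sketch only: NOT castarter's implementation.
# Assumption: patterns are matched as literal substrings (fixed = TRUE).

# TRUE for each link that contains at least one of `patterns`
matches_any <- function(links, patterns) {
  Reduce(
    `|`,
    lapply(patterns, function(p) grepl(p, links, fixed = TRUE)),
    rep(FALSE, length(links))
  )
}

# Hypothetical helper mirroring the documented filtering arguments
filter_links <- function(links,
                         include_when = NULL,
                         exclude_when = NULL,
                         min_length = NULL,
                         max_length = NULL) {
  keep <- rep(TRUE, length(links))
  if (!is.null(include_when)) {
    # keep links containing any of the given strings
    keep <- keep & matches_any(links, include_when)
  }
  if (!is.null(exclude_when)) {
    # drop links containing any of the given strings
    keep <- keep & !matches_any(links, exclude_when)
  }
  if (!is.null(min_length)) {
    # links shorter than min_length are excluded
    keep <- keep & nchar(links) >= min_length
  }
  if (!is.null(max_length)) {
    # links longer than max_length are excluded
    keep <- keep & nchar(links) <= max_length
  }
  links[keep]
}

# Placeholder URLs, used only to exercise the sketch
example_links <- c(
  "https://example.com/news/2024/story-1",
  "https://example.com/news/archive",
  "https://example.com/about",
  "https://example.com/opinion/2024/column-7"
)

filtered <- filter_links(
  example_links,
  include_when = c("/news/", "/opinion/"),
  exclude_when = "/archive"
)
print(filtered)
```

Here the "about" page is dropped because it matches neither `include_when` string, and the archive page is dropped by `exclude_when` even though it matches "/news/".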

Value

A data frame.

Examples

## Not run: 
links <- cas_extract_links(domain = "http://www.example.com/")

## End(Not run)
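
A slightly fuller call might combine the filtering and container arguments documented above. The sketch below only uses arguments listed in Usage, but every value in it (the domain, the URL fragments, the container class) is a placeholder rather than something taken from a real website, and the wrapper name `extract_article_links()` is hypothetical. It presumably needs a castarter project with index pages already downloaded, so the call is wrapped in a function and not executed here.

```r
# Hypothetical values throughout: domain, URL fragments and container class
# are placeholders chosen for illustration.
extract_article_links <- function() {
  castarter::cas_extract_links(
    domain = "http://www.example.com/",
    include_when = c("/news/", "/opinion/"), # keep links containing either string
    exclude_when = "/archive",               # drop links containing this string
    container = "div",                       # look only inside <div class="article-list">
    container_class = "article-list",
    write_to_db = FALSE                      # return the links without storing them
  )
}
```

Calling `extract_article_links()` inside a configured project should return a data frame of links; setting `write_to_db = TRUE` would additionally store them in the database, as described under Arguments.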

giocomai/castarter documentation built on June 12, 2025, 8:49 p.m.
