cas_extract_html: Facilitates extraction of contents from an html file
In giocomai/castarter: Content Analysis Starter Toolkit

Description

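Facilitates extraction of contents from an HTML document: text, attribute values, or nodes, selected by container type and attributes, or by a custom XPath or CSS selector.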

Usage

cas_extract_html(
  html_document,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  container_name = NULL,
  container_property = NULL,
  container_itemprop = NULL,
  container_instance = NULL,
  attribute = NULL,
  sub_element = NULL,
  no_children = NULL,
  trim = TRUE,
  squish = FALSE,
  no_match = "",
  exclude_css_path = NULL,
  exclude_xpath = NULL,
  custom_xpath = NULL,
  custom_css_path = NULL,
  keep_everything = FALSE,
  extract_text = TRUE,
  as_character = TRUE
)

Arguments

`html_document`	An html document parsed with `xml2::read_html()` or `rvest::read_html()`.
`container`	Defaults to NULL. Type of html container from which contents are to be extracted, such as "div", "ul", and others. Usually combined with one of `container_class`, `container_id`, `container_name`, `container_property`, or `container_itemprop`.
`container_class`	Defaults to NULL. If provided, `container` must also be given (and `container_id` must be NULL). Only text found inside the provided combination of container/class will be extracted.
`container_id`	Defaults to NULL. If provided, `container` must also be given (and `container_class` must be NULL). Only text found inside the provided combination of container/id will be extracted.
`container_name`	Defaults to NULL. Value of the `name` attribute of the container to match, as in the examples, where `container = "meta"` is combined with `container_name = "twitter:title"`.
`container_property`	Defaults to NULL. Value of the `property` attribute of the container to match, as in the examples, where `container = "meta"` is combined with `container_property = "article:published_time"`.
`container_itemprop`	Defaults to NULL. If provided, `container` must also be given (and `container_id` and `container_class` must be NULL, or they will be silently ignored). Only text found inside the provided combination of container/itemprop will be extracted.
`container_instance`	Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, the relevant occurrence is kept. Use with caution, as not all pages always include the same number of elements of the same class/with the same id.
`attribute`	Defaults to NULL. If given, type of attribute to extract. Typically used in combination with container, as in `cas_extract_html(container = "time", attribute = "datetime")`.
`sub_element`	Defaults to NULL. If provided, `container` must also be given. Only text within elements of the given type under the chosen combination of container/class will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div.
`no_children`	Defaults to NULL, in which case all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found directly under the given combination (but not its subelements) will be extracted. Corresponds to the xpath string `/node()[not(self::div)]`.
`trim`	Defaults to TRUE. If TRUE, applies `stringr::str_trim()` to output, removing whitespace from start and end of string.
`squish`	Defaults to FALSE. If TRUE, applies `stringr::str_squish()` to output, removing whitespace from start and end of string, and replacing any run of whitespace inside it (including new lines) with a single space.
`no_match`	Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found.
`exclude_css_path`	Defaults to NULL. To remove script, for example, use `script`, which is transformed to `:not(script)`. May cause issues, use with caution.
`exclude_xpath`	Defaults to NULL. A common pattern when extracting text would be `//script|//iframe|//img|//style`, as it is assumed that these containers (javascript contents, iframes, images, and css blocks) are most likely undesirable when extracting text. Customise as needed. For example, if besides the above you also want to remove a `div` of class `related-articles`, you may use `//script|//iframe|//img|//style|//div[@class='related-articles']`. Be careful when using `exclude_xpath`, as the nodes matching the XPath are removed from the original object passed to `cas_extract_html()`. To be clear, the input object is changed: if, for example, `exclude_xpath` is used once in one of the extractors, these containers won't be available to other extractors.
`custom_xpath`	Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
`custom_css_path`	Defaults to NULL. If given, all other parameters are ignored and the given CSS path is used instead.
`keep_everything`	Defaults to FALSE. If TRUE, all text included in the page is returned as a single string.
`extract_text`	Defaults to TRUE. If TRUE, text is extracted.
`as_character`	Defaults to TRUE. If FALSE, and if `extract_text` is set to FALSE, then an `xml_nodeset` object is returned.
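
The warning on `exclude_xpath` follows from how `xml2` documents behave: they are modified in place rather than copied. Below is a minimal sketch of that effect using `xml2` directly. The inline HTML is made up for illustration, and this is not necessarily how `cas_extract_html()` removes nodes internally.

```r
library(xml2)

# Hypothetical page content, used only for illustration
html_string <- "<html><body><div class='article'><p>Body text.</p><script>var x = 1;</script></div></body></html>"
doc <- read_html(html_string)

# The script node is present before removal
n_before <- length(xml_find_all(doc, "//script"))

# Removing the matching nodes changes `doc` itself; no copy is made
xml_remove(xml_find_all(doc, "//script|//iframe|//img|//style"))

# The same object no longer contains the script node
n_after <- length(xml_find_all(doc, "//script"))
remaining_text <- xml_text(xml_find_first(doc, "//div[@class='article']"))

# Re-parsing from the original source gives an untouched document again
fresh <- read_html(html_string)
n_fresh <- length(xml_find_all(fresh, "//script"))
```

If several extractors need the same page, one option is to re-parse it from the original source before each call that sets `exclude_xpath`.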

Value

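A character vector of length one, unless both `extract_text` and `as_character` are set to FALSE, in which case an `xml_nodeset` object is returned.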
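
To make the difference between the two return types concrete, here is a small sketch using `rvest` and `stringr` directly rather than `cas_extract_html()` itself. It shows what an `xml_nodeset` is, and how the `trim` and `squish` options would plausibly change an extracted string. The inline HTML and the `div.lead` selector are assumptions made for illustration.

```r
library(rvest)
library(stringr)

# Hypothetical snippet with untidy whitespace inside the container
doc <- read_html("<html><body><div class='lead'>  First line\n   second   line  </div></body></html>")

# Selecting nodes without extracting text gives an xml_nodeset
nodes <- html_elements(doc, "div.lead")
node_class <- class(nodes)

# Extracting text gives a character vector
raw_text <- html_text(nodes)

# Equivalent of trim = TRUE: only leading and trailing whitespace is removed
trimmed <- str_trim(raw_text)

# Equivalent of squish = TRUE: inner runs of whitespace are collapsed as well
squished <- str_squish(raw_text)
```

With `squish`, the line break inside the text is replaced by a single space, which is usually preferable for titles and other short fields.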

Examples

## Not run: 
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />

  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )


  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}

## End(Not run)
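
The examples above need a live page. As an offline sketch, the same three meta tags can be placed in an inline document and queried with plain `rvest` calls that select roughly what the calls above select. The helper `meta_content()` and the XPath it builds are hypothetical and only illustrate the container/attribute logic; they are not taken from the `castarter` source.

```r
library(rvest)

# Inline document containing the three tags used in the examples
html_string <- '<html><head>
<meta name="twitter:title" content="Example title" />
<meta name="keywords" content="various;keywords;">
<meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
</head><body><p>Text</p></body></html>'
doc <- read_html(html_string)

# Hypothetical helper: find a <meta> tag by one attribute and
# return the value of its `content` attribute (NA if not found)
meta_content <- function(doc, attr_name, attr_value) {
  xpath <- sprintf("//meta[@%s='%s']", attr_name, attr_value)
  html_attr(html_element(doc, xpath = xpath), "content")
}

title <- meta_content(doc, "name", "twitter:title")
keywords <- meta_content(doc, "name", "keywords")
published <- meta_content(doc, "property", "article:published_time")

# A tag that is not in the document
missing <- meta_content(doc, "name", "description")
```

The same `doc` object could also be passed as `html_document` to `cas_extract_html()`, which makes it possible to try out container and attribute combinations without a network connection.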

giocomai/castarter documentation built on June 12, 2025, 8:49 p.m.