View source: R/cas_extract_html.R
cas_extract_html | R Documentation |
Facilitates extraction of contents from an html file
cas_extract_html(
html_document,
container = NULL,
container_class = NULL,
container_id = NULL,
container_name = NULL,
container_property = NULL,
container_itemprop = NULL,
container_instance = NULL,
attribute = NULL,
sub_element = NULL,
no_children = NULL,
trim = TRUE,
squish = FALSE,
no_match = "",
exclude_css_path = NULL,
exclude_xpath = NULL,
custom_xpath = NULL,
custom_css_path = NULL,
keep_everything = FALSE,
extract_text = TRUE,
as_character = TRUE
)
html_document |
An html document parsed with |
container |
Defaults to NULL. Type of html container from which content
is to be extracted, such as "div", "ul", and others. Either
|
container_class |
Defaults to NULL. If provided, also |
container_id |
Defaults to NULL. If provided, also |
container_itemprop |
Defaults to NULL. If provided, also |
container_instance |
Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, the relevant occurrence is kept. Use with caution, as not all pages always include the same number of elements of the same class/with the same id. |
attribute |
Defaults to NULL. If given, type of attribute to extract.
Typically used in combination with container, as in
|
sub_element |
Defaults to NULL. If provided, also |
no_children |
Defaults to NULL, in which case all subelements of the
selected combination (e.g. div with given class) are extracted. If TRUE,
only text found under the given combination (but not its subelements) will
be extracted. Corresponds to the xpath string |
trim |
Defaults to TRUE. If TRUE, applies |
squish |
Defaults to FALSE. If TRUE, applies |
no_match |
Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found. |
exclude_css_path |
Defaults to NULL. To remove script, for example, use
|
exclude_xpath |
Defaults to NULL. A common pattern when extracting text
would be |
custom_xpath |
Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead. |
custom_css_path |
Defaults to NULL. If given, all other parameters are ignored and the given CSS path is used instead. |
keep_everything |
Defaults to FALSE. If TRUE, all text included in the page is returned as a single string. |
extract_text |
Defaults to TRUE. If TRUE, text is extracted. |
as_character |
Defaults to TRUE. If FALSE, and if |
A character vector of length one.
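The difference between trim and squish can be illustrated with a minimal base R sketch. The functions that cas_extract_html actually applies are not shown in the argument descriptions above, so trim_text() and squish_text() below are hypothetical helpers written for illustration only, not part of the package.

```r
# Hypothetical helpers, for illustration only (not part of the package).

# Trimming removes whitespace at the start and end of the string.
trim_text <- function(x) {
  trimws(x, which = "both")
}

# Squishing also collapses internal runs of whitespace into a single space.
squish_text <- function(x) {
  trimws(gsub("[[:space:]]+", " ", x), which = "both")
}

raw_text <- "  Example \n   title  "

trimmed <- trim_text(raw_text)     # internal newline and spaces are kept
squished <- squish_text(raw_text)  # internal whitespace is collapsed
```

Trimming alone is usually enough for attribute values such as meta tags, while squishing may be preferable for body text that carries line breaks and indentation from the page source.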
## Not run:
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}
## End(Not run)
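To see what kind of query the meta tag examples above correspond to, the following sketch runs a comparable lookup directly with xml2 on an inline HTML string. The XPath expressions are assumptions chosen for illustration and are not necessarily the ones cas_extract_html builds internally. The second half illustrates the idea behind no_children, i.e. keeping only the text placed directly under an element and not the text of its subelements.

```r
# Inline document, so the sketch runs without network access.
html_document <- xml2::read_html(
  paste0(
    "<html><head>",
    "<meta name=\"twitter:title\" content=\"Example title\" />",
    "</head><body>",
    "<div class=\"lead\">Intro <span>nested</span></div>",
    "</body></html>"
  )
)

# Comparable to container = "meta", container_name = "twitter:title",
# attribute = "content" (XPath written by hand; an assumption).
meta_node <- xml2::xml_find_first(
  html_document,
  "//meta[@name='twitter:title']"
)
title <- xml2::xml_attr(meta_node, "content")

# All text under the div, including the text of its subelements.
div_node <- xml2::xml_find_first(html_document, "//div[@class='lead']")
all_text <- xml2::xml_text(div_node)

# Only the text nodes that are direct children of the div.
own_text <- xml2::xml_text(
  xml2::xml_find_all(html_document, "//div[@class='lead']/text()")
)
```

Writing the query by hand like this can also be a quick way to test an expression before passing it to custom_xpath.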