The following functions are implemented:
- `web_client` / `webclient` : Create a new HtmlUnit `WebClient` instance
- `wc_go` : Visit a URL
- `wc_html_nodes` : Select nodes from web client active page HTML content
- `wc_html_text` : Extract text from web client page HTML content
- `wc_html_attr` : Extract attributes from web client page HTML content
- `wc_html_name` : Extract tag names from web client page HTML content
- `wc_headers` : Return response headers of the last web request for the current page
- `wc_browser_info` : Retrieve information about the browser used to create the `webclient`
- `wc_content_length` : Return content length of the last web request for the current page
- `wc_content_type` : Return content type of the last web request for the current page
- `wc_render` : Retrieve current page contents
- `wc_css` : Enable/Disable CSS support
- `wc_dnt` : Enable/Disable Do-Not-Track
- `wc_geo` : Enable/Disable Geolocation
- `wc_img_dl` : Enable/Disable Image Downloading
- `wc_load_time` : Return load time of the last web request for the current page
- `wc_resize` : Resize the virtual browser window
- `wc_status` : Return status code of the last web request for the current page
- `wc_timeout` : Change default request timeout
- `wc_title` : Return page title for the current page
- `wc_url` : Return URL of the current page
- `wc_use_insecure_ssl` : Enable/Disable ignoring SSL validation issues
- `wc_wait` : Block until HtmlUnit has finished executing all background JavaScript tasks
- `hu_read_html` : Read HTML from a URL with browser emulation & in a JavaScript context
- `wc_inspect` : Perform a "Developer Tools"-like network inspection of a URL
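For instance, a typical session strings a few of these together. This is a minimal sketch; the exact argument names for `wc_timeout()` are assumptions, so check each function's help page:

```r
library(htmlunit)

wc <- web_client()               # create an HtmlUnit WebClient
wc <- wc_timeout(wc, 30)         # bump the default request timeout (argument name assumed)
wc %>% wc_go("https://example.com")

wc %>% wc_status()               # status code of the last request
wc %>% wc_title()                # rendered page title
```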
```r
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
```
Here's something `xml2::read_html()` cannot do: read the table from https://hrbrmstr.github.io/htmlunitjars/index.html:
```r
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
```
☹️
But `hu_read_html()` can!
```r
pg <- hu_read_html(test_url)

html_table(pg)
```
All without needing a separate Selenium or Splash server instance.
We can also get a HAR-like content + metadata dump:
```r
xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time) / 1000
  )
```
```r
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of
# wc_html_nodes() needs to be passed to as.list()

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
```
Handy function to get rendered plain text for text mining:
```r
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.