The following functions are implemented:
- `web_client` / `webclient` : Create a new HtmlUnit `WebClient` instance
- `wc_go` : Visit a URL
- `wc_html_nodes` : Select nodes from web client active page HTML content
- `wc_html_text` : Extract text from web client page HTML content
- `wc_html_attr` : Extract attributes from web client page HTML content
- `wc_html_name` : Extract tag names from web client page HTML content
- `wc_headers` : Return response headers of the last web request for the current page
- `wc_browser_info` : Retrieve information about the browser used to create the `webclient`
- `wc_content_length` : Return content length of the last web request for the current page
- `wc_content_type` : Return content type of the last web request for the current page
- `wc_render` : Retrieve current page contents
- `wc_css` : Enable/Disable CSS support
- `wc_dnt` : Enable/Disable Do-Not-Track
- `wc_geo` : Enable/Disable Geolocation
- `wc_img_dl` : Enable/Disable Image Downloading
- `wc_load_time` : Return load time of the last web request for the current page
- `wc_resize` : Resize the virtual browser window
- `wc_status` : Return status code of the last web request for the current page
- `wc_timeout` : Change default request timeout
- `wc_title` : Return page title for the current page
- `wc_url` : Return URL of the current page
- `wc_use_insecure_ssl` : Enable/Disable ignoring SSL validation issues
- `wc_wait` : Block until HtmlUnit has finished executing all background JavaScript tasks
- `hu_read_html` : Read HTML from a URL with browser emulation & in a JavaScript context
- `wc_inspect` : Perform a "Developer Tools"-like network inspection of a URL
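For instance, a typical session strings a few of these together. This is a minimal sketch; the exact argument names for `wc_timeout()` are assumptions, so check each function's help page:

```r
library(htmlunit)

wc <- web_client()               # create an HtmlUnit WebClient
wc <- wc_timeout(wc, 30)         # bump the default request timeout (argument name assumed)
wc %>% wc_go("https://example.com")

wc %>% wc_status()               # status code of the last request
wc %>% wc_title()                # rendered page title
```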
```r
library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")
```
Here's something `xml2::read_html()` cannot do: read the table from https://hrbrmstr.github.io/htmlunitjars/index.html:
```r
test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)
```
☹️
But `hu_read_html()` can!
```r
pg <- hu_read_html(test_url)

html_table(pg)
```
All without needing a separate Selenium or Splash server instance.
We can also get a HAR-like content + metadata dump:
```r
xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>%
  summarise(
    total_size = sum(content_length),
    total_load_time = sum(load_time) / 1000
  )
```
```r
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

wc <- web_client()

wc %>% wc_go("https://usa.gov/")

# if you want to use purrr::map_ functions the result of
# wc_html_nodes() needs to be passed to as.list()

wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)
```
Handy function to get rendered plain text for text mining:
```r
wc %>%
  wc_render("text") %>%
  substr(1, 300) %>%
  cat()
```
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.