hrbrpkghelpr::global_opts()
hrbrpkghelpr::stinking_badges()
hrbrpkghelpr::yank_title_and_description()

What's Inside The Tin

The following functions are implemented:

DSL

Just the Content (pls)

Content++

Installation

hrbrpkghelpr::install_block()

Usage

library(htmlunit)
library(tidyverse) # for some data ops; not req'd for pkg

# current version
packageVersion("htmlunit")

Here is something xml2::read_html() cannot do: read the table from https://hrbrmstr.github.io/htmlunitjars/index.html.

test_url <- "https://hrbrmstr.github.io/htmlunitjars/index.html"

pg <- xml2::read_html(test_url)

html_table(pg)

☹️

But, hu_read_html() can!

pg <- hu_read_html(test_url)

html_table(pg)

All without needing a separate Selenium or Splash server instance.

Content++

We can also get a HAR-like content + metadata dump:

xdf <- wc_inspect("https://rstudio.com")

colnames(xdf)

select(xdf, method, url, status_code, content_length, load_time)

group_by(xdf, content_type) %>% 
  summarise(
    total_size = sum(content_length), 
    total_load_time = sum(load_time)/1000
  )
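The same data frame can be filtered with base R as well. The sketch below is only an illustration: failed_requests() is a hypothetical helper (not part of htmlunit), the toy data frame is made up, and it assumes status_code and load_time are numeric columns, as the calls above suggest.

```r
# Hypothetical helper (not part of htmlunit): keep requests whose HTTP
# status is 400 or above, ordered from slowest to fastest.
failed_requests <- function(xdf) {
  # Keep rows with a known status code of 400 or higher
  bad <- xdf[!is.na(xdf$status_code) & xdf$status_code >= 400, , drop = FALSE]
  # Order by load time (descending) and keep a few readable columns
  bad[order(bad$load_time, decreasing = TRUE),
      c("url", "status_code", "load_time"), drop = FALSE]
}

# Made-up stand-in for the result of wc_inspect(); values are illustrative only
toy <- data.frame(
  url = c("https://example.com/",
          "https://example.com/missing.js",
          "https://example.com/err"),
  status_code = c(200L, 404L, 500L),
  load_time = c(120, 35, 80),
  stringsAsFactors = FALSE
)

failed_requests(toy)
```

With a real capture, failed_requests(xdf) would be the analogous call, provided the column names match those selected above.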

DSL

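# create a web client that emulates Chrome and inspect its browser info
wc <- web_client(emulate = "chrome")

wc %>% wc_browser_info()

# create a web client with the default settings and visit a page
wc <- web_client()

wc %>% wc_go("https://usa.gov/")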

# to use purrr::map_*() functions, first pass the result of wc_html_nodes() to as.list()

# select links with a CSS selector and extract their text
wc %>%
  wc_html_nodes("a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

# select the same links with XPath
wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_text, trim = TRUE) %>%
  head(10)

# extract the href attribute of each link
wc %>%
  wc_html_nodes(xpath = ".//a") %>%
  sapply(wc_html_attr, "href") %>%
  head(10)

Handy function to get rendered plain text for text mining:

wc %>% 
  wc_render("text") %>% 
  substr(1, 300) %>% 
  cat()
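As a rough idea of what a first text-mining step might look like, here is a minimal base R sketch that counts the most frequent words in a block of plain text. top_terms() is a hypothetical helper (not part of htmlunit), and the sample string is a stand-in for rendered page text.

```r
# Hypothetical helper (not part of htmlunit): return the n most frequent
# words in a block of plain text, using base R only.
top_terms <- function(txt, n = 10) {
  # Lower-case, then split on runs of non-letter characters
  words <- unlist(strsplit(tolower(txt), "[^a-z]+"))
  # Drop empty strings and tokens shorter than three characters
  words <- words[nchar(words) > 2]
  # Tabulate and keep the n most frequent terms
  counts <- sort(table(words), decreasing = TRUE)
  head(counts, n)
}

# Stand-in text; in practice this might be the output of wc_render("text")
sample_txt <- "the fox and the fox and the dog"
top_terms(sample_txt, n = 3)
```

With a live session, wc %>% wc_render("text") %>% top_terms() would be the analogous call, assuming wc_render("text") returns a single character string, as the substr() and cat() usage above suggests.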

htmlunit Metrics

cloc::cloc_pkg_md()

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



hrbrmstr/htmlunit documentation built on Aug. 19, 2020, 3:05 p.m.