Use tidy data principles to interact with HTML files!
This package eases web scraping with Selenium by “tidying” the HTML structure. It recursively iterates over web elements up to a given depth and returns a tibble in which child elements are nested in list-columns. Tidy data principles can then be used to identify specific elements and eventually interact with them.
# remotes::install_github("benjaminguinaudeau/tidyweb")
library(tidyweb)
library(dplyr)
library(tidyr)     # separate_rows()
library(stringr)   # str_count(), str_detect()
library(ggplot2)
# Parse the page and select every <article> node
page <- xml2::read_html("https://www.nytimes.com/")
art <- page %>%
  rvest::html_nodes("article")

# Recursively tidy each article into a tibble, nesting children up to depth 10
parsed_art <- art %>% tidy_element(depth = 10)
# Inspect the tidied structure
parsed_art %>% glimpse()

# Keep only the elements that carry a link
parsed_art %>% filter(!is.na(href)) %>% glimpse()
# Split multi-class attributes into one class per row and count them
parsed_art %>%
  separate_rows(class, sep = "\\s+") %>%
  count(class, sort = TRUE) %>%
  glimpse()
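Once the most frequent classes are known, the same tidy verbs can isolate candidate elements and pull out their links. The sketch below is illustrative only: "story-wrapper" is an assumed class pattern, not one guaranteed to be present in the parsed page.

# Hypothetical example: keep elements whose class matches an assumed pattern
# ("story-wrapper") and extract their links
story_links <- parsed_art %>%
  filter(str_detect(class, "story-wrapper"), !is.na(href)) %>%
  pull(href)

head(story_links)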
# The depth of each element is encoded in the number of "_" separators in .id
parsed_art %>%
  mutate(depth = str_count(.id, "_") + 1) %>%
  ggplot(aes(x = depth)) +
  geom_histogram()
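The description above mentions interacting with identified elements through Selenium, which the snippets above stop short of. The following is a minimal sketch of that step, assuming a Selenium server is already running on localhost:4444 and using the first non-missing href from the tidied tibble as an illustrative target; it relies on RSelenium directly rather than on any tidyweb function.

# Minimal sketch: drive a browser with RSelenium and click one of the links
# identified in the tidied tibble (assumes a Selenium server on localhost:4444)
library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.nytimes.com/")

# Take the first link found among the tidied article elements
target_href <- parsed_art %>% filter(!is.na(href)) %>% pull(href) %>% first()

# Locate the matching <a> element on the live page and click it
elem <- remDr$findElement(using = "css selector", value = paste0("a[href='", target_href, "']"))
elem$clickElement()

remDr$close()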
A huge thank you to Favstats for designing each of the hex-stickers.