Use tidy data principles to interact with HTML files!
This package eases web scraping with Selenium by “tidying” the HTML structure. It recursively iterates over web elements up to a given depth and returns a tibble in which child elements are nested in list-columns. Tidy data principles can then be used to identify specific elements and eventually interact with them.
# remotes::install_github("benjaminguinaudeau/tidyweb")
library(tidyweb)
library(dplyr)
library(tidyr)     # separate_rows()
library(stringr)   # str_count(), str_detect()
library(ggplot2)
# Parse the page and select every <article> node
page <- xml2::read_html("https://www.nytimes.com/")
art <- page %>%
  rvest::html_nodes("article")

# Recursively tidy each article into a tibble, nesting children up to depth 10
parsed_art <- art %>% tidy_element(depth = 10)
# Inspect the tidied structure
parsed_art %>% glimpse()

# Keep only the elements that carry a link
parsed_art %>% filter(!is.na(href)) %>% glimpse()
# Split multi-class attributes into one class per row and count them
parsed_art %>%
  separate_rows(class, sep = "\\s+") %>%
  count(class, sort = TRUE) %>%
  glimpse()
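Once the most frequent classes are known, the same tidy verbs can isolate candidate elements and pull out their links. The sketch below is illustrative only: "story-wrapper" is an assumed class pattern, not one guaranteed to be present in the parsed page.

# Hypothetical example: keep elements whose class matches an assumed pattern
# ("story-wrapper") and extract their links
story_links <- parsed_art %>%
  filter(str_detect(class, "story-wrapper"), !is.na(href)) %>%
  pull(href)

head(story_links)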
# The depth of each element is encoded in the number of "_" separators in .id
parsed_art %>%
  mutate(depth = str_count(.id, "_") + 1) %>%
  ggplot(aes(x = depth)) +
  geom_histogram()
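The description above mentions interacting with identified elements through Selenium, which the snippets above stop short of. The following is a minimal sketch of that step, assuming a Selenium server is already running on localhost:4444 and using the first non-missing href from the tidied tibble as an illustrative target; it relies on RSelenium directly rather than on any tidyweb function.

# Minimal sketch: drive a browser with RSelenium and click one of the links
# identified in the tidied tibble (assumes a Selenium server on localhost:4444)
library(RSelenium)

remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("https://www.nytimes.com/")

# Take the first link found among the tidied article elements
target_href <- parsed_art %>% filter(!is.na(href)) %>% pull(href) %>% first()

# Locate the matching <a> element on the live page and click it
elem <- remDr$findElement(using = "css selector", value = paste0("a[href='", target_href, "']"))
elem$clickElement()

remDr$close()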
A huge thank you to Favstats for designing each of the hex-stickers.