htmldf

build codecov CRAN status cran checks

Overview

The package htmldf contains a single function html_df() which accepts a vector of urls as an input and from each will attempt to download each page, extract and parse the html. The result is returned as a tibble where each row corresponds to a document, and the columns contain page attributes and metadata extracted from the html, including:

Installation

To install the CRAN version of the package:

install.packages('htmldf')

To install the development version of the package:

remotes::install_github('alastairrushworth/htmldf')

Usage

First define a vector of URLs you want to gather information from. The function html_df() returns a tibble where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:

library(htmldf)
library(dplyr)

# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
          "https://www.tensorflow.org/tutorials/images/cnn", 
          "https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")

# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)

# have a quick look at the first page
glimpse(z[1, ])

To see the page titles, look at the titles column.

z %>% select(title, url2)

Where there are tables embedded on a page in the <table> tag, these will be gathered into the list column tables. html_df will attempt to coerce each table to tibble - where that isn't possible, the raw html is returned instead.

z$tables

html_df() does its best to find RSS feeds embedded in the page:

z$rss

html_df() will try to parse out any social profiles embedded or mentioned on the page. Currently, this includes profiles for the sites

bitbucket, dev.to, discord, facebook, github, gitlab, instagram, kakao, keybase, linkedin, mastodon, medium, orcid, patreon, researchgate, stackoverflow, reddit, telegram, twitter, youtube

z$social

Code language is inferred from <code> chunks using a preditive model. The code_lang column contains a numeric score where values near 1 indicate mostly R code, values near -1 indicate mostly Python code:

z %>% select(code_lang, url2)

Publication dates

z %>% select(published, url2)

Comments? Suggestions? Issues?

Any feedback is welcome! Feel free to write a github issue or send me a message on twitter.



alastairrushworth/htmldf documentation built on Aug. 10, 2022, 3:02 p.m.