The package htmldf contains a single function, html_df(), which accepts a vector of URLs as input and, for each one, attempts to download the page, then extract and parse the HTML. The result is returned as a tibble where each row corresponds to a document and the columns contain page attributes and metadata extracted from the HTML, including the page title, embedded tables, RSS feeds, social profiles, inferred code language and publication dates.
To install the CRAN version of the package:
install.packages('htmldf')
To install the development version of the package:
remotes::install_github('alastairrushworth/htmldf')
First define a vector of URLs you want to gather information from. The function html_df() returns a tibble where each row corresponds to a webpage, and each column corresponds to an attribute of that webpage:
library(htmldf)
library(dplyr)

# An example vector of URLs to fetch data for
urlx <- c("https://alastairrushworth.github.io/Visualising-Tour-de-France-data-in-R/",
          "https://medium.com/dair-ai/pytorch-1-2-introduction-guide-f6fa9bb7597c",
          "https://www.tensorflow.org/tutorials/images/cnn",
          "https://www.analyticsvidhya.com/blog/2019/09/introduction-to-pytorch-from-scratch/")

# use html_df() to gather data
z <- html_df(urlx, show_progress = FALSE)

# have a quick look at the first page
glimpse(z[1, ])
To see the page titles, look at the title column.
z %>% select(title, url2)
Where there are tables embedded on a page in the <table> tag, these will be gathered into the list column tables. html_df() will attempt to coerce each table to a tibble; where that isn't possible, the raw HTML is returned instead.
z$tables
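Because some tables may come back as raw HTML rather than tibbles, it can help to check which entries are usable before working with them. The following is a minimal sketch in base R, not part of the package: page_tables is a hand-built, hypothetical stand-in for one element of z$tables, and it assumes that successfully parsed tables inherit from data.frame (as tibbles do) while the fallback does not.

```r
# Hypothetical stand-in for one element of z$tables: two parsed tables
# and one table that could not be coerced (kept here as a raw HTML string).
page_tables <- list(
  data.frame(x = 1:2, y = c("a", "b")),
  "<table><tr><td>malformed</td></tr></table>",
  data.frame(x = 3L, y = "c")
)

# Flag the entries that were successfully coerced to a data frame.
is_parsed <- vapply(page_tables, is.data.frame, logical(1))

# Keep the usable tables; set the rest aside for manual inspection.
parsed_tables <- page_tables[is_parsed]
unparsed      <- page_tables[!is_parsed]

length(parsed_tables)
length(unparsed)
```

The same check could be applied to each page's entry in turn, for example with lapply() over z$tables.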
html_df() does its best to find RSS feeds embedded in the page:
z$rss
html_df() will try to parse out any social profiles embedded or mentioned on the page. Currently, this includes profiles for the sites bitbucket, dev.to, discord, facebook, github, gitlab, instagram, kakao, keybase, linkedin, mastodon, medium, orcid, patreon, researchgate, stackoverflow, reddit, telegram, twitter and youtube.
z$social
Code language is inferred from <code> chunks using a predictive model. The code_lang column contains a numeric score where values near 1 indicate mostly R code and values near -1 indicate mostly Python code:
z %>% select(code_lang, url2)
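To turn the score into a rough label, one option is to apply a cut-off on either side of zero. The sketch below is illustrative only: the scores are made-up values standing in for z$code_lang, the 0.5 threshold is an arbitrary choice rather than anything the package prescribes, and the use of NA for a page without code is an assumption.

```r
# Hypothetical code_lang scores, standing in for z$code_lang.
# NA is used here for a page with no <code> chunks (an assumption).
code_lang <- c(0.95, -0.88, 0.10, NA)

# Illustrative cut-off: scores within 0.5 of zero are treated as unclear.
threshold <- 0.5

label <- ifelse(
  is.na(code_lang), NA_character_,
  ifelse(code_lang >= threshold, "R",
         ifelse(code_lang <= -threshold, "Python", "mixed/unclear"))
)

# Show scores alongside their labels.
data.frame(code_lang, label)
```

A stricter or looser threshold changes how many pages fall into the unclear group, so it is worth checking a few labelled pages by eye.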
Publication dates are stored in the published column:
z %>% select(published, url2)
Any feedback is welcome! Feel free to open a GitHub issue or send me a message on Twitter.