{linkrot}

Project Status: Concept. Minimal or no implementation has been done yet, or the repository is only intended to be a limited example, demo, or proof-of-concept.

An R package to help detect linkrot, which is when links on a web page break because the pages they point to have been taken down or moved.

Very much a concept. I wrote it to detect linkrot on my personal blog and it works for my needs. Feel free to contribute.

Install

This package is only available on GitHub. Install from an R session with:

install.packages("remotes")
remotes::install_github("matt-dray/linkrot")

Example

Pass a web page URL to detect_rot() to get a tibble that lists each link on that page along with its HTTP response status code (ideally 200).
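
For reference, HTTP status codes are grouped into classes by their first digit, which is why a 200 counts as a success and a 404 as a client error. The sketch below shows that grouping in base R. The helper name status_class() is hypothetical and is not part of {linkrot}; it only illustrates the standard classes and makes no claim about how the package derives its own category column.

```r
# Hypothetical helper, not part of {linkrot}: map an HTTP status code
# to its standard class based on the first digit (1xx to 5xx).
status_class <- function(code) {
  classes <- c(
    "Informational", "Success", "Redirection",
    "Client error", "Server error"
  )
  first_digit <- code %/% 100
  valid <- !is.na(first_digit) & first_digit >= 1 & first_digit <= 5
  out <- rep(NA_character_, length(code))   # NA for anything outside 100-599
  out[valid] <- classes[first_digit[valid]]
  out
}

status_class(c(200, 301, 404, 500))
```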

Here’s a check on one of my older blog posts. The printout tells you the URL you’re looking at, with a period printed for each link checked.

library(linkrot)
page <- "https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/"
rot_page <- detect_rot(page)
#> Checking <https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/> ..............................
rot_page
#> # A tibble: 30 x 6
#>    page    link_url    link_text response_code response_catego… response_success
#>    <chr>   <chr>       <chr>             <dbl> <chr>            <lgl>           
#>  1 https:… https://ww… R statis…           200 Success          TRUE            
#>  2 https:… https://en… Star Tre…           200 Success          TRUE            
#>  3 https:… http://www… Star Tre…           200 Success          TRUE            
#>  4 https:… https://gi… regex               200 Success          TRUE            
#>  5 https:… http://vit… tidy                200 Success          TRUE            
#>  6 https:… https://en… Wikipedia           200 Success          TRUE            
#>  7 https:… http://sel… Selector…           200 Success          TRUE            
#>  8 https:… https://cr… how-to v…           404 Client error     FALSE           
#>  9 https:… https://ww… htmlwidg…           200 Success          TRUE            
#> 10 https:… https://gi… ggsci               200 Success          TRUE            
#> # … with 20 more rows

Uh oh, at least one is broken: it has a response_code of 404.
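
To see only the broken links, you can subset on the response_success column. Here is a minimal sketch in base R. It uses a small stand-in data frame with a subset of the columns shown above so that it runs without a network connection; the rows and example.com URLs are invented for illustration, and with real output you would skip the first step and use rot_page directly.

```r
# Stand-in for the output of detect_rot(); these rows are made up
rot_page <- data.frame(
  page = "https://example.com/post/",
  link_url = c(
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/c"
  ),
  link_text = c("first", "second", "third"),
  response_code = c(200, 404, 200),
  response_success = c(TRUE, FALSE, TRUE)
)

# Keep only the rows where the check did not succeed
broken <- rot_page[!rot_page$response_success, ]

# Show the columns you need to go and fix the link
broken[, c("link_url", "link_text", "response_code")]
```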

You could iterate over multiple pages with {purrr}:

pages <- c(
  "https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/",
  "https://www.rostrum.blog/2018/04/27/two-dogs-in-toilet-elderly-lady-involved/",
  "https://www.rostrum.blog/2018/05/19/pokeballs-in-super-smash-bros/"
)

library(purrr)
rot_pages <- set_names(map(pages, detect_rot), basename(pages))
#> Checking <https://www.rostrum.blog/2018/04/14/r-trek-exploring-stardates/> ..............................
#> Checking <https://www.rostrum.blog/2018/04/27/two-dogs-in-toilet-elderly-lady-involved/> ........................................
#> Checking <https://www.rostrum.blog/2018/05/19/pokeballs-in-super-smash-bros/> .....................
rot_pages
#> $`r-trek-exploring-stardates`
#> # A tibble: 30 x 6
#>    page    link_url    link_text response_code response_catego… response_success
#>    <chr>   <chr>       <chr>             <dbl> <chr>            <lgl>           
#>  1 https:… https://ww… R statis…           200 Success          TRUE            
#>  2 https:… https://en… Star Tre…           200 Success          TRUE            
#>  3 https:… http://www… Star Tre…           200 Success          TRUE            
#>  4 https:… https://gi… regex               200 Success          TRUE            
#>  5 https:… http://vit… tidy                200 Success          TRUE            
#>  6 https:… https://en… Wikipedia           200 Success          TRUE            
#>  7 https:… http://sel… Selector…           200 Success          TRUE            
#>  8 https:… https://cr… how-to v…           404 Client error     FALSE           
#>  9 https:… https://ww… htmlwidg…           200 Success          TRUE            
#> 10 https:… https://gi… ggsci               200 Success          TRUE            
#> # … with 20 more rows
#> 
#> $`two-dogs-in-toilet-elderly-lady-involved`
#> # A tibble: 40 x 6
#>    page     link_url   link_text response_code response_catego… response_success
#>    <chr>    <chr>      <chr>             <dbl> <chr>            <lgl>           
#>  1 https:/… https://w… @mattdray           200 Success          TRUE            
#>  2 https:/… https://d… the Lond…           200 Success          TRUE            
#>  3 https:/… https://g… the sf p…           200 Success          TRUE            
#>  4 https:/… https://r… interact…           200 Success          TRUE            
#>  5 https:/… https://e… eastings…           200 Success          TRUE            
#>  6 https:/… https://e… latitude            200 Success          TRUE            
#>  7 https:/… https://e… longitude           200 Success          TRUE            
#>  8 https:/… https://r… leaflet             200 Success          TRUE            
#>  9 https:/… https://w… R                   200 Success          TRUE            
#> 10 https:/… https://g… sf (‘sim…           200 Success          TRUE            
#> # … with 30 more rows
#> 
#> $`pokeballs-in-super-smash-bros`
#> # A tibble: 21 x 6
#>    page    link_url    link_text response_code response_catego… response_success
#>    <chr>   <chr>       <chr>             <dbl> <chr>            <lgl>           
#>  1 https:… https://en… Super Sm…           200 Success          TRUE            
#>  2 https:… https://en… Super Sm…           400 Client error     FALSE           
#>  3 https:… https://en… SSB Mele…           200 Success          TRUE            
#>  4 https:… https://en… SSB Braw…           200 Success          TRUE            
#>  5 https:… https://en… SSB ‘4’,…           200 Success          TRUE            
#>  6 https:… https://ww… a series…           200 Success          TRUE            
#>  7 https:… https://en… the Supe…           200 Success          TRUE            
#>  8 https:… https://en… Zelda               200 Success          TRUE            
#>  9 https:… https://en… EarthBou…           200 Success          TRUE            
#> 10 https:… https://en… the Poké…           400 Client error     FALSE           
#> # … with 11 more rows

Uh oh, more broken links: the third page has at least two with a response_code of 400.
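
With results for several pages held in a list, it can be handy to stack them into one table and filter for failures in one go. A minimal sketch in base R follows. The list below is a made-up stand-in for rot_pages with a subset of the columns (the names and example.com URLs are hypothetical), so the code runs offline. dplyr::bind_rows() would be an alternative to do.call(rbind, ...).

```r
# Stand-in for the named list built above; all rows are invented
rot_pages <- list(
  `post-one` = data.frame(
    page = "https://example.com/post-one/",
    link_url = c("https://example.com/a", "https://example.com/b"),
    response_code = c(200, 404),
    response_success = c(TRUE, FALSE)
  ),
  `post-two` = data.frame(
    page = "https://example.com/post-two/",
    link_url = c(
      "https://example.com/c",
      "https://example.com/d",
      "https://example.com/e"
    ),
    response_code = c(200, 400, 400),
    response_success = c(TRUE, FALSE, FALSE)
  )
)

# Stack the per-page results into a single data frame
all_links <- do.call(rbind, rot_pages)

# Keep the failures; the page column records where each link came from
broken <- all_links[!all_links$response_success, ]

# Count broken links per page
table(broken$page)
```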

Code of Conduct

Please note that the {linkrot} project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.



matt-dray/linkrot documentation built on Dec. 21, 2021, 2:54 p.m.