README.md

robotstxt


A ‘robots.txt’ Parser and ‘Webbot’/‘Spider’/‘Crawler’ Permissions Checker. Provides functions to download and parse ‘robots.txt’ files. Ultimately the package makes it easy to check if bots (spiders, crawlers, scrapers, …) are allowed to access specific resources on a domain.

Installation

Install from CRAN using

install.packages("robotstxt")

Or install the development version:

devtools::install_github("ropensci/robotstxt")

License

MIT + file LICENSE

Authors and contributors:

  • Pedro Baltazar (ctb), pedrobtz@gmail.com
  • Jordan Bradford (cre), jrdnbradford@gmail.com
  • Peter Meissner (aut), retep.meissner@gmail.com
  • Kun Ren (aut, cph), mail@renkun.me: author and copyright holder of list_merge.R
  • Oliver Keys (ctb): original release code review
  • Rich Fitz John (ctb): original release code review

Citation

To cite package 'robotstxt' in publications use:

  Meissner P, Ren K (2024). _robotstxt: A 'robots.txt' Parser and
  'Webbot'/'Spider'/'Crawler' Permissions Checker_. R package version 0.7.15,
  https://github.com/ropensci/robotstxt, <https://docs.ropensci.org/robotstxt/>.

A BibTeX entry for LaTeX users is

  @Manual{,
    title = {robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker},
    author = {Peter Meissner and Kun Ren},
    year = {2024},
    note = {R package version 0.7.15, https://github.com/ropensci/robotstxt},
    url = {https://docs.ropensci.org/robotstxt/},
  }

Usage

Review the package index reference or use

?robotstxt

for documentation.

Simple path access right checking (the functional way) …

options(robotstxt_warn = FALSE)

paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"),
  domain = "wikipedia.org",
  bot    = "*"
)
##  wikipedia.org
## [1]  TRUE FALSE

paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc",
    "https://wikipedia.org/w/"
  )
)
##  wikipedia.org                       wikipedia.org
## [1]  TRUE FALSE

… or (the object oriented way) …

options(robotstxt_warn = FALSE)

rtxt <- robotstxt(domain = "wikipedia.org")

rtxt$check(
  paths = c("/api/rest_v1/?doc", "/w/"),
  bot   = "*"
)
## [1]  TRUE FALSE
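
When the rules are already at hand as text, the same check can be run without a network request. The sketch below assumes that robotstxt() accepts the file content through its text argument; the rules and paths are made up for illustration and do not come from any real site.

```r
library(robotstxt)

# made-up robots.txt content, not taken from any real site
rules <- "User-agent: *\nDisallow: /private/\n"

# build the object from text instead of downloading it from a domain
rtxt_local <- robotstxt(text = rules)

# check two illustrative paths for all bots
allowed <- rtxt_local$check(
  paths = c("/public/page.html", "/private/data.csv"),
  bot   = "*"
)
allowed
```

This can be handy for testing a rule set before it is deployed, since nothing is retrieved or cached.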

Retrieval

Retrieving the robots.txt file for a domain:

# retrieval
rt <- get_robotstxt("https://petermeissner.de")

# printing
rt
## [robots.txt]
## --------------------------------------
## 
## # just do it - punk

Interpretation

Checking whether or not one is supposedly allowed to access some resource from a web server is - unfortunately - not just a matter of downloading and parsing a simple robots.txt file.

First, for most of its history there was no official specification for robots.txt files (the Robots Exclusion Protocol was only standardized as RFC 9309 in 2022), so every robots.txt file written and every robots.txt file read and used is an interpretation. Most of the time we all have a common understanding of how things are supposed to work, but things get more complicated at the edges.

Some interpretation problems:

Event Handling

Because the interpretation of robots.txt rules depends not just on the rules specified within the file, the package implements an event handler system that allows events to be interpreted and re-interpreted as rules.

Under the hood, the rt_request_handler() function is called within get_robotstxt(). This function takes an {httr} request-response object and a set of event handlers. While processing the request and the handlers, it checks for various events and states around getting the file and reading in its content. If an event/state happened, the event handlers are passed on to request_handler_handler() for problem resolution and for collecting robots.txt file transformations:

Event handler rules can either consist of 4 items or be functions, the former being the usual case and the one used throughout the package itself. Functions like paths_allowed() have parameters that allow passing along handler rules or handler functions.
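
To make the idea of a four-item handler rule concrete, here is a minimal sketch of one written as a plain list. The item names (over_write_file_with, signal, cache, priority) and the on_not_found parameter mentioned in the comment are assumptions that should be checked against the package reference; is_handler_rule() is a hypothetical helper and not part of the package.

```r
# Hypothetical rule: treat a missing robots.txt file as "disallow all".
# Item names are assumptions; verify them against the package reference.
on_not_found_strict <- list(
  over_write_file_with = "User-agent: *\nDisallow: /",  # replacement file content
  signal               = "warning",                      # how the event is signaled
  cache                = FALSE,                          # do not cache this result
  priority             = 1                               # rank among competing rules
)

# Hypothetical helper: does a rule consist of exactly the four expected items?
is_handler_rule <- function(rule) {
  expected <- c("over_write_file_with", "signal", "cache", "priority")
  is.list(rule) && setequal(names(rule), expected)
}

is_handler_rule(on_not_found_strict)

# Such a rule would then be passed along to a function like paths_allowed(),
# presumably via a parameter such as on_not_found (assumed name).
```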

Handler rules are lists with the following items:

The package knows the following rules with the following defaults:

Design Map for Event/State Handling

from version 0.7.x onwards

While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. Retrieval was already implemented, but there are corner cases in the retrieval stage that may well influence the interpretation of the permissions granted.

Features and Problems handled:

Design Decisions

  1. the whole HTTP request-response-chain is checked for certain event/state types
    • server error
    • client error
    • file not found (404)
    • redirection
    • redirection to another domain
  2. the content returned by the HTTP request is checked for
    • mime type / file type specification mismatch
    • suspicious content (file content seems to be JSON, HTML, or XML instead of robots.txt)
  3. state/event handlers define how these states and events are handled
  4. a handler handler executes the rules defined in individual handlers
  5. handlers can be overwritten
  6. handler defaults are defined so that they should always do the right thing
  7. handlers can …
    • overwrite the content of a robots.txt file (e.g. allow/disallow all)
    • modify how problems should be signaled: error, warning, message, none
    • decide whether robots.txt file retrieval should be cached or not
  8. problems (no matter how they were handled) are attached to the robots.txt files as attributes, allowing for …
    • transparency
    • reacting post-mortem to the problems that occurred
  9. all handlers (even the actual execution of the HTTP request) can be overwritten at runtime to inject user-defined behaviour beforehand
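
The design decisions above can be illustrated with a small base R sketch of a handler handler. It is not the package's implementation: apply_rule(), the state list, the event names, and the priority values are all hypothetical, and the rule that only a higher priority may overwrite content is an assumption about how priorities are meant to work.

```r
# Hypothetical "handler handler": apply one handler rule to the retrieval state.
apply_rule <- function(state, event, rule, details = list()) {
  # problems are recorded no matter how the event is handled
  state$problems[[event]] <- details

  # signal the event the way the rule asks for; any other value stays silent
  switch(
    rule$signal,
    error   = stop("robots.txt retrieval: ", event, call. = FALSE),
    warning = warning("robots.txt retrieval: ", event, call. = FALSE),
    message = message("robots.txt retrieval: ", event)
  )

  # assumption: only a rule with higher priority may overwrite the content
  if (!is.null(rule$over_write_file_with) && rule$priority > state$priority) {
    state$content  <- rule$over_write_file_with
    state$priority <- rule$priority
  }

  # a single rule that forbids caching disables it for the whole retrieval
  state$cache <- state$cache && isTRUE(rule$cache)
  state
}

# made-up rules and events, chosen only to show the priority mechanism
strict  <- list(over_write_file_with = "User-agent: *\nDisallow: /",
                signal = "message", cache = FALSE, priority = 20)
lenient <- list(over_write_file_with = "User-agent: *\nAllow: /",
                signal = "nothing", cache = TRUE, priority = 1)

state <- list(content = "", priority = 0, cache = TRUE, problems = list())
state <- apply_rule(state, "event_a", strict,  list(status = 503))
state <- apply_rule(state, "event_b", lenient, list(status = 404))

# attach what happened to the resulting text as attributes
rt_sketch <- structure(state$content, problems = state$problems, cached = state$cache)
```

Here the lenient rule runs last but cannot undo the stricter one because its priority is lower, and both events remain visible in the problems attribute.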

Warnings

By default all functions retrieving robots.txt files will warn if there are

Warnings can be turned off in several ways:

suppressWarnings({
  paths_allowed("PATH_WITH_WARNING")
})
paths_allowed("PATH_WITH_WARNING", warn = FALSE)
options(robotstxt_warn = FALSE)
paths_allowed("PATH_WITH_WARNING")
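
Setting options(robotstxt_warn = FALSE) stays in effect for the rest of the session. A minimal base R sketch of limiting the change to a single call follows; quietly() is a hypothetical helper, not part of the package, and getOption() merely stands in for a call such as paths_allowed("PATH_WITH_WARNING").

```r
# Hypothetical helper: evaluate an expression with robotstxt warnings switched
# off, then restore whatever value the option had before.
quietly <- function(expr) {
  old <- options(robotstxt_warn = FALSE)
  on.exit(options(old), add = TRUE)
  force(expr)  # the expression is evaluated here, while the option is set
}

# stand-in for a retrieval call: report the option value seen inside the helper
inside <- quietly(getOption("robotstxt_warn"))
inside
## [1] FALSE
```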

Inspection and Debugging

The robots.txt files retrieved are basically mere character vectors:

rt <- get_robotstxt("petermeissner.de")

as.character(rt)
## [1] "# just do it - punk\n"

cat(rt)
## # just do it - punk

The last HTTP request is stored in an object

rt_last_http$request
## Response [https://petermeissner.de/robots.txt]
##   Date: 2024-09-02 01:32
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

The retrieved robots.txt objects also have some additional information stored as attributes.

names(attributes(rt))
## [1] "problems" "cached"   "request"  "class"

Events that might change the interpretation of the rules found in the robots.txt file:

attr(rt, "problems")
## $on_redirect
## $on_redirect[[1]]
## $on_redirect[[1]]$status
## [1] 301
## 
## $on_redirect[[1]]$location
## [1] "https://petermeissner.de/robots.txt"
## 
## 
## $on_redirect[[2]]
## $on_redirect[[2]]$status
## [1] 200
## 
## $on_redirect[[2]]$location
## NULL
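
Reacting to such problems after the fact can be as simple as walking the list. The sketch below uses a mock object that mirrors the output above, so it runs without a network request; redirect_targets() is a hypothetical helper, not a package function.

```r
# mock of a retrieved robots.txt object, mirroring the attributes printed above
rt_mock <- structure(
  "# just do it - punk\n",
  problems = list(
    on_redirect = list(
      list(status = 301, location = "https://petermeissner.de/robots.txt"),
      list(status = 200, location = NULL)
    )
  )
)

# Hypothetical helper: collect the redirect targets recorded for a retrieval.
redirect_targets <- function(rt) {
  steps <- attr(rt, "problems")$on_redirect
  # steps without a location (the final response) drop out via unlist()
  unlist(lapply(steps, function(step) step$location))
}

redirect_targets(rt_mock)
## [1] "https://petermeissner.de/robots.txt"
```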

The {httr} request-response object allows us to dig into what exactly was going on in the client-server exchange.

attr(rt, "request")
## Response [https://petermeissner.de/robots.txt]
##   Date: 2024-09-02 01:32
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

… or lets us retrieve the original content given back by the server:

httr::content(
  x        = attr(rt, "request"),
  as       = "text",
  encoding = "UTF-8"
)
## [1] "# just do it - punk\n"

… or have a look at the actual HTTP request issued and all response headers given back by the server:

# extract request-response object
rt_req <- attr(rt, "request")

# HTTP request
rt_req$request
## <request>
## GET http://petermeissner.de/robots.txt
## Output: write_memory
## Options:
## * useragent: libcurl/7.81.0 r-curl/5.2.2 httr/1.4.7
## * ssl_verifypeer: 1
## * httpget: TRUE
## Headers:
## * Accept: application/json, text/xml, application/xml, */*
## * user-agent: R version 4.4.1 (2024-06-14)

# response headers
rt_req$all_headers
## [[1]]
## [[1]]$status
## [1] 301
## 
## [[1]]$version
## [1] "HTTP/1.1"
## 
## [[1]]$headers
## $server
## [1] "nginx/1.10.3 (Ubuntu)"
## 
## $date
## [1] "Mon, 02 Sep 2024 01:32:23 GMT"
## 
## $`content-type`
## [1] "text/html"
## 
## $`content-length`
## [1] "194"
## 
## $connection
## [1] "keep-alive"
## 
## $location
## [1] "https://petermeissner.de/robots.txt"
## 
## attr(,"class")
## [1] "insensitive" "list"       
## 
## 
## [[2]]
## [[2]]$status
## [1] 200
## 
## [[2]]$version
## [1] "HTTP/1.1"
## 
## [[2]]$headers
## $server
## [1] "nginx/1.10.3 (Ubuntu)"
## 
## $date
## [1] "Mon, 02 Sep 2024 01:32:24 GMT"
## 
## $`content-type`
## [1] "text/plain"
## 
## $`content-length`
## [1] "20"
## 
## $`last-modified`
## [1] "Wed, 07 Dec 2022 13:34:14 GMT"
## 
## $connection
## [1] "keep-alive"
## 
## $etag
## [1] "\"63909656-14\""
## 
## $`accept-ranges`
## [1] "bytes"
## 
## attr(,"class")
## [1] "insensitive" "list"

Transformation

For convenience the package also includes an as.list() method for robots.txt files.

as.list(rt)
## $content
## [1] "# just do it - punk\n"
## 
## $robotstxt
## [1] "# just do it - punk\n"
## 
## $problems
## $problems$on_redirect
## $problems$on_redirect[[1]]
## $problems$on_redirect[[1]]$status
## [1] 301
## 
## $problems$on_redirect[[1]]$location
## [1] "https://petermeissner.de/robots.txt"
## 
## 
## $problems$on_redirect[[2]]
## $problems$on_redirect[[2]]$status
## [1] 200
## 
## $problems$on_redirect[[2]]$location
## NULL
## 
## 
## 
## 
## $request
## Response [https://petermeissner.de/robots.txt]
##   Date: 2024-09-02 01:32
##   Status: 200
##   Content-Type: text/plain
##   Size: 20 B
## # just do it - punk

Caching

The retrieval of robots.txt files is cached on a per R-session basis. Restarting an R-session will invalidate the cache. Using the function parameter force = TRUE will also force the package to re-retrieve the robots.txt file.

paths_allowed("petermeissner.de/I_want_to_scrape_this_now", force = TRUE, verbose = TRUE)
##  petermeissner.de                      rt_robotstxt_http_getter: force http get
## [1] TRUE
paths_allowed("petermeissner.de/I_want_to_scrape_this_now", verbose = TRUE)
##  petermeissner.de                      rt_robotstxt_http_getter: cached http get
## [1] TRUE
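
The behaviour can be pictured as a lookup table that lives only as long as the R session. The following is a rough base R sketch of that idea, not the package's actual cache: cache, fetch(), and get_cached() are hypothetical names, and fetch() only counts its calls instead of downloading anything.

```r
# Hypothetical per-session cache: an environment that is gone after a restart.
cache <- new.env(parent = emptyenv())

# stand-in for the real download; it only counts how often it is called
n_downloads <- 0
fetch <- function(domain) {
  n_downloads <<- n_downloads + 1
  paste0("# robots.txt of ", domain, "\n")
}

get_cached <- function(domain, force = FALSE) {
  # download on first use or when explicitly forced, otherwise reuse
  if (force || !exists(domain, envir = cache, inherits = FALSE)) {
    assign(domain, fetch(domain), envir = cache)
  }
  get(domain, envir = cache, inherits = FALSE)
}

first  <- get_cached("example.com")                # downloads
second <- get_cached("example.com")                # served from the cache
third  <- get_cached("example.com", force = TRUE)  # downloads again
n_downloads
## [1] 2
```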

More information




petermeissner/robotstxt documentation built on Sept. 4, 2024, 6:53 a.m.