knitr::opts_chunk$set( collapse = TRUE, comment = "##", fig.path = "README-" )
options("width"=110)
tmp <- packageDescription( basename(getwd()) )
cat("##", tmp$Title)
filelist.R     <- list.files("R", recursive = TRUE, pattern="\\.R$", ignore.case = TRUE, full.names = TRUE)
filelist.tests <- list.files("tests", recursive = TRUE, pattern="\\.R$", ignore.case = TRUE, full.names = TRUE)
filelist.cpp   <- list.files("src", recursive = TRUE, pattern="\\.cpp$", ignore.case = TRUE, full.names = TRUE)

lines.R     <- unlist(lapply(filelist.R, readLines, warn = FALSE))
lines.tests <- unlist(lapply(filelist.tests, readLines, warn = FALSE))
lines.cpp   <- unlist(lapply(filelist.cpp, readLines, warn = FALSE))

length.R     <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.R, value = TRUE, invert = TRUE))
length.tests <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.tests, value = TRUE, invert = TRUE))
length.cpp   <- length(grep("(^\\s*$)|(^\\s*#)|(^\\s*//)", lines.cpp, value = TRUE, invert = TRUE))
Status
lines of R code: `r length.R`, lines of test code: `r length.tests`
Development version
source_files <- grep(
  "/R/|/src/|/tests/",
  list.files(recursive = TRUE, full.names = TRUE),
  value = TRUE
)
last_change <- as.character(
  format(max(file.info(source_files)$mtime), tz="UTC")
)
cat(tmp$Version)
cat(" - ")
cat(stringr::str_replace(last_change, " ", " / "))
Description
cat(tmp$Description)
License
cat(tmp$License, "<br>")
cat(tmp$Author)
Citation
citation("robotstxt")
BibTeX for citing
toBibtex(citation("robotstxt"))
Contribution - AKA The-Think-Twice-Be-Nice-Rule
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms:
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, ethnicity, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers who do not follow the Code of Conduct may be removed from the project team.
Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by opening an issue or contacting one or more of the project maintainers.
This Code of Conduct is adapted from the Contributor Covenant (https://www.contributor-covenant.org/), version 1.0.0, available at https://www.contributor-covenant.org/version/1/0/0/code-of-conduct/
Installation and start - stable version
install.packages("robotstxt")
library(robotstxt)
Installation and start - development version
devtools::install_github("ropensci/robotstxt")
library(robotstxt)
Robotstxt class documentation
?robotstxt
Simple path access right checking (the functional way) ...
library(robotstxt)
options(robotstxt_warn = FALSE)

paths_allowed(
  paths  = c("/api/rest_v1/?doc", "/w/"),
  domain = "wikipedia.org",
  bot    = "*"
)

paths_allowed(
  paths = c(
    "https://wikipedia.org/api/rest_v1/?doc",
    "https://wikipedia.org/w/"
  )
)
... or (the object oriented way) ...
library(robotstxt)
options(robotstxt_warn = FALSE)

rtxt <- robotstxt(domain = "wikipedia.org")

rtxt$check(
  paths = c("/api/rest_v1/?doc", "/w/"),
  bot   = "*"
)
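For experimenting without network access, the class can also be fed the file content directly. The following is a minimal sketch, assuming the text argument of robotstxt() is used in place of a download; the domain and the rules are made up for illustration.

```r
library(robotstxt)

# made-up robots.txt content, supplied directly instead of being downloaded
rtxt_text <- "User-agent: *\nDisallow: /private/\n"

# assumption: passing `text` skips the retrieval step entirely
rtxt_offline <- robotstxt(domain = "example.com", text = rtxt_text)

# check one path below the disallowed folder and one outside of it
offline_check <- rtxt_offline$check(
  paths = c("/private/secret.html", "/public/index.html"),
  bot   = "*"
)
offline_check
```

Under these rules the first path should be reported as not allowed and the second as allowed.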
Retrieving the robots.txt file for a domain:
# retrieval
rt <- get_robotstxt("https://petermeissner.de")

# printing
rt
Checking whether or not one is supposedly allowed to access some resource from a web server is, unfortunately, not just a matter of downloading and parsing a simple robots.txt file.
First, there is no official specification for robots.txt files, so every robots.txt file written and every robots.txt file read and used is an interpretation. Most of the time we all share a common understanding of how things are supposed to work, but things get more complicated at the edges.
Some interpretation problems:
Because the interpretation of robots.txt rules depends not only on the rules specified within the file, the package implements an event handler system that allows events to be interpreted and re-interpreted into rules.
Under the hood, the rt_request_handler() function is called within get_robotstxt().
This function takes an {httr} request-response object and a set of event handlers. Processing the request and the handlers, it checks for various events and states around getting the file and reading in its content. If an event or state occurred, the event handlers are passed on to request_handler_handler() for problem resolution and for collecting robots.txt file transformations.
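To make the idea of "events" more tangible, here is a deliberately simplified sketch in base R. The helper classify_status() is hypothetical and not part of the package; it only looks at the HTTP status code, and the exact cut-offs are an assumption. The real handler also considers things like domain changes, file type, and suspicious content.

```r
# Hypothetical helper, not part of {robotstxt}: map an HTTP status code to the
# name of the event that a handler rule would be looked up for.
classify_status <- function(status_code) {
  if (status_code >= 500) {
    "on_server_error"
  } else if (status_code == 404) {
    "on_not_found"
  } else if (status_code >= 400) {
    "on_client_error"
  } else if (status_code >= 300) {
    "on_redirect"
  } else {
    "none"
  }
}

# classify a handful of typical status codes
status_events <- vapply(
  c(200, 301, 404, 403, 503),
  classify_status,
  character(1)
)
status_events
```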
Event handler rules can either be lists of four items or functions; the former is the usual case and is what the package itself uses throughout.
Functions like paths_allowed() have parameters that allow passing along handler rules or handler functions.
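As a hedged illustration of passing a rule along, the sketch below builds a rule list and hands it to paths_allowed(). The wrapper paths_allowed_strict() and the rule values are made up for this example, and it assumes that paths_allowed() accepts the rule through a parameter named on_not_found.

```r
# A handler rule made of the four items a rule list consists of.
# The values are illustrative, not the package defaults.
strict_not_found <- list(
  over_write_file_with = "User-agent: *\nDisallow: /",
  signal               = "warning",
  cache                = TRUE,
  priority             = 1
)

# Hypothetical wrapper: treat a missing robots.txt file as "nothing allowed".
# Assumes paths_allowed() takes the rule via its on_not_found parameter.
paths_allowed_strict <- function(paths, domain) {
  robotstxt::paths_allowed(
    paths        = paths,
    domain       = domain,
    on_not_found = strict_not_found
  )
}
```

Calling paths_allowed_strict("/some/path", "example.com") would then issue a real HTTP request, so the wrapper is only defined here and not executed.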
Handler rules are lists with the following items:
over_write_file_with: if the rule is triggered and has a higher priority than the rules applied beforehand (i.e. the new priority has a higher value than the old priority), then the robots.txt file retrieved will be overwritten by this character vector
signal: might be "message", "warning", or "error" and will use the corresponding signal function to signal the event/state just handled. Signaling a warning or a message can be suppressed by setting the function parameter warn = FALSE.
cache: whether the package is allowed to cache the results of the retrieval
priority: the priority of the rule, specified as a numeric value; rules with higher priority are allowed to overwrite robots.txt file content changed by rules with lower priority
The package knows the following rules with the following defaults:
on_server_error: on_server_error_default
on_client_error: on_client_error_default
on_not_found: on_not_found_default
on_redirect: on_redirect_default
on_domain_change: on_domain_change_default
on_file_type_mismatch: on_file_type_mismatch_default
on_suspect_content: on_suspect_content_default
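The priority mechanism described for over_write_file_with can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: apply_handler_rule() is a hypothetical helper, and the rule values are invented for the example.

```r
# Hypothetical helper (not part of {robotstxt}): apply one handler rule to a
# running state, overwriting the file content only if the rule's priority is
# higher than the priority of whatever was applied before.
apply_handler_rule <- function(state, rule) {
  if (!is.null(rule$over_write_file_with) && rule$priority > state$priority) {
    state$rtxt     <- rule$over_write_file_with
    state$priority <- rule$priority
  }
  state
}

# two illustrative rules; the values are made up for this sketch
rule_not_found <- list(
  over_write_file_with = "User-agent: *\nAllow: /",
  signal   = "warning",
  cache    = TRUE,
  priority = 1
)
rule_server_error <- list(
  over_write_file_with = "User-agent: *\nDisallow: /",
  signal   = "error",
  cache    = FALSE,
  priority = 20
)

# start with the content as retrieved and no rule applied yet
state <- list(rtxt = "", priority = 0)
state <- apply_handler_rule(state, rule_server_error)
state <- apply_handler_rule(state, rule_not_found)  # lower priority: no overwrite
state$rtxt
```

Because the second rule has the lower priority, the content written by the first rule stays in place regardless of the order in which the events are processed afterwards.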
from version 0.7.x onwards
While previous releases were concerned with implementing parsing and permission checking and with improving performance, the 0.7.x release is foremost about robots.txt retrieval. Although retrieval was already implemented, there are corner cases in the retrieval stage that may well influence the interpretation of the permissions granted.
Features and Problems handled:
Design Decisions
By default all functions retrieving robots.txt files will warn if there are
The warnings in the following example can be turned off in three ways:
options("robotstxt_warn" = TRUE)
(example)
library(robotstxt)
paths_allowed("petermeissner.de")
(solution 1)
library(robotstxt)
suppressWarnings({
  paths_allowed("petermeissner.de")
})
(solution 2)
library(robotstxt)
paths_allowed("petermeissner.de", warn = FALSE)
(solution 3)
library(robotstxt)
options(robotstxt_warn = FALSE)
paths_allowed("petermeissner.de")
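Solution 3 changes the option for the rest of the session. If the warnings should only be silenced for a single expression, one possible approach is a small wrapper that sets the option and restores it afterwards. The helper with_quiet_robotstxt() below is hypothetical and written in base R only; it assumes the package reads the robotstxt_warn option at call time.

```r
# Hypothetical helper: evaluate an expression with robotstxt_warn = FALSE and
# restore the previous option value afterwards, even if an error occurs.
with_quiet_robotstxt <- function(expr) {
  old <- options(robotstxt_warn = FALSE)
  on.exit(options(old), add = TRUE)
  force(expr)  # the lazily evaluated expression runs while the option is set
}

# demonstration without any network access: inspect the option from inside
options(robotstxt_warn = TRUE)
inside <- with_quiet_robotstxt(getOption("robotstxt_warn"))
after  <- getOption("robotstxt_warn")
```

In practice the expression would be a call such as paths_allowed("petermeissner.de").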
The robots.txt files retrieved are basically mere character vectors:
rt <- get_robotstxt("petermeissner.de")
as.character(rt)
cat(rt)
The last HTTP request is stored in an object
rt_last_http$request
Besides their content, the retrieved robots.txt objects also have some additional information stored as attributes.
names(attributes(rt))
Events that might change the interpretation of the rules found in the robots.txt file:
attr(rt, "problems")
The {httr} request-response object allows digging into what exactly was going on in the client-server exchange.
attr(rt, "request")
... or lets us retrieve the original content given back by the server:
httr::content( x = attr(rt, "request"), as = "text", encoding = "UTF-8" )
... or have a look at the actual HTTP request issued and all response headers given back by the server:
# extract request-response object
rt_req <- attr(rt, "request")

# HTTP request
rt_req$request

# response headers
rt_req$all_headers
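The attribute mechanism used here is plain base R and can be tried without downloading anything. The sketch below uses a made-up character vector and made-up attribute values; it only illustrates how a character vector can carry metadata, not what the package actually stores.

```r
# a plain character vector standing in for a retrieved robots.txt file
rt_demo <- "User-agent: *\nDisallow: /private/"

# attach made-up metadata as attributes; the content itself is unchanged
attr(rt_demo, "problems") <- list(on_redirect = "redirected to www subdomain")
attr(rt_demo, "cached")   <- FALSE

# list the attribute names and read one attribute back
demo_attr_names <- names(attributes(rt_demo))
demo_problems   <- attr(rt_demo, "problems")

# as.character() returns the bare content without the attributes
demo_plain <- as.character(rt_demo)
```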
For convenience, the package also includes an as.list() method for robots.txt files.
as.list(rt)
The retrieval of robots.txt files is cached on a per R-session basis. Restarting an R-session will invalidate the cache. Using the function parameter force = TRUE will also force the package to re-retrieve the robots.txt file.
paths_allowed("petermeissner.de/I_want_to_scrape_this_now", force = TRUE, verbose = TRUE)
paths_allowed("petermeissner.de/I_want_to_scrape_this_now", verbose = TRUE)
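The caching behaviour can be pictured as a per-session lookup table with a force switch. The following base R sketch is an assumption about the general pattern, not the package's internal cache; cached_fetch() and the stand-in fetch function are hypothetical.

```r
# Hypothetical session cache: an environment that lives as long as the R session
cache_env   <- new.env(parent = emptyenv())
fetch_count <- 0

# Return the cached value for a domain; call `fetch` only if nothing is cached
# yet or if force = TRUE is given.
cached_fetch <- function(domain, force = FALSE,
                         fetch = function(d) paste("robots.txt of", d)) {
  if (force || !exists(domain, envir = cache_env, inherits = FALSE)) {
    fetch_count <<- fetch_count + 1
    assign(domain, fetch(domain), envir = cache_env)
  }
  get(domain, envir = cache_env, inherits = FALSE)
}

first  <- cached_fetch("example.com")                # not cached yet: fetches
second <- cached_fetch("example.com")                # served from the cache
third  <- cached_fetch("example.com", force = TRUE)  # fetches again
fetch_count
```

Only the first and the forced call reach the fetch function; the second call is answered from the environment.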