paths_allowed: check if a bot has permissions to access page(s)

Description Usage Arguments See Also

View source: R/paths_allowed.R

Description

A wrapper around path_allowed that checks a bot's permissions for one or more paths.

Usage

paths_allowed(paths = "/", domain = "auto", bot = "*",
  user_agent = utils::sessionInfo()$R.version$version.string,
  check_method = c("robotstxt", "spiderbar"), warn = TRUE, force = FALSE,
  ssl_verifypeer = c(1, 0), use_futures = TRUE, robotstxt_list = NULL)
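
A minimal call might look like the following sketch (the domain is a placeholder; a real call retrieves the robots.txt file over the network):

```r
library(robotstxt)

# check whether the default bot ("*") may fetch these paths;
# with domain = "auto" the domain is guessed from the full URLs
paths_allowed(
  paths = c(
    "https://example.com/",
    "https://example.com/some/hidden/page.html"
  )
)
# returns a logical vector with one element per path
```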

Arguments

paths

paths for which to check the bot's permissions; defaults to "/"

domain

Domain for which the paths should be checked. Defaults to "auto". If set to "auto", the function will try to guess the domain by parsing the paths argument. Note, however, that these are educated guesses which might fail entirely. To be on the safe side, provide the appropriate domains manually.

bot

name of the bot, defaults to "*"

user_agent

HTTP user-agent string to be used to retrieve robots.txt file from domain

check_method

which method to use for checking – either "robotstxt" for the package's own method or "spiderbar" for spiderbar::can_fetch. Note that spiderbar is currently considered less accurate: its algorithm only takes into consideration the rules for either * or a particular bot, but does not merge them together (see: paste0(system.file("robotstxts", package = "robotstxt"), "/selfhtml_Example.txt"))
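
The merging difference can be sketched offline via the robotstxt_list argument (the robots.txt content below is invented for illustration, and the expected result follows from the description above rather than from a verified run):

```r
library(robotstxt)

# a robots.txt with a general group and a bot-specific group
txt <- paste0(
  "User-agent: *\n",
  "Disallow: /temp/\n",
  "\n",
  "User-agent: mybot\n",
  "Disallow: /intern/\n"
)

# the package's own method merges the "*" rules into mybot's rules,
# so /temp/ should be reported as disallowed for mybot as well;
# spiderbar would consult only the mybot group and allow it
paths_allowed(
  paths          = "/temp/page.html",
  bot            = "mybot",
  check_method   = "robotstxt",
  robotstxt_list = list(txt)
)
```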

warn

warn about being unable to download domain/robots.txt, e.g. because of an HTTP response status 404

force

if TRUE, the function will re-download the robots.txt file instead of using possibly cached results

ssl_verifypeer

analogous to the curl option CURLOPT_SSL_VERIFYPEER (https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html); might help with robots.txt file retrieval in some cases

use_futures

Should future::future_lapply be used for possible parallel/asynchronous retrieval or not. Note: see the help pages and vignettes of the future package on how to set up plans for future execution, because the robotstxt package does not do this on its own.
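
Setting up such a plan might look like the following sketch (the domains are placeholders; see the future package documentation for available plans):

```r
library(future)
library(robotstxt)

# robotstxt does not set up a future plan itself, so do it explicitly;
# multisession starts background R sessions for parallel work
plan(multisession)

# retrieval of the robots.txt files for the two domains
# can now happen in parallel
paths_allowed(
  paths       = c("https://example.com/", "https://example.org/"),
  use_futures = TRUE
)
```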

robotstxt_list

either NULL – the default – or a list of character vectors with one vector per path to check
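
Supplying robotstxt_list makes checks possible without downloading anything; a minimal sketch (the robots.txt content is invented, with one character vector per path as described above):

```r
library(robotstxt)

rtxt <- "User-agent: *\nDisallow: /private/\n"

# one robots.txt character vector per path to check
paths_allowed(
  paths          = c("/", "/private/secret.html"),
  robotstxt_list = list(rtxt, rtxt)
)
```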

See Also

path_allowed


robotstxt documentation built on Nov. 17, 2017, 8:14 a.m.