get_robotstxts: function to get multiple robotstxt files

Description Usage Arguments

View source: R/get_robotstxts.R

Description

function to get multiple robotstxt files

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
get_robotstxts(
  domain,
  warn = TRUE,
  force = FALSE,
  user_agent = utils::sessionInfo()$R.version$version.string,
  ssl_verifypeer = c(1, 0),
  use_futures = FALSE,
  verbose = FALSE,
  rt_request_handler = robotstxt::rt_request_handler,
  rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

domain

domain from which to download robots.txt file

warn

warn about being unable to download domain/robots.txt because of

force

if TRUE instead of using possible cached results the function will re-download the robotstxt file HTTP response status 404. If this happens,

user_agent

HTTP user-agent string to be used to retrieve robots.txt file from domain

ssl_verifypeer

analog to CURL option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html – and might help with robots.txt file retrieval in some cases

use_futures

Should future::future_lapply be used for possible parallel/async retrieval or not. Note: check out help pages and vignettes of package future on how to set up plans for future execution because the robotstxt package does not do it on its own.

verbose

make function print out more information

rt_request_handler

handler function that handles request according to the event handlers specified

rt_robotstxt_http_getter

function that executes HTTP request

on_server_error

request state handler for any 5xx status

on_client_error

request state handler for any 4xx HTTP status that is not 404

on_not_found

request state handler for HTTP status 404

on_redirect

request state handler for any 3xx HTTP status

on_domain_change

request state handler for any 3xx HTTP status where domain did change as well

on_file_type_mismatch

request state handler for content type other than 'text/plain'

on_suspect_content

request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)


robotstxt documentation built on Sept. 4, 2020, 1:08 a.m.