get_robotstxt: downloading robots.txt file

Description Usage Arguments

View source: R/get_robotstxt.R

Description

downloading robots.txt file

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
get_robotstxt(
  domain,
  warn = getOption("robotstxt_warn", TRUE),
  force = FALSE,
  user_agent = utils::sessionInfo()$R.version$version.string,
  ssl_verifypeer = c(1, 0),
  encoding = "UTF-8",
  verbose = FALSE,
  rt_request_handler = robotstxt::rt_request_handler,
  rt_robotstxt_http_getter = robotstxt::get_robotstxt_http_get,
  on_server_error = on_server_error_default,
  on_client_error = on_client_error_default,
  on_not_found = on_not_found_default,
  on_redirect = on_redirect_default,
  on_domain_change = on_domain_change_default,
  on_file_type_mismatch = on_file_type_mismatch_default,
  on_suspect_content = on_suspect_content_default
)

Arguments

domain

domain from which to download robots.txt file

warn

warn about being unable to download domain/robots.txt because of

force

if TRUE instead of using possible cached results the function will re-download the robotstxt file HTTP response status 404. If this happens,

user_agent

HTTP user-agent string to be used to retrieve robots.txt file from domain

ssl_verifypeer

analog to CURL option https://curl.haxx.se/libcurl/c/CURLOPT_SSL_VERIFYPEER.html – and might help with robots.txt file retrieval in some cases

encoding

Encoding of the robots.txt file.

verbose

make function print out more information

rt_request_handler

handler function that handles request according to the event handlers specified

rt_robotstxt_http_getter

function that executes HTTP request

on_server_error

request state handler for any 5xx status

on_client_error

request state handler for any 4xx HTTP status that is not 404

on_not_found

request state handler for HTTP status 404

on_redirect

request state handler for any 3xx HTTP status

on_domain_change

request state handler for any 3xx HTTP status where domain did change as well

on_file_type_mismatch

request state handler for content type other than 'text/plain'

on_suspect_content

request state handler for content that seems to be something else than a robots.txt file (usually a JSON, XML or HTML)


robotstxt documentation built on Sept. 4, 2020, 1:08 a.m.