robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker

Provides functions to download and parse 'robots.txt' files. Ultimately, the package makes it easy to check whether bots (spiders, scrapers, ...) are allowed to access specific resources on a domain.
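
To make the kind of permission check described above concrete, here is a minimal sketch of the idea. It is not this package's R API: it uses Python's standard-library urllib.robotparser instead, and the helper name check_paths and the sample robots.txt text are assumptions made up for illustration. It parses robots.txt rules from a string (no download) and reports, per path, whether a given bot may access it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this illustration.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def check_paths(robots_txt, paths, bot="*"):
    """Return one boolean per path: True if `bot` may access it.

    `check_paths` is a hypothetical helper, not a function of the
    robotstxt package.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [parser.can_fetch(bot, path) for path in paths]


if __name__ == "__main__":
    # Any bot: the root is allowed, /private/ is not.
    print(check_paths(ROBOTS_TXT, ["/", "/private/data.html"]))
    # BadBot is disallowed everywhere.
    print(check_paths(ROBOTS_TXT, ["/"], bot="BadBot"))
```

With the sample rules, a generic bot may fetch "/" but not "/private/data.html", while BadBot is refused everywhere.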

Author
Peter Meissner [aut, cre], Oliver Keys [ctb], Rich Fitz John [ctb]
Date of publication
2016-04-28 00:36:35
Maintainer
Peter Meissner <retep.meissner@gmail.com>
License
MIT + file LICENSE
Version
0.3.2

Man pages

get_robotstxt
downloading robots.txt file
guess_domain
function guessing domain from path
named_list
make automatically named list
parse_robotstxt
function parsing robots.txt
path_allowed
check if a bot has permissions to access page
paths_allowed
check if a bot has permissions to access page(s)
print.robotstxt
printing robotstxt
print.robotstxt_text
printing robotstxt_text
remove_domain
function to remove domain from path
robotstxt
Generate a representation of a robots.txt file
rt_cache
get_robotstxt() cache
rt_get_comments
extracting comments from robots.txt
rt_get_fields
extracting permissions from robots.txt
rt_get_fields_worker
extracting robotstxt fields
rt_get_rtxt
load robots.txt files saved along with the package
rt_get_useragent
extracting HTTP useragents from robots.txt
rt_list_rtxt
list robots.txt files saved along with the package
sanitize_path
making paths uniform
sanitize_permissions
transforming permissions into regular expressions (whole...
sanitize_permission_values
transforming permissions into regular expressions (values)

Files in this package

robotstxt
robotstxt/inst
robotstxt/inst/robotstxts
robotstxt/inst/robotstxts/robots_new_york_times.txt
robotstxt/inst/robotstxts/disallow_all_for_BadBot.txt
robotstxt/inst/robotstxts/robots_bundestag.txt
robotstxt/inst/robotstxts/robots_pmeissner.txt
robotstxt/inst/robotstxts/robots_wikipedia.txt
robotstxt/inst/robotstxts/robots_yahoo.txt
robotstxt/inst/robotstxts/disallow_some_for_all.txt
robotstxt/inst/robotstxts/disallow_two_at_once.txt
robotstxt/inst/robotstxts/selfhtml_Example.txt
robotstxt/inst/robotstxts/robots_google.txt
robotstxt/inst/robotstxts/host.txt
robotstxt/inst/robotstxts/allow_single_bot.txt
robotstxt/inst/robotstxts/crawl_delay.txt
robotstxt/inst/robotstxts/empty.txt
robotstxt/inst/robotstxts/disallow_all_for_all.txt
robotstxt/inst/robotstxts/testing_comments.txt
robotstxt/inst/robotstxts/robots_spiegel.txt
robotstxt/inst/robotstxts/robots_amazon.txt
robotstxt/inst/doc
robotstxt/inst/doc/using_robotstxt.html
robotstxt/inst/doc/using_robotstxt.R
robotstxt/inst/doc/using_robotstxt.Rmd
robotstxt/tests
robotstxt/tests/testthat.R
robotstxt/tests/testthat
robotstxt/tests/testthat/test_parser.R
robotstxt/tests/testthat/test_permissions.R
robotstxt/tests/testthat/test_robotstxt.R
robotstxt/NAMESPACE
robotstxt/NEWS
robotstxt/R
robotstxt/R/parse_robotstxt.R
robotstxt/R/tools.R
robotstxt/R/robotstxt.R
robotstxt/R/permissions.R
robotstxt/vignettes
robotstxt/vignettes/using_robotstxt.Rmd
robotstxt/README.md
robotstxt/MD5
robotstxt/build
robotstxt/build/vignette.rds
robotstxt/DESCRIPTION
robotstxt/man
robotstxt/man/sanitize_permissions.Rd
robotstxt/man/print.robotstxt_text.Rd
robotstxt/man/rt_get_comments.Rd
robotstxt/man/paths_allowed.Rd
robotstxt/man/parse_robotstxt.Rd
robotstxt/man/guess_domain.Rd
robotstxt/man/rt_list_rtxt.Rd
robotstxt/man/rt_get_rtxt.Rd
robotstxt/man/remove_domain.Rd
robotstxt/man/path_allowed.Rd
robotstxt/man/get_robotstxt.Rd
robotstxt/man/rt_get_fields_worker.Rd
robotstxt/man/sanitize_permission_values.Rd
robotstxt/man/sanitize_path.Rd
robotstxt/man/robotstxt.Rd
robotstxt/man/print.robotstxt.Rd
robotstxt/man/rt_cache.Rd
robotstxt/man/named_list.Rd
robotstxt/man/rt_get_useragent.Rd
robotstxt/man/rt_get_fields.Rd
robotstxt/LICENSE