robotstxt: A 'robots.txt' Parser and 'Webbot'/'Spider'/'Crawler' Permissions Checker
Version 0.3.2

Provides functions to download and parse 'robots.txt' files. Ultimately the package makes it easy to check if bots (spiders, scrapers, ...) are allowed to access specific resources on a domain.
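In practice a permission check takes only a few lines. The snippet below is a minimal sketch: the domain, paths, and bot name are placeholder values, and it uses the package's paths_allowed() and robotstxt() functions listed in the man pages further down; the exact argument names and the object interface should be confirmed against those help pages.

library(robotstxt)

# check whether specific paths on a domain may be crawled by a given bot;
# this downloads and parses the domain's robots.txt behind the scenes
paths_allowed(paths = c("/images/", "/search"), domain = "example.com", bot = "*")

# alternatively, create a robots.txt object once and query it repeatedly
rtxt <- robotstxt(domain = "example.com")
rtxt$check(paths = c("/images/", "/search"), bot = "*")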

Author: Peter Meissner [aut, cre], Oliver Keys [ctb], Rich Fitz John [ctb]
Date of publication: 2016-12-05 18:28:48
Maintainer: Peter Meissner <retep.meissner@gmail.com>
License: MIT + file LICENSE
URL: https://github.com/ropenscilabs/robotstxt
Package repository: CRAN

Installation

Install the latest version of this package by entering the following in R:
install.packages("robotstxt")

Getting started

README.md
Using Robotstxt


Man pages

get_robotstxt: downloading robots.txt file
guess_domain: function guessing domain from path
named_list: make automatically named list
parse_robotstxt: function parsing robots.txt
path_allowed: check if a bot has permissions to access page
paths_allowed: check if a bot has permissions to access page(s)
print.robotstxt: printing robotstxt
print.robotstxt_text: printing robotstxt_text
remove_domain: function to remove domain from path
robotstxt: Generate a representation of a robots.txt file
rt_cache: get_robotstxt() cache
rt_get_comments: extracting comments from robots.txt
rt_get_fields: extracting permissions from robots.txt
rt_get_fields_worker: extracting robotstxt fields
rt_get_rtxt: load robots.txt files saved along with the package
rt_get_useragent: extracting HTTP useragents from robots.txt
rt_list_rtxt: list robots.txt files saved along with the package
sanitize_path: making paths uniform
sanitize_permissions: transforming permissions into regular expressions (whole...
sanitize_permission_values: transforming permissions into regular expressions (values)
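The parser can also be explored offline using the example robots.txt files shipped in inst/robotstxts (listed under Files below). A minimal sketch combining rt_list_rtxt(), rt_get_rtxt(), and parse_robotstxt() from the index above; the element names of the parsed result (e.g. permissions) are assumptions used to illustrate the idea:

library(robotstxt)

# list the example robots.txt files bundled with the package
rt_list_rtxt()

# load one of them as text and parse it
txt    <- rt_get_rtxt("robots_wikipedia.txt")
parsed <- parse_robotstxt(txt)

# the parsed result holds the extracted fields, e.g. the permission records
parsed$permissions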

Files

inst
inst/robotstxts
inst/robotstxts/robots_new_york_times.txt
inst/robotstxts/disallow_all_for_BadBot.txt
inst/robotstxts/robots_bundestag.txt
inst/robotstxts/robots_pmeissner.txt
inst/robotstxts/robots_wikipedia.txt
inst/robotstxts/robots_yahoo.txt
inst/robotstxts/disallow_some_for_all.txt
inst/robotstxts/disallow_two_at_once.txt
inst/robotstxts/selfhtml_Example.txt
inst/robotstxts/robots_google.txt
inst/robotstxts/host.txt
inst/robotstxts/allow_single_bot.txt
inst/robotstxts/crawl_delay.txt
inst/robotstxts/empty.txt
inst/robotstxts/disallow_all_for_all.txt
inst/robotstxts/testing_comments.txt
inst/robotstxts/robots_spiegel.txt
inst/robotstxts/robots_amazon.txt
inst/doc
inst/doc/using_robotstxt.html
inst/doc/using_robotstxt.R
inst/doc/using_robotstxt.Rmd
tests
tests/testthat.R
tests/testthat
tests/testthat/test_parser.R
tests/testthat/test_permissions.R
tests/testthat/test_robotstxt.R
NAMESPACE
NEWS
R
R/parse_robotstxt.R
R/tools.R
R/robotstxt.R
R/permissions.R
vignettes
vignettes/using_robotstxt.Rmd
README.md
MD5
build
build/vignette.rds
DESCRIPTION
man
man/sanitize_permissions.Rd
man/print.robotstxt_text.Rd
man/rt_get_comments.Rd
man/paths_allowed.Rd
man/parse_robotstxt.Rd
man/guess_domain.Rd
man/rt_list_rtxt.Rd
man/rt_get_rtxt.Rd
man/remove_domain.Rd
man/path_allowed.Rd
man/get_robotstxt.Rd
man/rt_get_fields_worker.Rd
man/sanitize_permission_values.Rd
man/sanitize_path.Rd
man/robotstxt.Rd
man/print.robotstxt.Rd
man/rt_cache.Rd
man/named_list.Rd
man/rt_get_useragent.Rd
man/rt_get_fields.Rd
LICENSE
Please suggest features or report bugs in the GitHub issue tracker.
