cdx_basic_query: Perform a basic/limited Internet Archive CDX resource query...


View source: R/cdx_basic.r

Description

CDX files are "Content Index" files. The Wayback CDX server is a standalone HTTP servlet that serves the index that the Wayback Machine uses to look up captures.

Usage

cdx_basic_query(url, match_type = c("exact", "prefix", "host", "domain"),
  collapse = "urlkey", filter = "statuscode:200", limit = 10000L)

Arguments

url

URL/resource to query for

match_type

How the CDX server should match url. Besides exact matches, the server can return results matching a certain prefix, a certain host, or all subdomains. Must be one of "exact", "prefix", "host", or "domain" (defaults to "exact").

collapse

Collapse results based on a field, or a substring of a field. Collapsing is done on adjacent CDX lines: within a run of lines that share the same value, every capture after the first is filtered out. This is useful for thinning out captures that are 'too dense' or when looking for unique captures. In CDX server terms, collapsing corresponds to one or more collapse=field or collapse=field:N parameters, where N is the number of leading characters of field to compare. Use NULL for no collapsing. The default is to collapse by urlkey (like the web UX). Reference: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server.
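The adjacency rule is easy to misread as "remove all duplicates", so here is a minimal base-R sketch of it on a toy index. It only illustrates the behaviour described above; it is not the CDX server's or this package's code. The helper name collapse_adjacent() and the sample rows are hypothetical.

```r
# Toy capture index (made-up rows). The last row repeats an earlier urlkey
# on purpose, to show that only *adjacent* duplicates are dropped.
captures <- data.frame(
  urlkey    = c("org,example)/", "org,example)/", "org,example)/a", "org,example)/"),
  timestamp = c("20190101000000", "20190101120000", "20190102000000", "20190103000000"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep a row only when its key differs from the key of
# the row immediately before it. `n` mimics collapse=field:N by comparing
# only the first n characters of the field.
collapse_adjacent <- function(df, field, n = NULL) {
  key <- df[[field]]
  if (!is.null(n)) key <- substr(key, 1, n)
  keep <- c(TRUE, key[-1] != key[-length(key)])
  df[keep, , drop = FALSE]
}

by_key <- collapse_adjacent(captures, "urlkey")           # like collapse=urlkey
by_day <- collapse_adjacent(captures, "timestamp", n = 8) # like collapse=timestamp:8
```

Here by_key keeps rows 1, 3 and 4: row 2 is dropped because it follows a row with the same urlkey, while row 4 survives because the row before it has a different key. by_day drops row 2 as well, since rows 1 and 2 share the same first eight timestamp characters (the same day).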

filter

A valid filter string (without the filter= prefix), or NULL. The default filter string is statuscode:200, which retrieves only resources with an HTTP 200 (OK) status code. Set to NULL for no filtering.

limit

Maximum number of results to return (first n results). Use a negative number to retrieve the last n results. Default is 10,000.
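To make the arguments above more concrete, the following sketch shows how they could map onto a CDX server query string. It is an assumption about the shape of the request, not the package's actual implementation: build_cdx_url() is a hypothetical helper, and the endpoint and the matchType parameter name are taken from the public CDX server documentation rather than from this page.

```r
# Hypothetical helper: assemble a CDX server request URL from the same
# arguments that cdx_basic_query() accepts. Nothing is sent over the network.
build_cdx_url <- function(url,
                          match_type = c("exact", "prefix", "host", "domain"),
                          collapse = "urlkey",
                          filter = "statuscode:200",
                          limit = 10000L) {
  match_type <- match.arg(match_type)  # resolves to "exact" when not supplied

  params <- list(
    url       = url,
    matchType = match_type,
    collapse  = collapse,
    filter    = filter,
    limit     = as.integer(limit)      # a negative limit asks for the last n results
  )
  params <- Filter(Negate(is.null), params)  # NULL means "omit this parameter"

  encoded <- vapply(
    params,
    function(p) utils::URLencode(as.character(p), reserved = TRUE),
    character(1)
  )
  query <- paste0(names(params), "=", encoded, collapse = "&")

  # Endpoint assumed from the CDX server documentation.
  paste0("http://web.archive.org/cdx/search/cdx?", query)
}

last_five <- build_cdx_url("example.com", "prefix", collapse = NULL, limit = -5)
```

In this sketch, passing NULL for collapse or filter simply drops that parameter from the query, which matches the "no collapsing" and "no filtering" behaviour described for those arguments.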

Details

The index format is known as 'cdx' and contains various fields representing the capture, usually sorted by URL and date. See http://archive.org/web/researcher/cdx_file_format.php for the field definitions.

Value

data frame

Examples

## Not run: 
rproj_basic <- cdx_basic_query("https://www.r-project.org/")

dplyr::glimpse(rproj_basic)
## Observations: 10,000
## Variables: 7
## $ urlkey     <chr> "org,r-project)/", "org,r-project)/", "org,r-project)/"...
## $ timestamp  <dttm> 2000-06-20, 2000-08-16, 2000-10-12, 2000-11-10, 2000-1...
## $ original   <chr> "http://www.r-project.org:80/", "http://www.r-project.o...
## $ mimetype   <chr> "text/html", "text/html", "text/html", "text/html", "te...
## $ statuscode <chr> "200", "200", "200", "200", "200", "200", "200", "200",...
## $ digest     <chr> "XDIHHFDLIWSZFHYHT453ZL5FYPCKFF6Z", "SRO3WSKQS6HST4PQY7...
## $ length     <dbl> 4894, 5027, 589, 581, 582, 596, 590, 592, 592, 592, 563...

## End(Not run)

hrbrmstr/wayback documentation built on May 17, 2019, 5:53 p.m.