Archivr is a project by the Qualitative Data Repository that automates the preservation of URLs in web archives.
The easiest way to install is directly from this GitHub repository using the devtools
package:
library(devtools)
install_github("QualitativeDataRepository/archivr")
library(archivr)
The basic function is archiv
, which takes a list of URLs and stores them in
the Wayback Machine. It returns a dataframe containing the callback
data for the service.
arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"))
arc_df$wayback_url
# wayback_url
# 1 http://web.archive.org/web/20190128171132/http://www.example.com
# 2 http://web.archive.org/web/20190128171134/https://github.com/ ...
Archivr can archive all the URLs in a web page.
arc_url_df <- archiv.fromUrl("https://qdr.syr.edu/")
df <- data.frame(arc_url_df$url, arc_url_df$wayback_url)[8,]
# arc_url_df.url arc_url_df.wayback_url
# 8 http://syr.edu http://web.archive.org/web/20170110050058/http://syr.edu/
Archivr will also archive all the URLs in a text file. It has been tested for docx, pdf, and markdown, although other text-based formats should also work. Note that text parsing can run into problems, especially if the document has rich features such as tables or columns.
arc_url_df <- archiv.fromText("path_to_file")
To allow for pre-processing of URLs before archiving, archivr
also provides access to the functions used to extract URLs from a web page (extract_urls_from_webpage("URL")
), from a file (extract_urls_from_text("filepath")
, tested for .docx, markdown, and pdf), and from any supported text file in a folder (extract_urls_from_folder("filepath")
).
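As an illustration, a minimal pre-processing sketch (assuming extract_urls_from_webpage returns a character vector of URLs; the filter pattern is just an example) could narrow down the extracted URLs before passing them to archiv:
library(archivr)
# Extract every URL found on the page.
urls <- extract_urls_from_webpage("https://qdr.syr.edu/")
# Example pre-processing step: drop links to twitter.com before archiving.
urls <- urls[!grepl("twitter\\.com", urls)]
# Archive whatever remains in the Wayback Machine.
arc_df <- archiv(as.list(urls))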
Any of the functions that extract or archive URLs from a document or URL accept an except
parameter: a regular expression (evaluated with R's grepl
function) that excludes matching URLs from extraction and archiving. For example,
arc_url_df <- archiv.fromText("article.pdf", except="https?:\\/\\/(dx\\.)?doi\\.org\\/")
will exclude DOI links from archiving.
You can check whether URLs are archived by the Internet Archive's Wayback Machine:
arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "wayback")
If you wish to use perma.cc's archive, you will need to set your API key using:
set_api_key("YOUR_API_KEY")
If you wish to save the URLs in a particular perma.cc folder, you will need to set the default folder ID using:
set_folder_id("FOLDER_ID")
If you do not remember the IDs of your folders, you can retrieve them in a dataframe using:
get_folder_ids()
You can check your current default folder using:
get_folder_id()
and then you can archive materials:
arc_df <- archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")
To check whether a list of URLs is archived in perma.cc's public API, use:
arc_url_df <- view_archiv(list("www.example.com", "NOTAURL", "www.github.com"), "perma_cc")
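Putting the perma.cc steps together, a minimal workflow sketch (the API key, folder ID, and URLs below are placeholders) looks like:
library(archivr)
set_api_key("YOUR_API_KEY")   # authenticate with perma.cc
set_folder_id("FOLDER_ID")    # destination folder; see get_folder_ids()
# Archive the URLs in the chosen perma.cc folder.
arc_df <- archiv(list("www.example.com", "www.github.com"), "perma_cc")
# Then confirm the URLs appear in perma.cc's public records.
check_df <- view_archiv(list("www.example.com", "www.github.com"), "perma_cc")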
Archivr is a project developed and maintained by the Qualitative Data Repository at Syracuse University, originally authored by Ryan Deschamps (greebie on github.com) and Agile Humanities.