get_candidates_fromarchivesearchresults: Gets candidate documents from archivesearchresults.

View source: R/text_analysis.R

get_candidates_fromarchivesearchresults    R Documentation

Gets candidate documents from archivesearchresults.

Description

Matching is somewhat challenging because the url sometimes contains 'download' and sometimes does not. This function includes a hack to match either case. By default it excludes articles from publications in Ireland and Scotland, and documents already classified as 1 (already downloaded), 3 (verbatim repeat), 4 (Ireland), 5 (Scotland) or 6 (abroad).
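
The matching and filtering described above might look roughly like the following base R sketch. It is only an illustration under stated assumptions, not the package's actual implementation: the helper normalise_url, the column name status, and the example.org urls are all hypothetical, and the real function may handle the 'download' case differently.

```r
# Hypothetical helper (not part of durhamevp): drop an optional "download/"
# path segment so that both forms of a url compare equal.
normalise_url <- function(url) {
  sub("/download/", "/", url, fixed = TRUE)
}

# Made-up search results: one url without 'download', one with it.
search_results <- data.frame(
  url = c("https://example.org/viewer/bl/0001",
          "https://example.org/viewer/download/bl/0002"),
  stringsAsFactors = FALSE
)

# Made-up candidate documents; 'status' is an assumed column holding the
# classification code (0 here stands for "not yet classified").
candidate_docs <- data.frame(
  url = c("https://example.org/viewer/download/bl/0001",
          "https://example.org/viewer/bl/0002",
          "https://example.org/viewer/bl/0003"),
  status = c(0, 1, 0),
  stringsAsFactors = FALSE
)

# Keep candidates whose normalised url appears in the search results.
is_match <- normalise_url(candidate_docs$url) %in% normalise_url(search_results$url)
matched <- candidate_docs[is_match, ]

# Drop documents already classified as 1, 3, 4, 5 or 6.
excluded_status <- c(1, 3, 4, 5, 6)
kept <- matched[!matched$status %in% excluded_status, ]

print(kept$url)
# [1] "https://example.org/viewer/download/bl/0001"
```

In this sketch the first candidate is kept even though its url differs from the search result by the 'download' segment, the second matches but is dropped because it is classified as 1, and the third never appears in the search results.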

Usage

get_candidates_fromarchivesearchresults(
  archivesearchresults,
  include_ocr = FALSE,
  restrict_EW = TRUE,
  restrict_classified = TRUE
)

Arguments

archivesearchresults

The archive search results (including a url column)

restrict_EW

If TRUE (the default), remove results published in the Republic of Ireland and Scotland

restrict_classified

If TRUE (the default), remove results already classified as 1, 3, 4, 5 or 6

include_ocr

If TRUE, include the ocr in the download (will slow down the query). Defaults to FALSE

Value

Candidate documents with urls matching the urls in archivesearchresults


gidonc/durhamevp documentation built on April 8, 2022, 10:31 a.m.