get_gold: Extract text from a PDF with embedded text.


View source: R/gold_std.R

Description

Uses pdftools::pdf_text to extract the text layer from the PDF at 'file'. That text serves as the 'gold standard' against which OCR'd versions are compared. The function also checks that the text layer was distilled from the original document rather than produced by OCR, e.g., by a scanner that runs OCR on its output.
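The flow described above might look roughly like the sketch below. It is not the package's implementation: the names looks_like_ocr and gold_sketch are hypothetical, and deciding OCR origin by matching the PDF's Producer/Creator metadata against a few vendor names is an assumed heuristic (it can misfire on unrelated names that happen to contain "ocr"). Only pdftools::pdf_info and pdftools::pdf_text are real calls.

```r
# Hypothetical helper: guess from PDF metadata strings whether the text
# layer was produced by OCR software. The pattern list is an illustrative
# assumption, not the rule used by get_gold.
looks_like_ocr <- function(meta) {
  any(grepl("ocr|abbyy|finereader|tesseract", as.character(meta),
            ignore.case = TRUE))
}

# Hypothetical sketch of the overall flow (requires the pdftools package
# when called).
gold_sketch <- function(file) {
  info <- pdftools::pdf_info(file)            # document metadata
  meta <- c(info$keys$Producer, info$keys$Creator)
  if (looks_like_ocr(meta)) {
    return(NULL)                              # text layer likely from OCR
  }
  as.list(pdftools::pdf_text(file))           # one element per page
}
```

The metadata check is kept in its own function so it can be tried on plain strings without needing a PDF on disk.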

Usage


Arguments

file

Path to the PDF to be processed

write

Whether to write the text to a file (default: FALSE)

save

Whether to save the text as an .rda file (default: TRUE)
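To illustrate the difference between the two output options, here is a minimal base-R sketch. The page text, file names, and output directory are made up for the example; how get_gold actually names and places its output files is not stated here.

```r
# Hypothetical page text standing in for the function's result.
pages <- list("Page one text.", "Page two text.")

out_dir  <- tempdir()
txt_path <- file.path(out_dir, "test.txt")
rda_path <- file.path(out_dir, "test.rda")

# 'write'-style output: plain text, one page after another.
writeLines(unlist(pages), txt_path)

# 'save'-style output: the R object itself, stored in an .rda file.
save(pages, file = rda_path)

# load() restores the object under its original name; a fresh
# environment keeps it from overwriting 'pages' in the workspace.
env <- new.env()
load(rda_path, envir = env)
```

A plain-text file is convenient for reading or diffing, while the .rda keeps the per-page list structure intact for later comparison in R.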

Value

A list of pages containing the text layer if the layer is not from OCR; otherwise NULL

Examples

# res <- get_gold("test.pdf", "GOLDs")

jacob-ogre/ocrerrors documentation built on May 18, 2019, 8:01 a.m.