pdf_to_txt: Extract text from a pdf and write to a txt file

Description

Extract text from a PDF. The PDF may have a text layer that can be extracted with pdf_text, or it may be image-based and need OCR with Tesseract. Either way, the extracted text is written to a .txt file, with form-feed (\f) characters separating pages.
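
Because pages are separated by form feeds, the output can be split back into pages with base R. The following is a minimal sketch, not part of the package: the temporary file stands in for a file written by pdf_to_txt, and the names txt_path, raw_text and pages are illustrative.

```r
# Stand-in for a file written by pdf_to_txt(): three pages joined by form feeds.
# (Assumption: a real output file has the same page-separator layout.)
txt_path <- tempfile(fileext = ".txt")
writeLines("First page\fSecond page\fThird page", txt_path, sep = "")

# Read the whole file as a single string, then split on the form-feed character.
raw_text <- readChar(txt_path, nchars = file.info(txt_path)$size, useBytes = TRUE)
pages <- strsplit(raw_text, "\f", fixed = TRUE)[[1]]

length(pages)  # number of pages recovered
pages[2]       # text of the second page
```

Reading the file as one string before splitting keeps line breaks inside each page intact, which a line-by-line read would not.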

Usage

pdf_to_txt(file, thres = 0.2, verbose = TRUE, pre_ocr = TRUE,
  force = TRUE)

Arguments

file

Path to the PDF from which text will be extracted

thres

Threshold proportion of blank pages for the PDF to be considered mixed [0.2]

verbose

Whether to print processing messages [TRUE]

pre_ocr

Whether to use an existing text layer if it came from a previous OCR pass [TRUE]

force

Force text extraction even if TXT file exists [TRUE]

Details

Some PDFs include a mix of pages with and without an embedded text layer. Getting text from the text layer is preferable to OCR (most of the time), and to determine which approach to use,
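
How pdf_to_txt makes this choice is not documented here. One plausible reading of thres, offered only as an assumption, is a cutoff on the share of pages whose text layer is empty. The sketch below uses a hypothetical helper, needs_ocr, that is not part of pdftext; it takes a character vector with one element per page, which is the shape that pdftools::pdf_text returns.

```r
# Hypothetical helper, not part of pdftext: decide whether a document
# looks image-based, given its per-page text-layer contents.
# `pages` is a character vector with one element per page.
needs_ocr <- function(pages, thres = 0.2) {
  # A page counts as blank when it contains nothing but whitespace.
  blank <- !grepl("[^[:space:]]", pages)
  # Assumption: OCR is suggested when the blank share exceeds `thres`.
  mean(blank) > thres
}

# One blank page out of five is exactly 0.2, which does not exceed the cutoff.
mostly_text <- c("Intro", "", "Methods", "Results", "Discussion")
# Two blank pages out of three is well above the cutoff.
mostly_scans <- c("", "   \n", "Cover note")

needs_ocr(mostly_text)
needs_ocr(mostly_scans)
```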

Value

Nothing

See Also

pdf_text load_text

Examples

## Not run: 
# Called for its side effect (writing the .txt file); no value is returned.
pdf_to_txt("test.pdf")

## End(Not run)

jacob-ogre/pdftext documentation built on May 18, 2019, 8:01 a.m.