Extract text from a PDF, which may have a text layer that can be extracted
with pdf_text, or which may be image-based and need
to be OCR'd with Tesseract. Both routes end with the extracted text written
to a .txt file, with form-feed (\f) characters separating pages.
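As a rough illustration of the output format described above, the following sketch writes a few page strings to a .txt file with form feeds between them and then splits the file back into pages. The `pages` vector and the temporary file are made-up stand-ins, and this is not necessarily how the function itself writes its output.

```r
# Made-up page contents standing in for extracted text, one element per page
pages <- c("First page text", "Second page text", "Third page text")

# Join the pages with form-feed characters and write them as a single string
txt_file <- tempfile(fileext = ".txt")
cat(paste(pages, collapse = "\f"), file = txt_file)

# Read the whole file back as one string, then split it on form feeds
raw <- readChar(txt_file, nchars = file.size(txt_file))
pages_back <- strsplit(raw, "\f", fixed = TRUE)[[1]]

length(pages_back)  # one element per page
```

Note that `strsplit` drops a trailing empty string, so a blank final page would not come back as its own element with this approach.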
file | Path to the PDF from which text will be extracted
thres | Threshold number of blank pages to be considered mixed [0.2]
verbose | Whether to print processing messages [TRUE]
pre_ocr | Use text layer if from previous OCR [TRUE]
force | Force text extraction even if TXT file exists [TRUE]
Some PDFs include a mix of pages with and without an embedded text layer. Getting text from the text layer is preferable to OCR (most of the time), and to determine which approach to use,
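One way such a decision could be sketched is shown below. The helper `classify_pdf_pages` is hypothetical and not part of the documented package, and the reading of `thres` as a fraction of blank pages is an assumption based on its default of 0.2. In practice the `page_text` vector might come from `pdftools::pdf_text(file)`, which returns one string per page.

```r
# Hypothetical helper (not the package's implementation): label a PDF as
# "text", "mixed", or "image" from the per-page text of its text layer.
# Assumption: `thres` is the fraction of blank pages at or above which the
# document is treated as mixed.
classify_pdf_pages <- function(page_text, thres = 0.2) {
  stopifnot(is.character(page_text), length(page_text) > 0)

  # A page counts as blank when its text layer is empty or only whitespace
  blank <- !nzchar(trimws(page_text))
  frac_blank <- mean(blank)

  kind <- if (all(blank)) {
    "image"   # no usable text layer anywhere: OCR every page
  } else if (frac_blank >= thres) {
    "mixed"   # enough blank pages that some pages likely need OCR
  } else {
    "text"    # text layer covers (almost) every page
  }

  list(kind = kind, blank = blank, frac_blank = frac_blank)
}

# Made-up input: two of four pages have no text layer
res <- classify_pdf_pages(c("Intro text", "", "   ", "Closing text"))
res$kind
```

Under these assumptions, the `blank` vector could then be used to send only the pages without a text layer to OCR, keeping the embedded text for the rest.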
Nothing
pdf_text load_text
## Not run:
res <- pdf_to_txt("test.pdf")
## End(Not run)