OCR_document: Scan PDF with optical character recognition (OCR)

View source: R/03_OCR_document.R

OCR_document    R Documentation

Scan PDF with optical character recognition (OCR)

Description

Extract text that is stored as images in a PDF by running optical character recognition (OCR) software on it. Two methods are currently available: method = "nougat" and method = "tesseract".

Usage

OCR_document(in_path, out_path, method = "nougat", verbose = TRUE)

Arguments

in_path

character. Path to a file with species data in either pdf or txt format, e.g. ./folder/file.pdf

out_path

character. Path to the directory where the OCR output is written, e.g. ./folder

method

character. Method used for the OCR, either "nougat" or "tesseract". Defaults to "nougat".

verbose

logical. Whether to print the output once processing has finished.
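
The arguments above can be checked before a long OCR run is started. The following is a minimal sketch, not part of arete: check_ocr_args is a hypothetical helper, and it assumes that out_path is a directory (as the example call suggests) that may be created if missing.

```r
# Hypothetical helper (not part of arete): validate the arguments that
# would later be passed to OCR_document().
check_ocr_args <- function(in_path, out_path,
                           method = c("nougat", "tesseract")) {
  # match.arg() returns the first choice ("nougat") when `method` is left
  # at its default and raises an error for any other value.
  method <- match.arg(method)

  stopifnot(is.character(in_path), length(in_path) == 1L)
  if (!file.exists(in_path)) {
    stop("Input file not found: ", in_path)
  }

  # Assumption: out_path is a directory and can be created on demand.
  if (!dir.exists(out_path)) {
    dir.create(out_path, recursive = TRUE)
  }

  list(in_path = in_path, out_path = out_path, method = method)
}

# Usage (paths are placeholders):
# args <- check_ocr_args("path/to/file.pdf", "path/to/dir", "tesseract")
# OCR_document(args$in_path, args$out_path, method = args$method)
```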

Details

For now, OCR processing of documents is only supported on Linux systems.
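
Because of this restriction, scripts that also run on other platforms may want to guard the call. A minimal sketch follows; is_linux and safe_ocr are hypothetical helpers, and ocr_fun is a stand-in argument for OCR_document so the sketch can be tried without the package installed.

```r
# TRUE only when the reported system name is Linux.
is_linux <- function(sysname = Sys.info()[["sysname"]]) {
  identical(tolower(sysname), "linux")
}

# Hypothetical wrapper: calls `ocr_fun` (e.g. arete::OCR_document) on Linux
# and returns NA with a warning everywhere else.
safe_ocr <- function(in_path, out_path, ocr_fun, method = "nougat",
                     sysname = Sys.info()[["sysname"]]) {
  if (!is_linux(sysname)) {
    warning("OCR is only supported on Linux; skipping ", in_path)
    return(NA_character_)
  }
  ocr_fun(in_path, out_path, method = method, verbose = FALSE)
}

# Usage (paths are placeholders):
# txt <- safe_ocr("path/to/file.pdf", "path/to/dir", arete::OCR_document)
```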

Value

character. The text extracted from the document.

See Also

arete_setup

Examples

## Not run: 
OCR_document("path/to/file.pdf", "path/to/dir")

## End(Not run)
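
The single-file call above can be extended to a folder of PDFs. This is only a sketch under stated assumptions: ocr_batch is a hypothetical helper, ocr_fun stands in for OCR_document, and a failure on one file is assumed to be signalled as an R error.

```r
# Hypothetical helper: run an OCR function over every PDF in a directory.
ocr_batch <- function(pdf_dir, out_path, ocr_fun, method = "nougat") {
  pdfs <- list.files(pdf_dir, pattern = "\\.pdf$", full.names = TRUE,
                     ignore.case = TRUE)

  results <- lapply(pdfs, function(p) {
    tryCatch(
      ocr_fun(p, out_path, method = method, verbose = FALSE),
      error = function(e) {
        # Keep going when one file fails; record NA for it instead.
        warning("OCR failed for ", p, ": ", conditionMessage(e))
        NA_character_
      }
    )
  })

  # Name each result after its source file.
  names(results) <- basename(pdfs)
  results
}

# Usage (paths are placeholders):
# texts <- ocr_batch("path/to/pdfs", "path/to/dir", arete::OCR_document)
```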

arete documentation built on Nov. 5, 2025, 6:31 p.m.