toPDF: Write Tesseract Results to Various Formats
In duncantl/Rtesseract: Interface to the tesseract OCR system

toPDF

R Documentation

Description

These functions perform OCR on an image and write the output directly to a file in one of several formats. One of them, toPDF, identifies the characters and words in the image and creates a searchable, selectable PDF document. Other formats include HTML (with markup including detailed information), OSD (Orientation and Script Detection), and TSV (tab-separated values) with details for each element.

Usage

toPDF(imgFile, outFile = removeExtension(imgFile),
      renderer = PDFRenderer(outFile, GetDataPath(api), ...),
      api = tesseract(, PSM_AUTO), ...)
toHTML(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
       api = tesseract(, PSM_AUTO))
toHOcr(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
       api = tesseract(, PSM_AUTO))
toTSV(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
      api = tesseract(, PSM_AUTO))
toOSD(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO))
toBoxText(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO))

Arguments

`imgFile`	a character vector of length 1 giving the name of the image file to process with OCR
`outFile`	the name of the output file, without an extension. The extension is added by the function.
`api`	a `TesseractBaseAPI-class` object created via a call to `tesseract`. If this is not provided, a new tesseract instance is created, used and discarded.
`fontInfo`	a logical value. If `TRUE`, information about the fonts is included in the output.
`renderer`	an object of class PDFRenderer. The caller should almost never need to pass this, but it is available if one wants to override the rendering class. Note that the renderer must be garbage collected for the contents of the generated file to be flushed to the file and the file closed.
`...`	additional arguments passed to `PDFRenderer`
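
As a rough sketch of the `api` and `outFile` arguments, the snippet below creates one tesseract instance and passes it to two calls, instead of letting each call create and discard its own. It assumes Rtesseract is installed and that an instance can be reused across calls in this way. The `tempfile()` base name and the variable names are illustrative choices, not part of the package.

```r
# Base name without an extension; each function appends its own.
out_base <- tempfile("ocr_")

# Sketch only: the OCR calls run just when Rtesseract is available.
if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  img <- system.file("images", "1990_p44.png", package = "Rtesseract")

  # One instance, built the same way as the default value of `api`.
  api <- tesseract(, PSM_AUTO)

  # Assumption: the same instance can serve several output formats in turn.
  tsv_file  <- toTSV(img, fontInfo = TRUE, outFile = out_base, api = api)
  html_file <- toHTML(img, fontInfo = TRUE, outFile = out_base, api = api)
  print(c(tsv_file, html_file))
}
```

Sharing an instance avoids repeated initialization, but when in doubt the default (a fresh instance per call) is the safer choice.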

Details

These functions work by creating a renderer object of the C++ class corresponding to the desired output format and then calling the ProcessPages method of the C++ tesseract object with this renderer. The output is written to a file rather than held in memory.

Value

A character vector of length 1 containing the full name of the output file generated by the call. This includes the extension tesseract adds to outFile.
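
Because the extension is added for you, it is usually simpler to work with the returned name than to reconstruct it. A minimal sketch, assuming Rtesseract is installed; the base name `page` in the session's temporary directory is hypothetical:

```r
# Hypothetical base name, deliberately without an extension.
out_base <- file.path(tempdir(), "page")

if (requireNamespace("Rtesseract", quietly = TRUE)) {
  img <- system.file("images", "1990_p44.png", package = "Rtesseract")
  out <- Rtesseract::toTSV(img, TRUE, out_base)

  # Inspect the returned name instead of guessing the extension.
  print(out)
  print(tools::file_ext(out))
  print(file.exists(out))
}
```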

Author(s)

Duncan Temple Lang

Examples

f = system.file("images", "1990_p44.png", package = "Rtesseract")
try( toPDF(f, "tmp") ) # may fail if pdf.ttf cannot be found. Is this file in tesseract 4.0's tessdata?

toHTML(f, TRUE, "tmp")

o = toTSV(f, TRUE, "tmp")
d = read.table(o, header = TRUE, fill = TRUE)
names(d)
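
Recognized text can contain quote characters or `#`, which `read.table` treats as quoting and comment markers by default. The sketch below shows one way to read such a file verbatim with `read.delim`. It does not call Rtesseract: the file is a hand-made stand-in whose column names are assumed to follow tesseract's TSV layout, and the rows are invented for illustration.

```r
# Build a small stand-in TSV file (invented rows, assumed column layout).
tsv <- tempfile(fileext = ".tsv")
writeLines(c(
  paste(c("level", "page_num", "block_num", "par_num", "line_num", "word_num",
          "left", "top", "width", "height", "conf", "text"), collapse = "\t"),
  paste(c(1, 1, 0, 0, 0, 0, 0, 0, 640, 480, -1, ""), collapse = "\t"),
  paste(c(5, 1, 1, 1, 1, 1, 36, 92, 60, 20, 96.5, "It's"), collapse = "\t"),
  paste(c(5, 1, 1, 1, 1, 2, 102, 92, 80, 20, 91.2, "#1"), collapse = "\t"),
  paste(c(5, 1, 1, 1, 1, 3, 190, 92, 44, 20, 88, "\"ok\""), collapse = "\t")
), tsv)

# Disable quote and comment handling so the text column is read as written.
d <- read.delim(tsv, quote = "", comment.char = "", stringsAsFactors = FALSE)

# Keep word-level rows (level 5 in tesseract's TSV) with a valid confidence.
words <- d[d$level == 5 & d$conf >= 0, ]
words$text
```

The same `read.delim` call can be pointed at the file name returned by toTSV, provided the real output follows this layout.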

duncantl/Rtesseract documentation built on Sept. 8, 2024, 8:38 a.m.