toPDF: Write Tesseract Results to Various Formats

toPDF    R Documentation

Write Tesseract Results to Various Formats

Description

These functions run OCR on an image and write the output directly to a file in one of several formats. One of them, toPDF, identifies the characters and words in the image and creates a searchable, selectable PDF document. Other formats include HTML (with markup including detailed information), OSD (Orientation and Script Detection), and TSV (tab-separated values) with details for each element.

Usage

toPDF(imgFile, outFile = removeExtension(imgFile),
        renderer = PDFRenderer(outFile, GetDataPath(api), ...),
        api = tesseract(, PSM_AUTO), ...)
toHTML(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile), 
        api = tesseract(, PSM_AUTO))
toHOcr(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile), 
        api = tesseract(, PSM_AUTO))
toTSV(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile), 
       api = tesseract(, PSM_AUTO))
toOSD(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO)) 
toBoxText(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO))

Arguments

imgFile

a character vector of length 1 giving the name of the image file to process with OCR

outFile

the name of the output file, without an extension. The extension is added by the function.

api

an object of class TesseractBaseAPI, created via a call to tesseract. If this is not provided, a new tesseract instance is created, used, and discarded.

fontInfo

a logical value which, if TRUE, specifies that information about the fonts is included in the output.

renderer

an object of class PDFRenderer. The caller should almost never need to pass this, but it is available for overriding the rendering class. Note that the renderer must be garbage collected for the contents of the generated file to be flushed to disk and the file closed.

...

additional arguments passed to PDFRenderer
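
The outFile convention can be illustrated without running OCR. The following is a minimal sketch, assuming that removeExtension behaves like tools::file_path_sans_ext from base R (the package's own helper may differ in edge cases) and that toPDF appends ".pdf" (also an assumption). The names strip_ext, img, base and pdf_name are hypothetical.

```r
# Stand-in for the package's removeExtension(); an assumption, not its actual code.
strip_ext <- function(path) tools::file_path_sans_ext(path)

img <- "scans/1990_p44.png"          # hypothetical image path
base <- strip_ext(img)               # what the default outFile would be
pdf_name <- paste0(base, ".pdf")     # assumed name of the file toPDF would write

print(base)
print(pdf_name)
```

Passing outFile explicitly, as the Examples do with "tmp", avoids writing the output next to the input image.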

Details

These functions work by creating a renderer object of the C++ class corresponding to the desired output format and then calling the ProcessPages method of the C++ tesseract object with this renderer. The output is written to a file rather than to memory.
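
A hedged sketch of how these pieces might be combined: one api object is shared across calls, output goes under tempdir(), and gc() is called afterwards so that any renderer created inside the functions can be finalized and its file flushed and closed. Passing api by name follows the Usage signatures. Whether sharing one instance is preferable to the default is an assumption, and the block does nothing when Rtesseract is not installed.

```r
# Sketch only: the OCR calls run when Rtesseract is installed; otherwise nothing happens.
ran <- FALSE
if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  f <- system.file("images", "1990_p44.png", package = "Rtesseract")
  out <- file.path(tempdir(), "page")   # outFile is given without an extension

  api <- tesseract(, PSM_AUTO)          # same call as the default api argument

  # Each call is wrapped in try() because it may fail on a given installation
  # (the Examples note that toPDF can fail when pdf.ttf is not found).
  html <- try(toHTML(f, TRUE, out, api = api))
  tsv  <- try(toTSV(f, TRUE, out, api = api))
  pdf  <- try(toPDF(f, out, api = api))

  # Run the garbage collector so renderer objects that are no longer
  # referenced can be finalized and their files closed.
  invisible(gc())

  # On success, each return value is the full name of the file that was written.
  for (res in list(html, tsv, pdf)) {
    if (!inherits(res, "try-error")) print(file.exists(res))
  }
  ran <- TRUE
}
```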

Value

A character vector of length 1 containing the full name of the output file generated by the call. This includes the extension tesseract adds to outFile.

Author(s)

Duncan Temple Lang

See Also

tesseract, GetText

Examples

f = system.file("images", "1990_p44.png", package = "Rtesseract")
try( toPDF(f, "tmp") ) # may fail if pdf.ttf cannot be found; is it in tesseract 4.0's tessdata?

toHTML(f, TRUE, "tmp")

o = toTSV(f, TRUE, "tmp")
d = read.table(o, header = TRUE, fill = TRUE)
names(d)
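
A more defensive way to read the TSV output is sketched below. It assumes the file uses the column layout of tesseract's standard TSV renderer (level, page_num, ..., conf, text). The rows here are made up for illustration and are not OCR results from the sample image.

```r
# Made-up rows in the column layout of tesseract's TSV renderer (an assumption).
tsv_lines <- c(
  "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
  "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
  "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tdon't",
  "5\t1\t1\t1\t1\t2\t102\t92\t20\t24\t91\t#1"
)
tmp <- tempfile(fileext = ".tsv")
writeLines(tsv_lines, tmp)

# Split on tabs only, with quote and comment handling switched off, so that
# quote characters and '#' in the recognized text are kept as data.
d <- read.delim(tmp, quote = "", comment.char = "", stringsAsFactors = FALSE)

# In this layout, level 5 rows are assumed to describe individual words.
words <- d[d$level == 5, c("left", "top", "conf", "text")]
print(words)
```

With a real file, the path returned by toTSV would take the place of tmp.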

duncantl/Rtesseract documentation built on March 25, 2022, 5:50 a.m.