toPDF | R Documentation |
These functions perform OCR on an image and write the output directly
to a file in one of several formats.
One of these, toPDF, identifies characters and words in the image
and creates a searchable, selectable PDF document.
Other formats include HTML (with markup including detailed
information), OSD (Orientation and Script Detection),
and TSV (tab-separated values) with details for each element.
toPDF(imgFile, outFile = removeExtension(imgFile),
renderer = PDFRenderer(outFile, GetDataPath(api), ...),
api = tesseract(, PSM_AUTO), ...)
toHTML(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
api = tesseract(, PSM_AUTO))
toHOcr(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
api = tesseract(, PSM_AUTO))
toTSV(imgFile, fontInfo = TRUE, outFile = removeExtension(imgFile),
api = tesseract(, PSM_AUTO))
toOSD(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO))
toBoxText(imgFile, outFile = removeExtension(imgFile), api = tesseract(, PSM_AUTO))
imgFile |
a character vector of length 1 giving the name of the image file to process with OCR |
outFile |
the name of the output file, without an extension. The extension is added by the function. |
api |
an object of class
|
fontInfo |
a logical value which if |
renderer |
an object of class PDFRenderer. Callers should almost never need to pass this; it is available for overriding the rendering class. The renderer must be garbage collected so that the contents of the generated file are flushed to disk and the file is closed. |
... |
additional arguments passed to |
These work by creating a renderer object of an appropriate C++ class
corresponding to the desired output and then calling the ProcessPages
method for the C++ tesseract object with this renderer.
The output is written to a file rather than to memory.
A character vector of length 1 containing the full name of the output
file generated by the call. This includes the extension tesseract adds
to outFile.
Duncan Temple Lang
tesseract, GetText
f = system.file("images", "1990_p44.png", package = "Rtesseract")
try( toPDF(f, "tmp") ) # may fail if pdf.ttf cannot be found (is it in tesseract 4.0's tessdata?)
toHTML(f, TRUE, "tmp")
o = toTSV(f, TRUE, "tmp")
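# split on tabs only, and disable quote/comment handling so that words
# containing ' or # do not break the parse
d = read.table(o, header = TRUE, sep = "\t", quote = "", comment.char = "", fill = TRUE)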
names(d)
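
As a rough sketch of what can be done with the TSV details once they are read into a data frame, the following keeps only confidently recognized words. It does not call Rtesseract: the sample lines are made up for illustration and merely mimic the column layout that tesseract's TSV output normally uses (level, page_num, ..., conf, text). The threshold of 60 is likewise an arbitrary assumption. In practice one would read the file returned by toTSV instead.

```r
# Hypothetical sample mimicking the column layout of tesseract's TSV output;
# these values are invented for illustration, not produced by OCR.
tsv_lines <- c(
  "level\tpage_num\tblock_num\tpar_num\tline_num\tword_num\tleft\ttop\twidth\theight\tconf\ttext",
  "1\t1\t0\t0\t0\t0\t0\t0\t640\t480\t-1\t",
  "5\t1\t1\t1\t1\t1\t36\t92\t60\t24\t96\tHello",
  "5\t1\t1\t1\t1\t2\t104\t92\t70\t24\t41\tw0rld"
)

# Parse tab-separated fields; quote = "" keeps apostrophes in words intact.
d <- read.delim(text = tsv_lines, quote = "", stringsAsFactors = FALSE)

# Level 5 rows are individual words; keep those at or above an
# (arbitrary, assumed) confidence threshold of 60.
words <- d[d$level == 5 & d$conf >= 60, ]
words$text
```

Rows at other levels (page, block, paragraph, line) carry a conf of -1 and an empty text field, so filtering on level first avoids mixing layout rows with recognized words.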