GetText: Query the Current OCR Result to Get Text, Bounding Boxes,...

View source: R/ext.R

GetText    R Documentation

Query the Current OCR Result to Get Text, Bounding Boxes, Confidence or Alternative Characters

Description

These functions provide access to the information about each recognized element in a tesseract object. They operate at different element levels (characters, words, lines). We can get the recognized elements, their locations in the image, the confidence/certainty of the recognition, and possible alternative characters along with their confidences.

There are methods that work on the filename of an image or on an existing tesseract object.

Usage

GetBoxes(obj, level = 3L, keepConfidence = TRUE, asMatrix = FALSE, ...)
GetConfidences(obj, level = 3L, ...)
GetText(obj, level = 3L, ...)
GetAlternatives(obj, ...)

Arguments

obj

the name of the file containing the image on which to do OCR, or an object of class TesseractBaseAPI-class, obtained via a call to tesseract

level

the type of element to be recognized - block, paragraph, text line, word or symbol. This should be a value from the PageIteratorLevel enumerated constants. One can specify it either as a value or as a name from that vector.

keepConfidence

a logical value. If TRUE, the final column of the matrix returned by GetBoxes is the confidence associated with the text as determined by the OCR.

asMatrix

a logical value that controls whether the bounding box information is returned as a matrix or a data frame. The matrix has the text corresponding to the box as the row names. The data frame has the text as an additional column.

...

other arguments that are currently ignored.

Value

GetBoxes, by default, returns a data.frame with 6 columns. These are

left,bottom,right,top

the location of the box as measured from the top-left corner of the image, so top and bottom increase going down the image. That is, larger values of top and bottom correspond to positions lower on the image. Note that top is larger than bottom since it is measured from the origin (0), but is lower on the page as we view it.

text

the recognized text

confidence

the accuracy/credibility (as a percentage) reported by tesseract
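
As a minimal sketch of working with these columns, the following builds a small hand-made data frame and derives box sizes and a confidence filter from it. The values are invented for illustration and are not real OCR output, and the threshold of 60 is arbitrary. The height calculation assumes, as described above, that top holds the larger vertical coordinate.

```r
# Hypothetical stand-in for a GetBoxes() result; the numbers are made up.
boxes = data.frame(
  left       = c(10, 60, 130),
  bottom     = c(20, 20, 22),
  right      = c(50, 120, 180),
  top        = c(40, 41, 40),
  text       = c("The", "quick", "f0x"),
  confidence = c(95.2, 88.7, 41.3),
  stringsAsFactors = FALSE
)

# Box dimensions in pixels. Following the description above, top is the
# larger vertical coordinate, so the height is top - bottom.
boxes$width  = boxes$right - boxes$left
boxes$height = boxes$top - boxes$bottom

# Keep only the elements whose confidence falls below an arbitrary threshold.
lowConf = boxes[boxes$confidence < 60, ]
lowConf$text
```

Rows selected this way are natural candidates for manual review or for a second OCR pass with different settings.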

The basic class of this data frame is OCRResults, and there is a plot() method for it that displays the recovered words as they appear on the image. The specific class identifies the level used to compute the bounding boxes, e.g. Symbol, Word, Textline, Para, Block. These class names are SymbolOCRResults, WordOCRResults, ... Methods are currently only defined for the common base class OCRResults.

Additionally, if keepConfidence is TRUE, an additional S3 class name is inserted - OCRResultsConfidence.
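
The class layering can be illustrated with a small sketch of S3 dispatch. The object below is a hand-made stand-in, the ordering of its class vector is an assumption, and the describe() generic is purely illustrative and not part of Rtesseract. It shows how a method defined only on the base class OCRResults is still found for a level-specific object.

```r
# Hypothetical stand-in for a symbol-level result; the class vector ordering
# is an assumption, not taken from the package.
res = data.frame(text = c("a", "b"), confidence = c(90, 75),
                 stringsAsFactors = FALSE)
class(res) = c("SymbolOCRResults", "OCRResults", "data.frame")

# An illustrative generic and method (not part of Rtesseract), with the
# method defined only for the base class.
describe = function(x, ...) UseMethod("describe")
describe.OCRResults = function(x, ...)
  sprintf("%d elements, mean confidence %.1f", nrow(x), mean(x$confidence))

# There is no describe.SymbolOCRResults, so dispatch falls through to the
# OCRResults method.
inherits(res, "OCRResults")
describe(res)
```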

Alternatively, when asMatrix is TRUE, GetBoxes returns a matrix with 4 columns giving the lower-left and upper-right corners (xbottom, ybottom, xtop, ytop), with the text found in each box as the row names.

GetConfidences returns a numeric value between 0 and 100 indicating the confidence level of the recognized term.

GetText returns the text of all the recognized elements.

GetAlternatives returns a named numeric vector giving the confidence levels for the possible characters that are legitimate alternatives to the recognized value. The names are the alternative characters and the values are the associated confidences.
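
A named numeric vector of this shape is straightforward to rank. The sketch below uses an invented vector in place of real GetAlternatives output to show one way of ordering the candidates and picking the most confident one.

```r
# Hypothetical alternatives for one recognized character; values are invented.
alts = c("0" = 71.4, "O" = 64.9, "o" = 52.3, "Q" = 18.0)

# Order the candidates from most to least confident.
ranked = sort(alts, decreasing = TRUE)

# The single most confident alternative.
best = names(alts)[which.max(alts)]
best
ranked
```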

Author(s)

Duncan Temple Lang

References

Tesseract https://code.google.com/p/tesseract-ocr/

Examples

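library(Rtesseract)

# Run OCR on a sample image and retrieve the bounding boxes.
f = system.file("images", "OCRSample2.png", package = "Rtesseract")
api = tesseract(f)
Recognize(api)
boxes = GetBoxes(api)
rownames(boxes)
boxes

# The functions also accept an image file name directly.
f = system.file("images", "DifferentFonts.png", package = "Rtesseract")
GetBoxes(f, "word")
GetBoxes(f, "symbol")
GetConfidences(f)

# The same queries on an explicit tesseract object.
ts = tesseract(f)
Recognize(ts)
GetBoxes(ts)
GetConfidences(ts)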

duncantl/Rtesseract documentation built on March 25, 2022, 5:50 a.m.