GetText | R Documentation |
These functions provide access to the information about each recognized element in a tesseract object. These work at different element levels (characters, words, lines). We can get the recognized elements, their locations in the image, the confidence/certainty of the recognition, and possible alternative characters along with their confidences.
There are methods that work on the filename of an image or on an existing tesseract object.
GetBoxes(obj, level = 3L, keepConfidence = TRUE, asMatrix = FALSE, ...)
GetConfidences(obj, level = 3L, ...)
GetText(obj, level = 3L, ...)
GetAlternatives(obj, ...)
obj |
the name of the file containing the image on which to do
OCR
or the |
level |
the type of element to be recognized - block, paragraph,
word or symbol. This should be a value from the |
keepConfidence |
a logical value. If |
asMatrix |
a logical value that controls whether the bounding box information is returned as a matrix or a data frame. The matrix has the text corresponding to the box as the row names. The data frame has the text as an additional column. |
... |
other arguments that are currently ignored. |
GetBoxes, by default, returns a data.frame with 6 elements. These are
left , bottom , right , top |
the location of the box as measured from the top-left corner of the image, so top and bottom increase going down the image. That is, larger values of top and bottom correspond to positions lower on the image. Note that top is larger than bottom since it is measured from the origin (0), but is lower on the page as we view it. |
text |
the recognized text |
confidence |
the accuracy/credibility (as a percentage) reported by tesseract |
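As a rough illustration of working with a result of this shape, the sketch below builds a small data frame by hand with the six columns described above and then filters it by confidence. The values are invented and the data frame is only a stand-in for real GetBoxes output; no Rtesseract functions are called.

```r
# Hypothetical stand-in for the data frame returned by GetBoxes();
# all values are made up for illustration.
boxes <- data.frame(
  left = c(10, 60, 10),
  bottom = c(12, 12, 40),
  right = c(50, 120, 90),
  top = c(30, 30, 58),
  text = c("Hello", "world", "again"),
  confidence = c(95.2, 61.7, 88.0),
  stringsAsFactors = FALSE
)

# Box sizes in pixels. abs() avoids depending on which of
# top and bottom holds the larger value.
boxes$width <- boxes$right - boxes$left
boxes$height <- abs(boxes$top - boxes$bottom)

# Keep only the elements recognized with at least 80% confidence.
confident <- boxes[boxes$confidence >= 80, ]
confident$text
```

The 80% threshold is arbitrary; a suitable cutoff depends on the image quality and on how costly a misread element is.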
The basic class of this data frame is OCRResults, and there is a plot() method for this class that displays the recovered words as they appear on the image.
The specific class now identifies the level used to compute the
bounding boxes, e.g. Symbol, Word, Textline, Para, Block.
These class names are SymbolOCRResults, WordOCResults, ...
Methods are currently only defined for the common base class
OCRResults.
Additionally, if keepConfidence is TRUE, an additional S3 class name is inserted: OCRResultsConfidence.
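To show what a layered S3 class vector of this kind means in practice, here is a minimal sketch using a hand-built object. The order of the class names and the describe() generic are assumptions made for the example; they are not taken from Rtesseract.

```r
# Hypothetical object; the order of the class names is an assumption.
res <- structure(
  data.frame(text = c("Hello", "world"), confidence = c(95, 62)),
  class = c("SymbolOCRResults", "OCRResultsConfidence",
            "OCRResults", "data.frame")
)

# A toy generic that is NOT part of Rtesseract, with a method defined
# only for the common base class.
describe <- function(x, ...) UseMethod("describe")
describe.OCRResults <- function(x, ...) {
  sprintf("%d recognized elements", nrow(x))
}

# Dispatch skips the classes without a method and reaches OCRResults.
msg <- describe(res)
msg

# inherits() tests for a class regardless of its position in the vector.
inherits(res, "OCRResults")
inherits(res, "OCRResultsConfidence")
```

Because dispatch walks the class vector from left to right, a method written for OCRResults applies to results computed at any level.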
Alternatively, GetBoxes can return a matrix with 4 columns giving the lower-left and upper-right corners (xbottom, ybottom, xtop, ytop), with the text found in each box as the row names.
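The difference between the two return forms can be sketched with base R alone. The example below converts a hand-built data frame into a matrix that carries the text as row names; the column names and values are assumptions chosen for illustration, and no Rtesseract functions are called.

```r
# Hypothetical box data; values and column names are for illustration only.
boxes <- data.frame(
  left = c(10, 60, 130),
  bottom = c(12, 12, 12),
  right = c(50, 120, 170),
  top = c(30, 30, 30),
  text = c("the", "cat", "the"),
  stringsAsFactors = FALSE
)

# Matrix form: four numeric columns, with the text moved to the row names.
m <- as.matrix(boxes[, c("left", "bottom", "right", "top")])
rownames(m) <- boxes$text

# Matrix row names may repeat, so indexing with m["the", ] would return
# only the first match. Compare against rownames() to get every match.
hits <- m[rownames(m) == "the", , drop = FALSE]
hits
```

The matrix form is compact for purely numeric work, while the data frame form keeps the text as an ordinary column that is easier to filter and merge.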
GetConfidences returns a numeric value between 0 and 100 indicating the confidence level of the recognized term.
GetText returns the text of all the recognized elements.
GetAlternatives returns a named numeric vector giving the confidence levels for the possible characters that are legitimate alternatives to the recognized value. The names are the alternative characters and the values are the associated confidences.
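A named numeric vector of this form is easy to rank and filter with base R. The sketch below uses a hand-written vector in place of real GetAlternatives output; the characters and confidence values are invented.

```r
# Hypothetical alternatives for one recognized symbol;
# names are candidate characters, values are made-up confidences.
alts <- c(l = 72.5, "1" = 64.1, I = 58.3, "|" = 40.0)

# Rank the candidates from most to least confident.
ranked <- sort(alts, decreasing = TRUE)

# The most likely character is the name of the first ranked entry.
best <- names(ranked)[1]

# All candidates at or above a chosen confidence threshold.
plausible <- names(alts)[alts >= 50]

best
plausible
```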
Duncan Temple Lang
Tesseract https://code.google.com/p/tesseract-ocr/
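# Create a tesseract object for a sample image and run the recognition
f = system.file("images", "OCRSample2.png", package = "Rtesseract")
api = tesseract(f)
Recognize(api)
# Bounding boxes at the default level
boxes = GetBoxes(api)
rownames(boxes)
boxes
# The same functions also accept the image file name directly
f = system.file("images", "DifferentFonts.png", package = "Rtesseract")
GetBoxes(f, "word")
GetBoxes(f, "symbol")
GetConfidences(f)
# Reuse one tesseract object for several queries
ts = tesseract(f)
Recognize(ts)
GetBoxes(ts)
GetConfidences(ts)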