tesseract | R Documentation |
The ocr function hides all of the details of the OCR mechanism.
The tesseract function provides an entry to much more control
and detail about the process and the ability to query the results in
different ways. The tesseract function is the building
block for all the functionality of the package.
To perform OCR on an image, we need a Tesseract API object
and then can call its methods.
We create such an object with the tesseract function.
We typically specify the image, the segmentation mode for identifying the
elements in the page, and also the language for the content.
We can also set any of the over 600 variables that control how
tesseract operates on the image. (See PrintVariables.)
We can create a tesseract object without specifying an image, the segmentation mode, etc. Instead, we can set these later when we use the tesseract object, and we can also update these values to reuse the same tesseract object on different images, with different segmentation modes, etc. Alternatively, we can create separate tesseract objects that work on different images or in different segmentation modes.
The tesseract function, by default, performs the initialization
of the internal C++ object. This can be deferred if necessary (with
init = FALSE). Also, a tesseract object can be re-initialized.
One has to be careful when initializing a tesseract object because
initialization resets the page segmentation mode to a default value,
so it is vital to set the mode after initialization. The tesseract
function takes care of this.
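As a minimal sketch of deferred initialization, using only arguments listed in the usage section: the object is created with init = FALSE and Init is called explicitly afterwards. The requireNamespace guard and the variable name ok are illustrative additions, not part of the package, and the comment about when to set the page segmentation mode simply restates the paragraph above.

```r
ok = NA   # stays NA when Rtesseract is not installed

if (requireNamespace("Rtesseract", quietly = TRUE)) {
    library(Rtesseract)

    # Create the R-level object without initializing the C++ engine.
    api = tesseract(init = FALSE)

    # Initialize explicitly; Init returns TRUE on success or raises an error.
    ok = Init(api, lang = "eng")

    # Initialization resets the page segmentation mode, so any
    # non-default mode should be set only after this point.
}
```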
We can set tesseract variables at any point in the lifetime of the
tesseract object, either in the call to tesseract, Init or SetVariables.
However, there are some variables that need to be explicitly set in
the call to Init or before.
There are many functions that operate on a TesseractBaseAPI
object that provide the interesting functionality for the package.
These are noted in the see also section and in other help pages,
e.g., GetDatapath, GetVariables, SetVariables.
We can load configuration files of variable settings,
set the image to be processed, the page segmentation
mode and the resolution specification of the image.
We can also query these.
A potentially very useful facility is to specify the
sub-region of the current image for tesseract to analyze
with SetRectangle. This allows us to zoom in
and redo the OCR within this constrained context.
When we have configured the tesseract object appropriately,
we call Recognize to perform the OCR.
We then collect the results in which we are interested.
We can extract the bounding boxes of the elements it recognizes. We can get the confidence associated with each match it made. We can obtain the alternatives or "next best" guesses the OCR had for each element.
We can switch the image to be processed or the page segmentation mode, query the file name of the current image, get the image itself, query the current variable settings for tesseract, and ask whether a particular word is considered valid.
We can also zoom in to a sub-region of the image and run the OCR again on this part.
We can query the current state and settings of the tesseract object.
The package also provides a way to plot a tesseract object, i.e., show the image and superimpose the bounding boxes of the recognized elements in the image. One can color the rectangles by their confidence/accuracy levels. In the future, we may develop an interactive plot to allow users to visually explore and compare the original image and the OCR text.
We can also treat the TesseractBaseAPI object as a virtual list
and then use lapply to iterate over its elements.
There are functions within the package that
can be used within the lapply call to extract the
individual elements of the results.
These are BoundingBox, GetAlternatives, ...
There are equivalent higher-level functions to access these elements
more efficiently, but with less flexibility.
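The two access styles described above might look as follows. This is a sketch under assumptions: it presumes the lapply method accepts the tesseract object followed by the function, as the text suggests, and it does not say which level of element (symbol, word, line) is visited by default. The requireNamespace guard and the names boxes and allBoxes are illustrative.

```r
boxes = NULL   # stays NULL when Rtesseract is not installed

if (requireNamespace("Rtesseract", quietly = TRUE)) {
    library(Rtesseract)
    f = system.file("images", "OCRSample2.png", package = "Rtesseract")

    api = tesseract(f)
    Recognize(api)

    # Element-by-element access: extract the bounding box of each
    # recognized element in turn.
    boxes = lapply(api, BoundingBox)

    # Higher-level access: collect all of the boxes in a single call.
    allBoxes = GetBoxes(api)
}
```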
tesseract(image = character(), pageSegMode = integer(), lang = "eng",
datapath = NA, configs = character(), vars = character(),
engineMode = OEM_DEFAULT, debugOnly = FALSE,
..., opts = list(...), init = TRUE)
Init(api, lang = "eng", configs = character(), vars =
character(), datapath = NA, engineMode = OEM_DEFAULT,
debugOnly = FALSE, force2 = TRUE)
Recognize(api)
image |
either a |
api |
the instance of the |
lang |
a string identifying the language(s) for the character recognition |
datapath |
the name of a directory that contains the tessdata/ directory. |
engineMode |
the mode for the OCR engine. The default is to use
tesseract. One can use a Cube method, or a Cube and Tesseract
combination. See the |
configs |
a named character vector of configuration arguments |
vars |
variables to set for controlling Tesseract |
debugOnly |
a logical value that controls whether in the Init()
call, only non-debugging variables in |
... |
|
opts |
a list (or vector) of named values that are the options
we can pass via |
init |
a logical value controlling whether |
pageSegMode |
the value for the page segmentation mode for the tesseract
instance. This must correspond to one of the values in
|
force2 |
a logical value to control whether to use the Init2 routine. Should never be needed. |
The tesseract function returns an object of class TesseractBaseAPI.
This is an S4 object that contains
an opaque reference to a C++ object. It should be used in other calls
expecting the tesseract instance.
For tesseract, if the file format of the image is not supported
by the installed leptonica library, an error of class
UnsupportedImageFormat is raised.
The error message indicates the supported image types, and the
error object also contains the name of the file in the filename
element.
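One way to use this error class is to skip unsupported files when processing a batch. The helper below is hypothetical (tryTesseract and its create argument are not part of the package); create stands in for tesseract so that the helper can be exercised without Rtesseract installed. It relies only on the class name and the filename element described above.

```r
# Hypothetical helper: attempt to create a tesseract object for `file`,
# turning an UnsupportedImageFormat error into a warning and a NULL result.
tryTesseract = function(file, create, ...) {
    tryCatch(
        create(file, ...),
        UnsupportedImageFormat = function(e) {
            # The error object carries the offending file name.
            warning("skipping ", e$filename, ": ", conditionMessage(e))
            NULL
        }
    )
}
```

With the package loaded, a call might look like lapply(files, tryTesseract, create = tesseract), where files is a character vector of image paths supplied by the user.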
Init returns a logical value (TRUE) indicating that the
initialization was successful, or raises
an error of class TesseractInitFailure.
This provides an opportunity to catch the error and use a different
initialization approach (e.g. different language, variables, engine
mode).
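The fallback idea can be sketched as a loop over alternative argument sets. The helper initWithFallback and its initFun parameter are hypothetical and not part of the package; initFun would be Init in real use, and is a parameter here only so the helper can be run without Rtesseract installed.

```r
# Hypothetical helper: try each set of Init arguments in turn and return
# the first set for which initialization succeeds.
initWithFallback = function(api, attempts, initFun) {
    for (args in attempts) {
        ok = tryCatch(
            do.call(initFun, c(list(api), args)),
            # Class of the error raised when initialization fails.
            TesseractInitFailure = function(e) FALSE
        )
        if (isTRUE(ok))
            return(args)   # report which settings worked
    }
    stop("no initialization attempt succeeded")
}
```

For example, initWithFallback(api, list(list(lang = "rus"), list(lang = "eng")), Init) would try Russian first and fall back to English if the Russian data cannot be loaded.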
Recognize returns TRUE if the call was successful.
Duncan Temple Lang
Tesseract https://code.google.com/p/tesseract-ocr/, specifically http://zdenop.github.io/tesseract-doc/classtesseract_1_1_tess_base_a_p_i.html
GetText, GetBoxes, GetConfidences, GetAlternatives,
SetVariables, PrintVariables,
SetInputName, SetImage, SetOutputName,
ReadConfigFile, GetVariables,
SetPageSegMode, GetPageSegMode,
SetRectangle,
SetSourceResolution, GetSourceYResolution
f = system.file("images", "OCRSample2.png", package = "Rtesseract")
api = tesseract(f)
GetInputName(api)
Recognize(api)
GetText(api)
bbox = GetBoxes(api)
conf = GetConfidences(api)
alts = GetAlternatives(api)
if(require("png")) {
    i = readPNG(f)
    plot(api, level = "symbol", img = i, border = "red")
}
## Not run:
# Don't run these in the interest of time. But they work fine.
if("rus" %in% getAvailableLanguages()) {
    f = system.file("images/RussianDoc.png", package = "Rtesseract")
    ans = GetText(tesseract(f, lang = "rus"))
    ans$text
}
if("san" %in% getAvailableLanguages()) {
    f = system.file("images/Sanscrit.png", package = "Rtesseract")
    ans = GetBoxes(tesseract(f, lang = "san"))
    ans$text
}
## End(Not run)