tesseract: Creating a Tesseract Object
In duncantl/Rtesseract: Interface to the tesseract OCR system

Creating a Tesseract Object

Description

The ocr function hides all of the details of the OCR mechanism. The tesseract function provides an entry point to much more control over the process and more detail about it, and lets us query the results in different ways. The tesseract function is the building block for all the functionality of the package.

To perform OCR on an image, we need a Tesseract API object whose methods we can then call. We create such an object with the tesseract function. We typically specify the image, the segmentation mode for identifying the elements in the page, and the language of the content. We can also set any of the more than 600 variables that control how tesseract operates on the image. (See PrintVariables.)
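As a minimal sketch of such a call, the following creates an object for the sample image used in the Examples section, giving the symbolic segmentation mode "psm_auto" and the English language data explicitly. The `settings` list is just an illustrative way of holding the values, and the block is guarded so that it does nothing when Rtesseract is not installed.

```r
# The values we want for this run; "psm_auto" is a symbolic name for a
# page segmentation mode, "eng" is the default language.
settings <- list(pageSegMode = "psm_auto", lang = "eng")

if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  f <- system.file("images", "OCRSample2.png", package = "Rtesseract")

  # Image, segmentation mode and language are all given up front.
  api <- tesseract(f, pageSegMode = settings$pageSegMode,
                   lang = settings$lang)

  Recognize(api)          # perform the OCR
  print(GetText(api))     # the recognized text
}
```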

We can create a tesseract object without specifying an image, the segmentation mode, etc. Instead, we can set this later when we use the tesseract object, and we can also update these values to reuse the same tesseract object on different images, with different segmentation modes, etc. Alternatively, we can create new separate tesseract objects that work on different images or in different segmentation modes.

The tesseract function, by default, performs the initialization of the internal C++ object. This can be deferred if necessary (with init = FALSE). Also, a tesseract object can be re-initialized.

One has to be careful when initializing a tesseract object because it will reset the page segmentation mode to a default value. So it is vital to set that after initialization. The tesseract function takes care of this.
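When calling Init directly, the order of operations matters. The helper below is a hypothetical sketch, not part of the package: it initializes first and restores the segmentation mode afterwards. It assumes the package exports a SetPageSegMode function that accepts the same values as the pageSegMode argument; that function is not described on this page, so check its help page before relying on it. The `init` and `set_mode` parameters exist only so the ordering can be exercised without the package.

```r
# Hypothetical helper: initialize, then restore the segmentation mode.
# Assumption: Rtesseract::SetPageSegMode(api, mode) exists and accepts the
# same values as the pageSegMode argument of tesseract().
init_keeping_mode <- function(api, mode, ...,
                              init = Rtesseract::Init,
                              set_mode = Rtesseract::SetPageSegMode) {
  ok <- init(api, ...)   # initialization resets the mode to its default
  set_mode(api, mode)    # so the mode we want is set afterwards
  invisible(ok)
}

# Usage (requires Rtesseract):
# api <- Rtesseract::tesseract(init = FALSE)   # defer initialization
# init_keeping_mode(api, "psm_auto", lang = "eng")
```

The default arguments are evaluated lazily, so the helper can be defined even when the package is not installed.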

We can set tesseract variables at any point in the lifetime of the tesseract object: in the call to tesseract, in the call to Init, or via SetVariables. However, some variables need to be set explicitly in the call to Init or before.
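A short sketch of setting a variable after the object has been created follows. The variable tessedit_char_whitelist is only an illustrative choice of a Tesseract variable, and the form of the GetVariables call (the object followed by a variable name) is an assumption; see the help page for GetVariables for its actual arguments.

```r
# Illustrative variable: restrict recognition to digits.
wanted <- c(tessedit_char_whitelist = "0123456789")

if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  f <- system.file("images", "OCRSample2.png", package = "Rtesseract")
  api <- tesseract(f)

  # Pass the settings as name = value pairs, as described for `...`.
  do.call(SetVariables, c(list(api), as.list(wanted)))

  # Assumed call form: query the variable by name.
  print(GetVariables(api, names(wanted)))
}
```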

There are many functions that operate on a TesseractBaseAPI object that provide the interesting functionality for the package. These are noted in the see also section and in other help pages, e.g., GetDatapath, GetVariables, SetVariables. We can load configuration files of variable settings, set the image to be processed, the page segmentation mode and the resolution specification of the image. We can also query these.

A potentially very useful facility is to specify the sub-region of the current image for tesseract to analyze with SetRectangle. This allows us to zoom in and redo the OCR within this constrained context.
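The sketch below shows one way this could look. It assumes that SetRectangle takes the left, top, width and height of the region in that order, mirroring the SetRectangle routine of the underlying C++ API; check the help page for SetRectangle before relying on this. The helper corners_to_rect is hypothetical, and the coordinates are arbitrary and may need adjusting to the size of the image.

```r
# Hypothetical helper: convert corner coordinates (pixels, origin at the
# top-left of the image) to the left/top/width/height form.
corners_to_rect <- function(left, top, right, bottom) {
  stopifnot(right > left, bottom > top)
  c(left = left, top = top, width = right - left, height = bottom - top)
}

rect <- corners_to_rect(left = 50, top = 100, right = 450, bottom = 300)

if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  f <- system.file("images", "OCRSample2.png", package = "Rtesseract")
  api <- tesseract(f)

  # Assumed argument order: left, top, width, height.
  SetRectangle(api, rect[["left"]], rect[["top"]],
               rect[["width"]], rect[["height"]])
  Recognize(api)
  print(GetText(api))   # text recognized within the sub-region
}
```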

When we have configured the tesseract object appropriately, we call Recognize to perform the OCR. We then collect the results in which we are interested.

We can extract the bounding boxes of the elements it recognizes. We can get the confidence associated with each match it made. We can obtain the alternatives or "next best" guesses the OCR had for each element.

We can switch the image to be processed or the page segmentation mode, query the file name of the current image, get the image itself, query the current variable settings for tesseract, and ask whether a particular word is considered valid.

We can also zoom in to a sub-region of the image and run the OCR again on this part.

We can query the current state and settings of the tesseract object.

The package also provides a way to plot a tesseract object, i.e., show the image and superimpose the bounding boxes of the recognized elements in the image. One can color the rectangles by their confidence/accuracy levels. In the future, we may develop an interactive plot to allow users to visually explore and compare the original image and the OCR text.

We can also treat the TesseractBaseAPI as a virtual list and then use lapply to iterate over its elements. There are functions within the package that can be used within the lapply to extract the individual elements of the results. These are BoundingBox, GetAlternatives, ... There are equivalent higher-level functions to access these elements more efficiently, but with less flexibility.
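The virtual-list idea can be sketched as follows. The example assumes that BoundingBox returns one numeric vector of coordinates per element, so that the results can be stacked into a matrix; the helper stack_boxes is hypothetical and only does that stacking.

```r
# Hypothetical helper: stack per-element results (one numeric vector each)
# into a matrix with one row per recognized element.
stack_boxes <- function(boxes) {
  if (length(boxes) == 0) return(NULL)
  do.call(rbind, boxes)
}

if (requireNamespace("Rtesseract", quietly = TRUE)) {
  library(Rtesseract)
  f <- system.file("images", "OCRSample2.png", package = "Rtesseract")
  api <- tesseract(f)
  Recognize(api)

  # Iterate over the recognized elements, extracting each bounding box.
  boxes <- lapply(api, BoundingBox)
  print(head(stack_boxes(boxes)))
}
```

GetBoxes, used in the Examples section, collects the bounding boxes in a single call.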

Usage

tesseract(image = character(), pageSegMode = integer(), lang = "eng",
          datapath = NA, configs = character(), vars = character(),
          engineMode = OEM_DEFAULT, debugOnly = FALSE,
          ..., opts = list(...), init = TRUE)


Init(api, lang = "eng", configs = character(), vars =
     character(), datapath = NA, engineMode = OEM_DEFAULT,
     debugOnly = FALSE, force2 = TRUE)

Recognize(api)

Arguments

`image`	either a `Pix-class` object, or the name of a file from which to read the image. Specifying a file name also arranges to call `SetInputName`, so the `TesseractBaseAPI-class` instance knows where the image is located and we can query it. If specifying the name of a file, make certain to assign the result to a variable that persists until `Recognize` is called for this `TesseractBaseAPI-class` instance. Garbage collection does not currently protect the image; we intend to ensure this in the future.
`api`	the instance of the `TesseractBaseAPI-class` in which to perform the operations.
`lang`	a string identifying the language(s) for the character recognition.
`datapath`	the name of a directory that contains the tessdata/ directory.
`engineMode`	the mode for the OCR engine. The default is to use tesseract. One can use a Cube method, or a Cube and Tesseract combination. See the `OcrEngineMode` enumerated constant vector.
`configs`	a named character vector of configuration arguments
`vars`	variables to set for controlling Tesseract
`debugOnly`	a logical value that controls whether only the non-debugging variables in `vars` are processed in the Init() call.
`...`	`name = value` pairs that are passed to `SetVariables` to configure the `TesseractBaseAPI-class` instance.
`opts`	a list (or vector) of named values that are the options we can pass via `...`.
`init`	a logical value controlling whether `Init` is called by the `tesseract` function.
`pageSegMode`	the value for the page segmentation mode for the tesseract instance. This must correspond to one of the values in `PageSegModeValues` or the corresponding R variables. Alternatively, one can use the symbolic names (lower or upper case) from this vector, e.g., `"psm_auto"`.
`force2`	a logical value to control whether to use the Init2 routine. Should never be needed.

Value

The tesseract function returns an object of class TesseractBaseAPI. This is an S4 object that contains an opaque reference to a C++ object. It should be used in other calls expecting the tesseract instance.

For tesseract, if the file format of the image is not supported by the installed leptonica library being used, an error of class UnsupportedImageFormat is raised. The error message indicates the supported image types, and the error object also contains the name of the file in the filename element.
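Because the error has its own class, it can be caught selectively with tryCatch. The wrapper below is a hypothetical sketch: it returns NULL with a message for an unsupported image and lets any other error propagate. The `create` parameter defaults to the package's tesseract function and exists only so the handler can be tried without the package.

```r
# Hypothetical wrapper: create a tesseract object, or return NULL (with a
# message) when the image format is not supported.
tesseract_or_null <- function(file, ..., create = Rtesseract::tesseract) {
  tryCatch(
    create(file, ...),
    UnsupportedImageFormat = function(e) {
      # The error object carries the file name in its filename element.
      message("Skipping ", e$filename, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Usage (requires Rtesseract):
# api <- tesseract_or_null("scan.png")
# if (!is.null(api)) Rtesseract::Recognize(api)
```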

Init returns a logical value indicating whether the initialization was successful (TRUE), or raises an error of class TesseractInitFailure. This provides an opportunity to catch the error and use a different initialization approach (e.g., a different language, variables, or engine mode).
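One way to use this is to try several languages in turn. The function below is a hypothetical sketch rather than part of the package: it catches TesseractInitFailure for each candidate and returns the first language for which Init succeeds. The `init` parameter defaults to the package's Init and exists only so the fallback logic can be tried without the package.

```r
# Hypothetical helper: try each language until Init() succeeds.
# Returns the language that worked, or NA if none did.
init_first_language <- function(api, langs, ..., init = Rtesseract::Init) {
  for (lang in langs) {
    ok <- tryCatch(
      init(api, lang = lang, ...),
      TesseractInitFailure = function(e) FALSE   # move on to the next one
    )
    if (isTRUE(ok)) return(lang)
  }
  NA_character_
}

# Usage (requires Rtesseract):
# api <- Rtesseract::tesseract(init = FALSE)
# init_first_language(api, c("rus", "eng"))
```

Since initialization resets the page segmentation mode, set the mode again after a successful call.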

Recognize returns TRUE if the call was successful.

Author(s)

Duncan Temple Lang

References

Tesseract https://code.google.com/p/tesseract-ocr/, specifically http://zdenop.github.io/tesseract-doc/classtesseract_1_1_tess_base_a_p_i.html

Examples

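 # Locate a sample image shipped with the package.
 f = system.file("images", "OCRSample2.png", package = "Rtesseract")
 api = tesseract(f)
 GetInputName(api)            # the file name the API object was given
 Recognize(api)               # perform the OCR
 GetText(api)                 # the recognized text
 bbox = GetBoxes(api)         # bounding boxes of the recognized elements
 conf = GetConfidences(api)   # confidence of each match
 alts = GetAlternatives(api)  # "next best" guesses for each element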

 if(require("png")) {
    i = readPNG(f)
    plot(api, level = "symbol", img = i, border = "red")
 }


## Not run: 
     # Don't run these in the interest of time. But they work fine.
if("rus" %in% getAvailableLanguages()) {
   f = system.file("images/RussianDoc.png", package = "Rtesseract")
   ans = GetText(tesseract(f, lang = "rus"))
   ans$text
}

if("san" %in% getAvailableLanguages()) {
   f = system.file("images/Sanscrit.png", package = "Rtesseract")
   ans = GetBoxes(tesseract(f, lang = "san"))
   ans$text
}

## End(Not run)

duncantl/Rtesseract documentation built on Sept. 8, 2024, 8:38 a.m.