Rtesseract: Interface to the tesseract OCR system

ocr	R Documentation

Perform OCR on an image

Description

This is the primary high-level function for performing OCR on a document. It reads the image, identifies the elements (symbols/characters, words, lines, etc.) as identified by the level argument, and then "recognizes" them. It returns the elements and the score/"confidence" in the recognition of each. The function can also return the elements and possible alternatives along with their confidence scores.

If the image format is not supported by the leptonica library being used, an error of class UnsupportedImageFormat is raised. The error message indicates the supported image types, and the error object also contains the name of the file in the filename element.

Usage

ocr(img, level = PageIteratorLevel["word"],
    pageSegMode = "psm_single_block", alternatives = FALSE,
    boundingBox = FALSE,
    opts = sapply(list(...), as, "character"), ...)

Arguments

`img`	the file name of the image to be processed
`alternatives`	a logical value that controls whether the possible alternatives for each recognized element are returned. This allows us to compare the "confidence" we have in the predicted text relative to the other possibilities.
`pageSegMode`	a value indicating how to do the segmentation of the elements on the page. This allows us to treat the content as a single line, word, a vertical or horizontal block, or a general collection of elements. Different values lead to identifying different elements such as lines and rectangles rather than just words, symbols, and lines of text. The value should be consistent with the `PageSegMode-class` enumerated type.
`opts`	a character vector of `name = value` pairs specifying tesseract variables and values.
`...`	`name = value` pairs for specifying tesseract options individually rather than in a character vector via `opts`.
`level`	an integer value specifying what level/unit to return the results for, e.g., paragraph, word, character. This can also be a character string giving the name of an element in `PageIteratorLevel`
`boundingBox`	a logical value indicating whether the bounding boxes of the recognized items are returned. Currently, only one of `alternatives` or `boundingBox` can be `TRUE` (and both can be `FALSE`).

Value

If alternatives is FALSE (the default), this returns a numeric vector whose names are the elements (characters, words, lines, paragraphs, blocks) that were recognized and the values are the associated confidence levels.

If alternatives is TRUE, a list is returned. The names of the elements of the list are the recognized text. Each element is a named numeric vector. The names are the different possible values of the recognized element; the numeric value is the confidence associated with possible choice.

Author(s)

Duncan Temple Lang

References

The Tesseract project https://code.google.com/p/tesseract-ocr/

Examples

# 
 f = system.file("images", "OCRSample.tiff", package = "Rtesseract")
 o = ocr(f, "word")
 names(o)
 summary(o)  # confidence levels

 ans = ocr(f, "word", boundingBox = TRUE)

 f = system.file("images", "IMG_1234.png", package = "Rtesseract")
 o = ocr(f, "symbol", opts = c("tessedit_char_whitelist" = "0123456789."))

 ans = ocr(f, "word", alternatives = TRUE, "tessedit_char_whitelist" = "0123456789.")

 ans = ocr(f, "word", boundingBox = TRUE, "tessedit_char_whitelist" = "0123456789.")

duncantl/Rtesseract documentation built on Sept. 8, 2024, 8:38 a.m.