ocr | R Documentation |
This is the primary high-level function for performing OCR on a
document.
It reads the image, identifies the elements (symbols/characters, words,
lines, etc.) as identified by the level
argument, and
then "recognizes" them. It returns the elements
and the score/"confidence" in the recognition of each.
The function can also return the elements and possible alternatives
along with their confidence scores.
If the image format is not supported by the leptonica library being
used, an error of class UnsupportedImageFormat
is raised.
The error message indicates the supported image types, and the
error object also contains the name of the file in the filename
element.
ocr(img, level = PageIteratorLevel["word"],
pageSegMode = "psm_single_block", alternatives = FALSE,
boundingBox = FALSE,
opts = sapply(list(...), as, "character"), ...)
img |
the file name of the image to be processed |
alternatives |
a logical value that controls whether the possible alternatives for each recognized element are returned. This allows us to compare the "confidence" we have in the predicted text relative to the other possibilities. |
pageSegMode |
a value indicating how to do the segmentation of the
elements on the page. This allows us to treat the content as a single
line, word, a vertical or horizontal block, or a general collection of
elements. Different values lead to identifying different elements such
as lines and rectangles rather than just words, symbols, and lines of
text.
The value should be consistent with the |
opts |
a character vector of |
... |
|
level |
an integer value specifying what level/unit to return the results for, e.g.,
paragraph, word, character. This can also be a character string
giving the name of an element in |
boundingBox |
a logical value indicating whether the bounding
boxes of the recognized items are returned. Currently, only one of
|
If alternatives
is FALSE
(the default),
this returns a numeric vector whose names are the elements
(characters, words, lines, paragraphs, blocks) that were recognized
and the values are the associated confidence levels.
If alternatives
is TRUE
, a list
is returned.
The names of the elements of the list are the recognized text.
Each element is a named numeric vector.
The names are the different possible values of the recognized element;
the numeric value is the confidence associated with possible choice.
Duncan Temple Lang
The Tesseract project https://code.google.com/p/tesseract-ocr/
#
f = system.file("images", "OCRSample.tiff", package = "Rtesseract")
o = ocr(f, "word")
names(o)
summary(o) # confidence levels
ans = ocr(f, "word", boundingBox = TRUE)
f = system.file("images", "IMG_1234.png", package = "Rtesseract")
o = ocr(f, "symbol", opts = c("tessedit_char_whitelist" = "0123456789."))
ans = ocr(f, "word", alternatives = TRUE, "tessedit_char_whitelist" = "0123456789.")
ans = ocr(f, "word", boundingBox = TRUE, "tessedit_char_whitelist" = "0123456789.")
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.