Using the Tesseract OCR engine in R

library(tibble)
#knitr::opts_chunk$set(comment = "")
has_nld <- "nld" %in% tesseract::tesseract_info()$available
if(identical(Sys.info()[['user']], 'jeroen')) stopifnot(has_nld)

The tesseract package provides R bindings Tesseract: a powerful optical character recognition (OCR) engine that supports over 100 languages. The engine is highly configurable in order to tune the detection algorithms and obtain the best possible results.

Keep in mind that OCR (pattern recognition in general) is a very difficult problem for computers. Results will rarely be perfect and the accuracy rapidly decreases with the quality of the input image. But if you can get your input images to reasonable quality, Tesseract can often help to extract most of the text from the image.

Extract Text from Images

OCR is the process of finding and recognizing text inside images, for example from a screenshot, scanned paper. The image below has some example text:

test{data-external=1}

library(tesseract)
eng <- tesseract("eng")
text <- tesseract::ocr("http://jeroen.github.io/images/testocr.png", engine = eng)
cat(text)

Not bad! The ocr_data() function returns all words in the image along with a bounding box and confidence rate.

results <- tesseract::ocr_data("http://jeroen.github.io/images/testocr.png", engine = eng)
results

Language Data

The tesseract OCR engine uses language-specific training data in the recognize words. The OCR algorithms bias towards words and sentences that frequently appear together in a given language, just like the human brain does. Therefore the most accurate results will be obtained when using training data in the correct language.

Use tesseract_info() to list the languages that you currently have installed.

tesseract_info()

By default the R package only includes English training data. Windows and Mac users can install additional training data using tesseract_download(). Let's OCR a screenshot from Wikipedia in Dutch (Nederlands)

utrecht

# Only need to do download once:
tesseract_download("nld")
# Now load the dictionary
(dutch <- tesseract("nld"))
text <- ocr("https://jeroen.github.io/images/utrecht2.png", engine = dutch)
cat(text)

As you can see immediately: almost perfect! (OK just take my word).

Preprocessing with Magick

The accuracy of the OCR process depends on the quality of the input image. You can often improve results by properly scaling the image, removing noise and artifacts or cropping the area where the text exists. See tesseract wiki: improve quality for important tips to improve the quality of your input image.

The awesome magick R package has many useful functions that can be use for enhancing the quality of the image. Some things to try:

Below is an example OCR scan from an online AI course. The code converts it to black-and-white and resizes + crops the image before feeding it to tesseract to get more accurate OCR results.

bowers{data-external=1}

library(magick)
input <- image_read("https://jeroen.github.io/images/bowers.jpg")

text <- input %>%
  image_resize("2000x") %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() 

cat(text)

Read from PDF files

If your images are stored in PDF files they first need to be converted to a proper image format. We can do this in R using the pdf_convert function from the pdftools package. Use a high DPI to keep quality of the image.

pngfile <- pdftools::pdf_convert('https://jeroen.github.io/images/ocrscan.pdf', dpi = 600)
text <- tesseract::ocr(pngfile)
cat(text)

Tesseract Control Parameters

Tesseract supports hundreds of "control parameters" which alter the OCR engine. Use tesseract_params() to list all parameters with their default value and a brief description. It also has a handy filter argument to quickly find parameters that match a particular string.

# List all parameters with *colour* in name or description
tesseract_params('colour')

Do note that some of the control parameters have changed between Tesseract engine 3 and 4.

tesseract::tesseract_info()['version']

Whitelist / Blacklist characters

One powerful parameter is tessedit_char_whitelist which restricts the output to a limited set of characters. This may be useful for reading for example numbers such as a bank account, zip code, or gas meter.

The whitelist parameter works for all versions of Tesseract engine 3 and also engine versions 4.1 and higher, but unfortunately it did not work in Tesseract 4.0.

receipt{data-external=1}

numbers <- tesseract(options = list(tessedit_char_whitelist = "$.0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers))

To test if this actually works, look what happens if we remove the $ from tessedit_char_whitelist:

# Do not allow any dollar sign 
numbers2 <- tesseract(options = list(tessedit_char_whitelist = ".0123456789"))
cat(ocr("https://jeroen.github.io/images/receipt.png", engine = numbers2))


Try the tesseract package in your browser

Any scripts or data that you put into this service are public.

tesseract documentation built on May 30, 2022, 1:05 a.m.