Optical Character Recognition with imagerExtra

knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache = FALSE, 
               comment = NA, verbose = TRUE, fig.width = 5, fig.height = 5, dev = 'jpeg', dev.args=list(quality=50))
is_available_tesseract <- requireNamespace("tesseract", quietly = TRUE)

To do OCR with imagerExtra, you need the R package tesseract, which provides bindings to a powerful optical character recognition (OCR) engine.

If you haven't installed tesseract yet, see its installation guide.
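A quick way to confirm that the package is available before running the examples is sketched below. It assumes tesseract is installed from CRAN with install.packages, and that any system-level Tesseract library your platform needs is already present (see the installation guide). The variable name has_tesseract is only illustrative.

```r
# Check whether the tesseract R package can be loaded, without attaching it.
has_tesseract <- requireNamespace("tesseract", quietly = TRUE)

if (!has_tesseract) {
  # Assumption: the package is installed from CRAN. On some platforms the
  # system Tesseract library has to be installed first.
  message("tesseract is not installed; try install.packages(\"tesseract\")")
} else {
  # List the languages for which trained data is available locally.
  print(tesseract::tesseract_info()$available)
}
```

If "eng" is missing from the printed languages, English text such as the example below is unlikely to be recognized.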

The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.

It performs poorly on degraded images, as shown below.

library(imagerExtra)
plot(papers, main = "Original")
OCR(papers) %>% print
OCR_data(papers) %>% print

The OCR and OCR_data functions are wrappers for the ocr and ocr_data functions of tesseract.
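To make the wrapper relationship concrete, the sketch below shows one way a cimg image could be handed to tesseract directly. It is not imagerExtra's actual implementation. The helper name ocr_cimg is hypothetical, and the sketch assumes the magick package is installed so that imager::cimg2magick can convert the image into a form that tesseract::ocr accepts.

```r
# Hypothetical helper (not part of imagerExtra): run tesseract on a cimg
# image by converting it to a magick image first.
ocr_cimg <- function(im, engine = tesseract::tesseract("eng")) {
  stopifnot(imager::is.cimg(im))
  # Assumption: the magick package is installed; cimg2magick depends on it.
  im_magick <- imager::cimg2magick(im)
  tesseract::ocr(im_magick, engine = engine)
}

# Run the helper only when every package it relies on is available.
pkgs <- c("imagerExtra", "imager", "magick", "tesseract")
if (all(vapply(pkgs, requireNamespace, logical(1), quietly = TRUE))) {
  print(ocr_cimg(imagerExtra::papers))
}
```

In practice, calling OCR and OCR_data is simpler, since they take the cimg image as is.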

For the papers image, neither OCR nor OCR_data recognized the text "Hello".

We need to clean the image before passing it to the OCR function.

hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")
OCR(hello) %>% print
OCR_data(hello) %>% print

This time, the text "Hello" is recognized.
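The cleaning pipeline can be unpacked into its two steps to inspect the intermediate image. The sketch below uses the same calls and values as the example. The descriptions of the numeric arguments in the comments are assumptions based on the function names and should be checked against the imagerExtra help pages, and the variable names denoised and binarized are only illustrative.

```r
if (requireNamespace("imagerExtra", quietly = TRUE)) {
  library(imagerExtra)

  # Step 1: DCT-based denoising. The second argument (0.01) is assumed to be
  # the noise level to remove.
  denoised <- DenoiseDCT(papers, 0.01)

  # Step 2: local adaptive thresholding of the denoised image. The second
  # argument (0.1) is assumed to control the threshold sensitivity, and
  # range gives the pixel value range of the input.
  binarized <- ThresholdAdaptive(denoised, 0.1, range = c(0, 1))

  # Compare the three stages side by side.
  layout(matrix(1:3, nrow = 1))
  plot(papers, main = "Original")
  plot(denoised, main = "Denoised")
  plot(binarized, main = "Thresholded")
  layout(1)
}
```

Viewing the stages side by side makes it easier to tune the two numeric arguments for other degraded images.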

Using tesseract in combination with imagerExtra enables us to extract text from degraded images.




imagerExtra documentation built on May 2, 2019, 1:44 p.m.