knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache = FALSE,
                      comment = NA, verbose = TRUE,
                      fig.width = 5, fig.height = 5,
                      dev = 'jpeg', dev.args = list(quality = 50))
is_available_tesseract <- requireNamespace("tesseract", quietly = TRUE)
To do OCR with imagerExtra, you need the R package tesseract, which provides bindings to a powerful optical character recognition (OCR) engine.
See the tesseract installation guide if you haven't installed it yet.
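Before running the examples, it can help to confirm that tesseract is installed and to see which language data it can find. The following is a minimal sketch, not part of imagerExtra; the variable name tesseract_ready is only illustrative.

```r
# Check whether the tesseract package can be loaded, without attaching it.
tesseract_ready <- requireNamespace("tesseract", quietly = TRUE)

if (tesseract_ready) {
  # tesseract_info() reports the engine version and the installed language data.
  info <- tesseract::tesseract_info()
  print(info$version)
  print(info$available)
} else {
  message("tesseract is not installed; try install.packages(\"tesseract\")")
}
```

If the language you need is missing from the available list, the tesseract installation guide describes how to add it.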
The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text.
It performs poorly on degraded images, as shown below.
library(imagerExtra)
plot(papers, main = "Original")
OCR(papers) %>% print
OCR_data(papers) %>% print
The OCR and OCR_data functions are wrappers for the ocr and ocr_data functions of tesseract.
We can see that OCR and OCR_data failed to recognize the text "Hello".
We need to clean the image before passing it to OCR.
hello <- DenoiseDCT(papers, 0.01) %>%
  ThresholdAdaptive(., 0.1, range = c(0, 1))
plot(hello, main = "Hello")
OCR(hello) %>% print
OCR_data(hello) %>% print
We can see that the text "Hello" was recognized after cleaning.
Using tesseract in combination with imagerExtra enables us to extract text from degraded images.
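The denoise, threshold, and recognize steps can be collected into one function. This is a minimal sketch rather than anything imagerExtra provides: the name ocr_degraded and its arguments sdn and k are hypothetical, and the defaults simply reuse the values from the example. It assumes a grayscale image such as papers.

```r
# Hypothetical helper (not part of imagerExtra): denoise, binarize, then OCR.
# sdn and k default to the values used in the example; other images will
# likely need different values.
ocr_degraded <- function(im, sdn = 0.01, k = 0.1) {
  denoised <- DenoiseDCT(im, sdn)                               # reduce noise
  binarized <- ThresholdAdaptive(denoised, k, range = c(0, 1))  # local threshold
  OCR(binarized)                                                # recognized text
}

# Run only when both packages are installed.
if (requireNamespace("imagerExtra", quietly = TRUE) &&
    requireNamespace("tesseract", quietly = TRUE)) {
  library(imagerExtra)
  print(ocr_degraded(papers))
}
```

Keeping the two parameters as arguments makes it easy to try several settings on the same image, since the right amount of denoising and thresholding depends on how the image is degraded.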