The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML-format output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata describing how words are organized into lines, paragraphs, content areas and pages. Read more about the hOCR specification here.
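To make the "one word per row" idea concrete, here is a minimal base-R sketch that pulls word-level fields out of a hand-written hOCR fragment. It does not show how hocr_parse is implemented: the fragment, the grab helper and the column names are all assumptions made for illustration, and a real parser would use an XML library rather than regular expressions.

```r
# A hand-written hOCR fragment (illustrative only, not real OCR output).
hocr_snippet <- c(
  "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Peanut</span>",
  "<span class='ocrx_word' id='word_1_2' title='bbox 104 92 170 116; x_wconf 91'>Butter</span>"
)

# Hypothetical helper: return the first capture group of `pattern` for each line.
grab <- function(pattern, x) sub(pattern, "\\1", x)

# One row per word: id, text, bounding box string and confidence score.
words <- data.frame(
  id   = grab(".*id='([^']+)'.*", hocr_snippet),
  word = grab(".*>([^<]+)</span>.*", hocr_snippet),
  bbox = grab(".*bbox ([0-9 ]+);.*", hocr_snippet),
  conf = as.integer(grab(".*x_wconf ([0-9]+).*", hocr_snippet)),
  stringsAsFactors = FALSE
)

words
```

Each row keeps the bounding box as a single string, which is convenient to split into coordinates later.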
One of the key elements of the hOCR format is the "bounding box": a rectangular region of the image covering the extent of a word recognized by tesseract. This bbox can be used to extract the corresponding part of the image with, for example, the magick package, by way of the bbox_to_geometry helper function.
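The conversion from a bounding box to a crop geometry is simple arithmetic, sketched below in base R. This is a hypothetical stand-in, not the package's bbox_to_geometry: the function name, and the assumption that a bbox arrives as a string of four integers "x1 y1 x2 y2" with the origin at the upper-left corner, are mine. The result follows the "WIDTHxHEIGHT+X+Y" geometry syntax that magick::image_crop accepts.

```r
# Hypothetical helper (not the package's bbox_to_geometry()).
# Assumes `bbox` is a single string of four integers: "x1 y1 x2 y2".
bbox_to_geom_sketch <- function(bbox) {
  v <- as.integer(strsplit(trimws(bbox), "[ ,]+")[[1]])
  stopifnot(length(v) == 4L, !anyNA(v))
  # width = x2 - x1, height = y2 - y1, offset = upper-left corner (x1, y1)
  sprintf("%dx%d+%d+%d", v[3] - v[1], v[4] - v[2], v[1], v[2])
}

geom <- bbox_to_geom_sketch("36 92 96 116")
geom
```

A string like this could then be passed as the geometry argument of magick::image_crop to cut the word out of the page image.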
hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000, only the tesseract output format is supported; support for OCRopus will be added in the future.
You can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")
This is a basic example which shows you how to solve a common problem:
library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
# devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots
We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using the magick package (see the image preparation script on GitHub).
cupcakes <- system.file("extdata", "peanutbutter.png", package = "hocr")
recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>%
  hocr::hocr_parse() %>%
  hocr::tidy_tesseract()
recipe
Now that the data is in tidy format, let's render the page in ggplot and draw bounding boxes around words and paragraphs to illustrate the benefits of a parsed document structure. tesseract outputs bboxes in a coordinate system whose origin is the upper-left corner of the image. We will transform all y-values to a bottom-left origin and plot the bounding boxes alongside the original picture, colored by tesseract confidence score.
p1 <- recipe %>%
  # split the word bbox string into four numeric columns
  mutate(ocrx_word_bbox = lapply(ocrx_word_bbox, function(x)
    separate(as_tibble(x), value,
             into = c("word_x1", "word_y1", "word_x2", "word_y2"),
             convert = TRUE))) %>%
  unnest(ocrx_word_bbox) %>%
  # split the page bbox string the same way
  mutate(ocr_page_bbox = lapply(ocr_page_bbox, function(x)
    separate(as_tibble(x), value,
             into = c("page_x1", "page_y1", "page_x2", "page_y2"),
             convert = TRUE))) %>%
  unnest(ocr_page_bbox) %>%
  # flip y-values from an upper-left to a bottom-left origin
  mutate(word_y1 = max(page_y2) - word_y1,
         word_y2 = max(page_y2) - word_y2) %>%
  ggplot(aes(xmin = word_x1, ymin = word_y1, xmax = word_x2, ymax = word_y2)) +
  geom_rect(aes(color = ocr_par_id, fill = ocrx_word_conf), show.legend = TRUE) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.text = element_text(size = 7),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 7))

# original page image, shown next to the plot
library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate = TRUE)

p1 + p2
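The y-axis flip in the pipeline above can be checked in isolation on a toy example. The sketch below uses base R only; the page height, the two word boxes and the column names are made up for illustration and do not come from the recipe data.

```r
# Toy data: two word boxes on a page 100 px tall (illustrative values).
page_height <- 100
words <- data.frame(
  word = c("top", "bottom"),
  y1 = c(10, 80),  # upper edge, measured downwards from the top of the image
  y2 = c(20, 90)   # lower edge, measured downwards from the top of the image
)

# Flip to a bottom-left origin: distance from the bottom = page height - y.
words$y1_flipped <- page_height - words$y1
words$y2_flipped <- page_height - words$y2

words
```

After the flip, the word nearest the top of the page has the largest y-values, which is what a plot with a bottom-left origin expects. Note that the flipped y1 is now greater than the flipped y2; geom_rect draws the same rectangle regardless of which of ymin and ymax is larger.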
Similar projects are listed here
[1] Rosenberg, L. M. (1986). Muffins & cupcakes. American Cooking Guild, Gaithersburg, MD. Open Library edition OL1484439M. Accessed from https://archive.org/details/muffinscupcakes00rose on 28 July 2018.