In dmi3kno/hocr: Text-to-tibble

hocr

The goal of hocr is to facilitate post-OCR data processing and wrangling. The package exposes an hOCR parser, hocr_parse, which converts XHTML-formatted hOCR output into a tidy tibble with one word per row. In addition to the columns exported by tesseract::ocr_data, hocr outputs metadata describing how words are organized into lines, paragraphs, content areas and pages. Read more about the hOCR specification here.
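To give a feel for the metadata involved, here is a minimal base-R sketch of reading the title attribute that tesseract attaches to each word element in its hOCR output (for example `bbox 36 92 96 116; x_wconf 95`). The function name parse_hocr_title is hypothetical and is not part of hocr; hocr_parse does this work for whole documents, and its internals may differ.

```r
# Hypothetical helper (not part of hocr): split an hOCR title attribute
# into named properties. Properties are separated by ";" and each one is
# a key followed by space-separated values.
parse_hocr_title <- function(title) {
  props <- trimws(strsplit(title, ";", fixed = TRUE)[[1]])
  keys <- sub(" .*$", "", props)      # text before the first space
  vals <- sub("^[^ ]+ +", "", props)  # text after the first space
  setNames(as.list(vals), keys)
}

parsed <- parse_hocr_title("bbox 36 92 96 116; x_wconf 95")

# bbox is "x1 y1 x2 y2" in pixels; x_wconf is the word confidence
bbox <- as.integer(strsplit(parsed$bbox, " +")[[1]])
conf <- as.numeric(parsed$x_wconf)
```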

One of the key elements of the hOCR format is the "bounding box" (bbox): a rectangular region of the image covering the extent of a word recognized by tesseract. A bbox can be used to extract the corresponding part of the image, for example with the magick package and the bbox_to_geometry helper function.
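The idea behind such a conversion can be sketched in base R. The function below is a hypothetical stand-in written for illustration, not the package's bbox_to_geometry, and it assumes the bbox arrives as a space-separated string "x1 y1 x2 y2". It produces an ImageMagick-style geometry string of the form WIDTHxHEIGHT+X+Y.

```r
# Hypothetical illustration (not hocr's own bbox_to_geometry):
# convert "x1 y1 x2 y2" into an ImageMagick geometry "WIDTHxHEIGHT+X+Y".
bbox_to_geometry_sketch <- function(bbox) {
  xy <- as.integer(strsplit(trimws(bbox), " +")[[1]])
  stopifnot(length(xy) == 4, !anyNA(xy))
  # width and height are differences of corners; the offset is the top-left corner
  sprintf("%dx%d+%d+%d", xy[3] - xy[1], xy[4] - xy[2], xy[1], xy[2])
}

geometry <- bbox_to_geometry_sketch("36 92 96 116")
```

A string of this shape is what magick::image_crop() accepts as its geometry argument, so the word could then be cut out of the page image.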

hocr also includes tidiers for common hOCR-capable systems. As of version 0.0.9000 only the tesseract output format is supported; support for OCRopus will be added in the future.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("dmi3kno/hocr")

Example

This is a basic example which shows you how to solve a common problem:

library(hocr)
library(tesseract) # OCR
library(tidyverse) # data wrangling and viz
#devtools::install_github("thomasp85/patchwork")
library(patchwork) # arranging plots

We will OCR a page from an old cookbook retrieved from archive.org[1] and enhanced using the magick package (see the image preparation script on GitHub).

cupcakes <- system.file("extdata", "peanutbutter.png", package="hocr")


recipe <- tesseract::ocr(cupcakes, HOCR = TRUE) %>% 
  hocr::hocr_parse() %>% 
  hocr::tidy_tesseract()
recipe
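Because every row carries both a paragraph identifier and a confidence score, simple per-paragraph summaries become possible. The sketch below uses a made-up toy data frame rather than the real recipe output; only the column names ocr_par_id and ocrx_word_conf are taken from the plotting code in this README, while the values and the threshold of 80 are invented for illustration.

```r
# Toy stand-in for the parsed output; all values are made up.
toy <- data.frame(
  ocr_par_id     = c("par_1_1", "par_1_1", "par_1_2", "par_1_2"),
  ocrx_word_conf = c(96, 90, 60, 40),
  stringsAsFactors = FALSE
)

# Mean word confidence per paragraph
par_conf <- aggregate(ocrx_word_conf ~ ocr_par_id, data = toy, FUN = mean)

# Paragraphs whose mean confidence falls below an arbitrary threshold
low_conf <- par_conf$ocr_par_id[par_conf$ocrx_word_conf < 80]
```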

Now that the data is in tidy format, let's render the page in ggplot and draw bounding boxes around words and paragraphs to illustrate the benefits of the parsed document structure. tesseract reports bboxes in a coordinate system with its origin in the upper-left corner. We will transform all y-values to a bottom-left origin and plot the bounding boxes alongside the original picture, with fill showing the tesseract confidence score and outline color showing the paragraph.

p1 <- recipe %>%
  # split each word bbox into four numeric columns
  mutate(ocrx_word_bbox = lapply(ocrx_word_bbox, function(x)
    separate(as_tibble(x), value, into = c("word_x1", "word_y1", "word_x2", "word_y2"), convert = TRUE))) %>%
  unnest(ocrx_word_bbox) %>%
  # do the same for the page bbox
  mutate(ocr_page_bbox = lapply(ocr_page_bbox, function(x)
    separate(as_tibble(x), value, into = c("page_x1", "page_y1", "page_x2", "page_y2"), convert = TRUE))) %>%
  unnest(ocr_page_bbox) %>%
  # flip y-values from an upper-left to a bottom-left origin
  mutate(word_y1 = max(page_y2) - word_y1,
         word_y2 = max(page_y2) - word_y2) %>%
  ggplot(aes(xmin = word_x1, ymin = word_y1, xmax = word_x2, ymax = word_y2)) +
  # outline color marks the paragraph, fill marks the word confidence
  geom_rect(aes(color = ocr_par_id, fill = ocrx_word_conf), show.legend = TRUE) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        axis.text = element_text(size = 7),
        legend.text = element_text(size = 7),
        legend.title = element_text(size = 7))

library(png)
library(grid)
img <- readPNG(cupcakes)
p2 <- rasterGrob(img, interpolate=TRUE)

p1+p2
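The y-axis flip used in the plotting code above can be checked in isolation. This is a small sketch with made-up numbers (a page 1000 pixels tall and one word box); flip_y is a hypothetical helper name, not a function from hocr.

```r
# Hypothetical helper: move a y-coordinate from an upper-left origin
# (y grows downwards) to a bottom-left origin (y grows upwards).
flip_y <- function(y, page_height) page_height - y

page_height <- 1000  # made-up page height in pixels
word_y1 <- 92        # top edge of the word, measured from the top
word_y2 <- 116       # bottom edge of the word, measured from the top

flipped <- flip_y(c(word_y1, word_y2), page_height)
# the top edge is now the larger value, as expected with a bottom-left origin
```

After the flip the former top edge has the larger y-value, so ymin and ymax effectively swap roles; geom_rect draws the same rectangle either way.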

Similar projects are listed here

[1] Rosenberg, L. M. (1986). Muffins & cupcakes. American Cooking Guild, Gaithersburg, MD. Open Library edition OL1484439M. Accessed from https://archive.org/details/muffinscupcakes00rose on 28 July 2018.

dmi3kno/hocr documentation built on April 27, 2020, 10:39 a.m.
