scan_with_hocr: doing a Tesseract scan with HOCR output

scan_with_hocr    R Documentation

doing a Tesseract scan with HOCR output

Description

A Tesseract scan with HOCR output returns an XHTML document that contains not only each scanned word, but also the line on which the word was found and the word's bounding box. The function scan_with_hocr performs the scan and converts the document to a data frame. See Details and Acknowledgment.
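To illustrate the kind of information carried by HOCR output, the sketch below pulls the bounding box and the confidence out of the title attribute of a single word element. The title string is a made-up example in the layout Tesseract normally uses (bbox followed by x_wconf). The code uses base R only and is an assumption about one possible approach, not necessarily how scan_with_hocr parses the document.

```r
# Made-up title attribute of one HOCR word element (assumed layout)
title <- "bbox 19 227 1087 251; x_wconf 95"

# Keep only the four numbers that follow the keyword "bbox"
bbox <- sub("^.*bbox ([0-9 ]+).*$", "\\1", title)

# Keep only the number that follows the keyword "x_wconf"
conf <- as.numeric(sub("^.*x_wconf ([0-9.]+).*$", "\\1", title))

print(bbox)
print(conf)
```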

Usage

scan_with_hocr(
  img,
  confsel = F,
  extract_bbox = T,
  add_header_cols = F,
  engine = tesseract::tesseract("eng")
)

Arguments

img

An image object or a character string with the name of an image file

confsel

A Boolean indicating whether the confidence rate of each word should also be returned

extract_bbox

A Boolean indicating if the bounding box should be unpacked (into x and y coordinates)

add_header_cols

A Boolean indicating whether two header columns (header_col and header_col_seq) should be added to the result, initialized to 0 and 1 respectively. Useful when extract_table() is used later on.

engine

The OCR engine to use. See tesseract::tesseract()

Value

A data.frame with the scanned words. See Details.

Details

The result is a data.frame with one row for each word found and the following columns:

  • line : the line on which the word was found

  • fldnr: the sequence number of the word on this line

  • word : the word that is recognized by the engine

  • bbox : the bounding box of the word, as a character string such as '19 227 1087 251', which indicates x-coordinates x1=19 and x2=1087 and y-coordinates y1=227 and y2=251. Not present when extract_bbox=T is set: in that case the columns x1, x2, y1 and y2 are present instead.

  • conf : the confidence rate of the word (only when confsel=T is set)

  • header_col : column with 0-s (only when add_header_cols=T is set)

  • header_col_seq : column with 1-s (only when add_header_cols=T is set)
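The relation between the bbox string and the coordinate columns can be sketched in base R. The data frame below is a hypothetical stand-in for a result obtained with extract_bbox=F; the words and coordinates are invented, and the column order produced by scan_with_hocr itself may differ.

```r
# Hypothetical stand-in for a scan result with extract_bbox = FALSE
words <- data.frame(
  line  = c(1L, 1L, 2L),
  fldnr = c(1L, 2L, 1L),
  word  = c("Hello", "world", "again"),
  bbox  = c("19 227 87 251", "95 227 180 251", "19 260 110 284"),
  stringsAsFactors = FALSE
)

# Split each bbox string on blanks and convert the four parts to integers;
# the order inside the string is x1, y1, x2, y2
parts <- do.call(rbind, lapply(strsplit(words$bbox, " +"), as.integer))
colnames(parts) <- c("x1", "y1", "x2", "y2")

# Replace the bbox column by the four coordinate columns
unpacked <- cbind(words[, c("line", "fldnr", "word")], as.data.frame(parts))
print(unpacked)
```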

Acknowledgment

This function is an extension of a snippet by Jeroen Ooms. I only added the extraction of the line information. Afterwards I turned it into a function that can be combined with extract_table().

See Also

scanner_functions, cleanup_bw() and extract_table()

Examples

## Not run: 
# img2: an image object or the name of an image file (see Arguments)
df1 <- scan_with_hocr(img2, add_header_cols = FALSE)

## End(Not run)
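A result with the line, fldnr, word and conf columns can be post-processed with base R, for instance to drop low-confidence words and rebuild the text of each line. The sketch below uses an invented data frame in place of a real scan with confsel=T. The threshold min_conf and the 0 to 100 confidence scale are assumptions made for this illustration.

```r
# Hypothetical stand-in for a scan result with confsel = TRUE
words <- data.frame(
  line  = c(1L, 1L, 2L, 2L),
  fldnr = c(1L, 2L, 1L, 2L),
  word  = c("Total", "amount", "12.50", "~"),
  conf  = c(96.1, 93.4, 88.0, 21.7),
  stringsAsFactors = FALSE
)

# Arbitrary threshold chosen for this illustration
min_conf <- 50

# Keep the confident words and sort them in reading order
kept <- words[words$conf >= min_conf, ]
kept <- kept[order(kept$line, kept$fldnr), ]

# Paste the words of each line together, separated by a blank
text_lines <- tapply(kept$word, kept$line, paste, collapse = " ")
print(text_lines)
```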

HanOostdijk/HOQCutil documentation built on July 28, 2023, 5:56 p.m.