scan_with_hocr | R Documentation |
A Tesseract scan with hOCR output returns an XHTML document that contains not only each scanned word but also the line on which the word was found and its bounding box. The function scan_with_hocr
performs the scan and converts the document to a data frame. See Details and Acknowledgment.
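To make the description concrete, the sketch below shows the kind of information an hOCR word element carries and how it could be turned into rows of a data frame. It is only an illustration: the hOCR fragment is hand-written rather than real engine output, and the regular-expression approach is an assumption chosen to keep the example in base R. It is not necessarily how scan_with_hocr works internally, and a real implementation would more likely use an XML parser.

```r
# Hand-written hOCR fragment (illustrative only, not actual Tesseract output)
hocr <- paste0(
  "<span class='ocr_line' id='line_1_1' title='bbox 19 227 1087 251'>",
  "<span class='ocrx_word' id='word_1_1' title='bbox 19 227 105 251; x_wconf 95'>Hello</span> ",
  "<span class='ocrx_word' id='word_1_2' title='bbox 120 227 230 251; x_wconf 91'>world</span>",
  "</span>"
)

# Find every word element; group 1 is the title attribute, group 2 the text
pattern <- "<span class='ocrx_word'[^>]*title='([^']*)'>([^<]*)</span>"
matches <- regmatches(hocr, gregexpr(pattern, hocr, perl = TRUE))[[1]]
titles  <- sub(pattern, "\\1", matches, perl = TRUE)

# One row per word, with the bounding box and confidence taken from the title
words <- data.frame(
  word = sub(pattern, "\\2", matches, perl = TRUE),
  bbox = sub("^bbox ([0-9 ]+);.*$", "\\1", titles),
  conf = as.numeric(sub("^.*x_wconf ([0-9.]+).*$", "\\1", titles)),
  stringsAsFactors = FALSE
)
print(words)
```

The variable names `hocr`, `pattern`, `titles` and `words` are hypothetical and exist only for this example.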
scan_with_hocr(
img,
confsel = F,
extract_bbox = T,
add_header_cols = F,
engine = tesseract::tesseract("eng")
)
img | An image object or a character string with the name of an image file |
confsel | A Boolean indicating if the confidence rate should also be selected |
extract_bbox | A Boolean indicating if the bounding box should be unpacked (into x and y coordinates) |
add_header_cols | A Boolean indicating if two header columns ( |
engine | The OCR engine to use. See |
A data.frame with the scanned words. See Details.
The result is a data.frame with one row for each word found and the following columns:
line : the line on which the word was found
fldnr : the sequence number of the word on this line
word : the word that is recognized by the engine
bbox : the bounding box where the word was found (character string, e.g. '19 227 1087 251', indicating x-coordinates x1=19 and x2=1087 and y-coordinates y1=227 and y2=251). Not present when extract_bbox=T is set: in that case x1, x2, y1 and y2 are present instead.
conf : the confidence rate of the word (only when confsel=T is set)
header_col : a column of 0s (only when add_header_cols=T is set)
header_col_seq : a column of 1s (only when add_header_cols=T is set)
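As an illustration of what unpacking the bounding box means, the following sketch splits a bbox string into the four coordinate columns using base R. The sample data and the variable names `words`, `parts`, `coords` and `unpacked` are made up for the example, and the resulting column order is an assumption; the function itself may arrange its columns differently.

```r
# Made-up result as it might look with the bbox left packed
words <- data.frame(
  line  = c(1, 1),
  fldnr = c(1, 2),
  word  = c("Hello", "world"),
  bbox  = c("19 227 105 251", "120 227 230 251"),
  stringsAsFactors = FALSE
)

# Each bbox string holds four integers in the order x1 y1 x2 y2
parts  <- strsplit(words$bbox, " ", fixed = TRUE)
coords <- do.call(rbind, lapply(parts, as.integer))
colnames(coords) <- c("x1", "y1", "x2", "y2")

# Replace the packed bbox column by the four coordinate columns
unpacked <- cbind(words[setdiff(names(words), "bbox")], coords)
print(unpacked)
```

With the coordinates as separate integer columns, words can be filtered or sorted by position directly, for example `unpacked[unpacked$x1 < 100, ]`.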
This function extends a snippet by Jeroen Ooms. I added only the extraction of the line information, then wrapped the code in a function so that it can be used together with extract_table().
scanner_functions, cleanup_bw() and extract_table()
## Not run:
# img2 is assumed to be an existing image object or the name of an image file
df1 <- scan_with_hocr(img2, add_header_cols = FALSE)
## End(Not run)