README.md

This package provides some basic code for reading the XML documents that Abbyy's OCR (Optical Character Recognition) software generates as one of its output formats. These documents contain the text that was recognized, along with its location on the page and some additional metadata. We can use this information to recover structure on the page, such as tables, columns and paragraphs.
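To make the idea of "text plus location" concrete, here is a minimal, language-agnostic sketch written in Python. It does not use or describe this package's functions. The sample XML is a hypothetical, heavily simplified fragment: it assumes `line` and `charParams` elements carrying `l`, `t`, `r`, `b` (left, top, right, bottom) attributes, and it omits the namespace and the many extra elements and attributes found in real Abbyy output. The `read_lines` helper is likewise an illustrative name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. Real Abbyy XML is namespaced and much richer;
# the element and attribute names here are assumptions for illustration.
SAMPLE = """<document>
  <page width="600" height="800">
    <block blockType="Text" l="10" t="20" r="200" b="60">
      <line l="10" t="20" r="200" b="40">
        <charParams l="10" t="20" r="20" b="40">H</charParams>
        <charParams l="22" t="20" r="30" b="40">i</charParams>
      </line>
    </block>
  </page>
</document>"""


def read_lines(xml_text):
    """Return one record per <line>: its text and its (l, t, r, b) bounding box."""
    root = ET.fromstring(xml_text)
    rows = []
    for line in root.iter("line"):
        # Concatenate the recognized characters found anywhere under the line.
        text = "".join(ch.text or "" for ch in line.iter("charParams"))
        # Bounding box of the whole line, in page coordinates.
        box = tuple(int(line.get(k)) for k in ("l", "t", "r", "b"))
        rows.append({"text": text, "box": box})
    return rows


lines = read_lines(SAMPLE)
print(lines)
```

With records like these, lines that share similar left edges or vertical spacing could be grouped into columns, paragraphs or table rows, which is the kind of structure recovery described above.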

This is older code that we have had for a while and now have a use for again.

This relates to the Rtesseract package, which provides access to Tesseract, a different, open source OCR engine. We get similar information from that engine, so we can use either or both to recover the structure.

It is also related to extracting information from a PDF document.



dsidavis/AbbyyXML documentation built on May 23, 2019, 8:38 a.m.