Goals.md

Goals for GetDocElements

Given bounding boxes from either Rtesseract or ReadPDF:

  1. Convert them to a common format for subsequent operations.

  2. Identify and reconstruct document elements from the bounding boxes, including:

     - columns (two, three, or more)

     - headers, footers, page numbers, etc.

     - document title, authors, and date

     - section headers

     - section text, including sections that span pages or are interrupted by tables or figures

     - images and tables with their captions (identified and collected at this stage, but not parsed)

     - lines, boxes, and other page dividers?
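
As an illustration of the "common format" goal, here is a minimal sketch of how two differently shaped bounding-box records could be normalized into one structure. It is written in Python purely for illustration and is not the package's code. The `Box` fields, the input keys (`left`/`top`/`right`/`bottom` for OCR output; `x`/`y`/`width`/`height` for PDF output), and the coordinate conventions are all assumptions rather than the actual output of Rtesseract or ReadPDF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical common format: corners measured from the top-left of the page."""
    page: int
    left: float
    top: float
    right: float
    bottom: float
    text: str


def from_ocr(rec, page):
    # Assumed OCR-style record: corner coordinates, origin at the top-left.
    return Box(page, rec["left"], rec["top"], rec["right"], rec["bottom"], rec["text"])


def from_pdf(rec, page, page_height):
    # Assumed PDF-style record: (x, y) is the lower-left corner and the origin
    # is at the bottom-left of the page, so the y axis has to be flipped.
    left = rec["x"]
    right = rec["x"] + rec["width"]
    top = page_height - (rec["y"] + rec["height"])
    bottom = page_height - rec["y"]
    return Box(page, left, top, right, bottom, rec["text"])


ocr_box = from_ocr({"left": 10, "top": 80, "right": 60, "bottom": 92, "text": "Title"}, page=1)
pdf_box = from_pdf({"x": 10, "y": 700, "width": 50, "height": 12, "text": "Title"},
                   page=1, page_height=792)
```

The sketch ignores unit differences (for example, pixels versus points); a real conversion would also need to rescale one source to match the other.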

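One way the column goal could be approached, sketched here in Python as an illustration rather than as the package's method, is to merge the horizontal extents of the boxes on a page and treat any sufficiently wide empty gap as a column boundary. The function name `find_columns` and the `min_gap` threshold are hypothetical.

```python
def find_columns(spans, min_gap=15.0):
    """Group (left, right) extents into columns separated by gaps of at least min_gap."""
    columns = []
    for left, right in sorted(spans):
        if columns and left - columns[-1][1] < min_gap:
            # Overlaps or nearly touches the current column: extend it.
            columns[-1][1] = max(columns[-1][1], right)
        else:
            # Far enough to the right of everything seen so far: new column.
            columns.append([left, right])
    return [tuple(c) for c in columns]


# Horizontal extents of text boxes on a made-up two-column page.
spans = [(50, 280), (60, 270), (320, 550), (330, 540), (52, 150)]
columns = find_columns(spans)
```

A box that spans the full page width, such as a title or a wide table, would bridge the gap and collapse everything into one column, so such boxes would need to be set aside before this step.
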
Domain-specific tasks, such as reading tabular data or bibliographies, are handled by ReadArticle and ReadTabularDocs.



dsidavis/GetDocElements documentation built on July 8, 2019, 2:01 p.m.