
Dociface: an R package for identifying and reconstructing elements from text documents

In this package, documents are treated as single- or multi-page containers for text, and we focus primarily on PDF and scanned (OCR'ed) documents. We don't exclude Word, OpenOffice, Pages and other word-processing documents, nor diagrams or even spreadsheets, although these have much richer structure.

Dociface provides virtual classes and generic functions for identifying document elements. Classes and methods specific to PDF documents can be found in ReadPDF, while those for OCR documents can be found in Rtesseract. Methods specific to particular kinds of documents, such as tabular data or academic papers, can be found in additional packages (in development).
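To illustrate the general pattern of a virtual class plus generics, with methods supplied elsewhere for each concrete document type, here is a minimal sketch using R's S4 system. It assumes an S4-style design; the names ToyDocument, ToyTextDocument and toyNumPages are hypothetical and are not part of the Dociface, ReadPDF or Rtesseract APIs.

```r
library(methods)

# Hypothetical virtual base class: it cannot be instantiated directly and
# only serves as a common parent for concrete document types.
setClass("ToyDocument", representation("VIRTUAL"))

# Hypothetical generic declared once, alongside the virtual class.
setGeneric("toyNumPages", function(doc, ...) standardGeneric("toyNumPages"))

# A concrete document type, as a separate package might define it.
# Here each page is simply stored as one element of a list.
setClass("ToyTextDocument",
         contains = "ToyDocument",
         slots = c(pages = "list"))

# The type-specific method for the generic.
setMethod("toyNumPages", "ToyTextDocument",
          function(doc, ...) length(doc@pages))

# Usage: code written against the generic works for any subclass.
doc <- new("ToyTextDocument", pages = list("page one text", "page two text"))
toyNumPages(doc)
```

Under this arrangement, code that calls only the generic does not need to know which concrete document class it has been given.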

Essential functions

Dociface provides generics for identifying:

Additionally, Dociface provides methods to plot one or more DocumentPages.

Writing specific methods for new document types



dsidavis/Dociface documentation built on Nov. 20, 2023, 5:44 a.m.