In this package, a document is a container for text spanning one or more pages. The primary focus is on PDF and scanned (OCR'ed) documents. Word, OpenOffice, Pages, and other word-processing documents are not excluded, nor are diagrams or even spreadsheets (although these have much richer structure).
Dociface provides virtual classes and generic functions for identifying document elements. Classes and methods specific to PDF documents can be found in ReadPDF, while those for OCR documents can be found in Rtesseract. Methods specific to particular types of documents, such as tabular data or academic papers, can be found in additional packages (in development).
Dociface provides generics for identifying:
Text characteristics, including the bounding boxes of the text
Columns and column positions
Header and footer text
Line breaks, margins, and other "whitespace" elements
Shapes, lines, and figures
Font information (or text information, in the case of OCR documents)
Section titles/headers and text by section
Additionally, Dociface provides methods to plot one or multiple DocumentPage objects.
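The split between virtual classes and generic functions described above can be illustrated with a minimal S4 sketch. This is not Dociface's actual API: the class names `ToyDocElementSource` and `ToyPage` and the generic `toyTextBBox` are hypothetical, chosen only to show how a backend package (in the role that ReadPDF or Rtesseract play) might supply its own method for a shared generic.

```r
library(methods)

# Hypothetical virtual base class: it defines a common type for all
# backends but cannot be instantiated directly.
setClass("ToyDocElementSource", representation("VIRTUAL"))

# Hypothetical concrete backend: one entry per text fragment on a page,
# with the coordinates of its bounding box.
setClass("ToyPage",
         contains = "ToyDocElementSource",
         slots = c(text = "character",
                   left = "numeric", bottom = "numeric",
                   right = "numeric", top = "numeric"))

# Hypothetical generic: the interface package declares it, and each
# backend package provides a method for its own class.
setGeneric("toyTextBBox", function(obj, ...) standardGeneric("toyTextBBox"))

setMethod("toyTextBBox", "ToyPage", function(obj, ...) {
    data.frame(text = obj@text,
               left = obj@left, bottom = obj@bottom,
               right = obj@right, top = obj@top,
               stringsAsFactors = FALSE)
})

# Made-up coordinates, for illustration only.
page <- new("ToyPage",
            text = c("Title", "Body"),
            left = c(72, 72), bottom = c(700, 650),
            right = c(300, 540), top = c(720, 690))

bb <- toyTextBBox(page)
print(bb)
```

Code written against `toyTextBBox()` does not need to know which backend produced the page, which is the point of keeping the generics in a separate interface package.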