PDF-class: Convert pdf document to plain text/XML.

Description

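To get the margins of the text, use the "Rectangular Selection" tool (German: "Rechteckige Auswahl") and the menu item "Tools/General Information" (German: "Werkzeuge/Allgemeine Informationen") of your PDF viewer.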

Usage

Format

An object of class R6ClassGenerator of length 24.

Fields

filename_pdf

a character vector (length 1) providing a filename

first

integer, the first page

last

integer, the last page

page

a specific page number

xml

parsed xml

text

a list of character vectors

jitter

integer, the deviation in lines that will be checked

margin

a named integer vector ("top", "bottom", "left", "right") indicating the margins of the text

deviation

allowed deviation of columns from page center

xmlification

xml to output

no_pages

number of pages of the pdf document (after pdf2xml)

Methods

$new(filename_pdf, first = NA, last = NA, jitter = 2, deviation = 10L, margins = integer())

Initialize a new instance of the class PDF to process a pdf document.

$validate()

Check all fields for correct content.

$show_pdf()

Show the pdf document.

$make_box(box = NULL, page)

Generate a box for the data.frame in the field $boxes. Coordinates are assumed to be in points and are recalibrated into pdf units.
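The recalibration from points to pdf units can be pictured as a simple rescaling by the ratio of the two page sizes. The following is a minimal sketch under that assumption, not the package's actual implementation: the helper rescale_box() and the page sizes used here are hypothetical and chosen only for illustration.

```r
# Hypothetical helper (not part of trickypdf): rescale a box given in points
# into pdf units, assuming both page sizes are known.
rescale_box <- function(box, size_pts, size_pdf) {
  x_ratio <- size_pdf[["width"]] / size_pts[["width"]]
  y_ratio <- size_pdf[["height"]] / size_pts[["height"]]
  c(
    top = box[["top"]] * y_ratio,
    height = box[["height"]] * y_ratio,
    left = box[["left"]] * x_ratio,
    width = box[["width"]] * x_ratio
  )
}

box <- c(top = 75, height = 700, left = 44, width = 500)
size_pts <- c(width = 595, height = 842)     # roughly A4, in points
size_pdf <- c(width = 892.5, height = 1263)  # assumed pdf units (factor 1.5)
box_pdf <- rescale_box(box, size_pts, size_pdf)
print(box_pdf)
```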

$add_box(box = NULL, page = NULL, replace = TRUE)

Add a box that will serve as a crop box.

$drop_unboxed_text_nodes(node, boxes, copy = FALSE)

Remove any nodes that are not within defined boxes.
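To illustrate the idea of dropping text outside a box, here is a rough sketch that works on a plain data.frame rather than on XML nodes. The function inside_box() and the rule that a node counts as "inside" when its top-left corner lies within the box are assumptions made for this example; the method itself may apply a different criterion.

```r
# Hypothetical sketch: flag text nodes whose top-left corner lies inside a box.
inside_box <- function(nodes, box) {
  nodes$top >= box[["top"]] &
    nodes$top <= box[["top"]] + box[["height"]] &
    nodes$left >= box[["left"]] &
    nodes$left <= box[["left"]] + box[["width"]]
}

# Toy positions of four text nodes on one page (made-up values)
nodes <- data.frame(
  top = c(20, 100, 400, 800),
  left = c(50, 60, 300, 50),
  text = c("page header", "first line", "second line", "page 7"),
  stringsAsFactors = FALSE
)
box <- c(top = 75, height = 700, left = 44, width = 500)

# Keep only the nodes inside the box; header and page number are dropped
kept <- nodes[inside_box(nodes, box), ]
print(kept$text)
```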

$drop_page(page)

Drop a page from the XML of the pdf document.

$remove_unboxed_text_from_all_pages()

Remove anything that is printed on pages beyond the defined boxes.

$decolumnize()

Remove columnization if pages are typeset in two columns. Multi-column layouts with three or more columns are not supported so far. The procedure adjusts the coordinates of text to the right of the horizontal page center: the page height is added to the top position, and half of the page width is subtracted from the left position.
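The coordinate shift described above can be sketched on a toy data.frame of text positions. This is an illustration only: decolumnize_nodes() is a hypothetical helper, the allowed deviation from the page center is ignored, and the final sort by position is an added assumption to make the resulting reading order visible.

```r
# Hypothetical sketch: move the right column below the left column.
decolumnize_nodes <- function(nodes, page_width, page_height) {
  center <- page_width / 2
  right <- nodes$left >= center
  # Right-column text: add the page height to top, subtract half the width from left
  nodes$top[right] <- nodes$top[right] + page_height
  nodes$left[right] <- nodes$left[right] - center
  # Sort by the new positions (assumption, for illustration)
  nodes[order(nodes$top, nodes$left), ]
}

nodes <- data.frame(
  top = c(100, 100, 200, 200),
  left = c(50, 350, 50, 350),
  text = c("L1", "R1", "L2", "R2"),
  stringsAsFactors = FALSE
)
flat <- decolumnize_nodes(nodes, page_width = 600, page_height = 800)
print(flat$text)
```

After the shift, all text of the left column precedes all text of the right column, which is the reading order of a two-column page.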

$get_pagesizes()

Get page width and height (points/pts and pdf units). The pdf units are extracted from the xmlified pdf document. To get sizes in points (pts), pdf_pagesize (package pdftools) is used. The result is a data.frame in the field pagesizes. The method is called when parsing the pdf document.

$get_text(node, paragraphs = TRUE)

Get the text from the document in the field 'xml'.

$get_number_of_pages()

Get the number of pages of the XML document of the pdf.

$get_text_from_pages(paragraphs = TRUE)
$get_text_from_boxes(paragraphs)

Iterate through pages, and extract text as defined by boxes from pages. The result will be assigned to field pages.

regex

Find matches for regex on pages. The method returns the pages with at least one match for the regex.
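Conceptually, such a search can be sketched over a list of character vectors, one per page. The function pages_matching() and the sample data are hypothetical, and whether the method returns page numbers or page content is not stated here; this sketch returns the names of the matching pages.

```r
# Hypothetical sketch: return the names of pages with at least one regex match.
pages_matching <- function(pages, regex) {
  hit <- vapply(pages, function(lines) any(grepl(regex, lines)), logical(1))
  names(pages)[hit]
}

# Made-up page content, one character vector per page
pages <- list(
  "1" = c("Opening of the session", "Agenda"),
  "2" = c("Statement by the President"),
  "3" = c("Closing of the session")
)
matched <- pages_matching(pages, "session")
print(matched)
```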

$reorder()

Reorder text nodes on a page. Not yet functional!

$cut()

NOT WORKING

$reconstruct_paragraphs()

Reconstruct paragraphs based on the following heuristic: If a line ends with a hyphen and is not a stump, lines are concatenated.
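A rough sketch of such a heuristic follows. It rests on two assumptions that are not taken from the package: a "stump" is read as a line shorter than a threshold (the hypothetical parameter min_chars), and the trailing hyphen is removed when lines are joined.

```r
# Hypothetical sketch: join a line ending with a hyphen to the following line,
# unless the line is shorter than min_chars (assumed meaning of "stump").
join_hyphenated <- function(lines, min_chars = 20) {
  out <- character()
  for (line in lines) {
    n <- length(out)
    if (n > 0 && grepl("-$", out[n]) && nchar(out[n]) >= min_chars) {
      # Drop the trailing hyphen and append the continuation
      out[n] <- paste0(sub("-$", "", out[n]), line)
    } else {
      out <- c(out, line)
    }
  }
  out
}

lines <- c(
  "The committee discussed the imple-",
  "mentation of the resolution.",
  "A-",
  "B"
)
joined <- join_hyphenated(lines)
print(joined)
```

The short line "A-" is left untouched because it falls below the assumed length threshold.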

$purge()

Remove noise and surplus whitespace characters from the text.
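The whitespace part of such a cleanup can be sketched in base R as follows. The helper squeeze_whitespace() is hypothetical and covers only whitespace; what else the method treats as noise is not specified here.

```r
# Hypothetical sketch: collapse runs of whitespace and trim both ends.
squeeze_whitespace <- function(x) {
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- squeeze_whitespace(c("  The   General  Assembly ", "adopted\tthe   text"))
print(cleaned)
```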

$xmlify(root = "document", metadata = NULL)

Turn the content of the field 'pages' into an XML document, optionally adding metadata.

$xml2md()

Turn xmlified document into markdown (will be stored in field 'markdown').

$md2html()

Turn markdown (field 'markdown') into an html document that will be stored in the field 'html'.

$xml2html()

Turn the xmlification of the pdf document into an html document to support quality checks.

$browse(viewer = getOption("viewer", utils::browseURL))

Show html document in browser.

$write(filename)

Save xmlified document (available in the field 'xmlification') to a file.

Arguments

filename_pdf

path to a pdf document

first

first page of the pdf document to be included

last

last page of the pdf document to be included

jitter

points up and down for reconstructing tilted lines

deviation

points that lines may deviate from middle of page

margins

named vector (top, bottom, left, right)

root

name of the root node

metadata

named character vector, attributes of the root node of the output xml document

filename

character vector

viewer

the viewer to use to inspect pdf or html documents

Examples

# Basic scenario: A straightforward pdf without columns;
# only page numbers and text in the margins get in the way

cdu_pdf <- system.file(package = "trickypdf", "extdata", "pdf", "cdu.pdf")
P <- PDF$new(filename_pdf = cdu_pdf, first = 7, last = 119)
# P$show_pdf()
P$add_box(box = c(top = 75, height = 700, left = 44, width = 500))
P$remove_unboxed_text_from_all_pages()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
# P$browse()
output <- tempfile(fileext = ".xml")
P$write(filename = output)


# Advanced scenario I: Get text from pdf with columns, here: define boxes

doc <- system.file(package = "trickypdf", "extdata", "pdf", "UN_GeneralAssembly_2016.pdf")
UN <- PDF$new(filename_pdf = doc)
# UN$show_pdf()
UN$add_box(page = 1, box = c(top = 380, height = 250, left = 52, width = 255))
UN$add_box(page = 1, box = c(top = 232, height = 400, left = 303, width = 255), replace = FALSE)
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 52, width = 255))
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 303, width = 255), replace = FALSE)
UN$get_text_from_boxes(paragraphs = TRUE)
UN$xmlify()
UN$xml2html()
if (interactive()) UN$browse()

# Advanced scenario II: Get text from pdf with columns, long version

plenaryprotocol <- system.file(package = "trickypdf", "extdata", "pdf", "18238.pdf")
P <- PDF$new(filename_pdf = plenaryprotocol, first = 5, last = 73)
# P$show_pdf()
P$add_box(c(left = 58, width = 480, top = 70, height = 705))
P$remove_unboxed_text_from_all_pages()
P$deviation <- 10L
P$decolumnize()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
if (interactive()) P$browse()

PolMine/trickypdf documentation built on Nov. 20, 2019, 8:01 p.m.