Description Usage Format Fields Methods Arguments Examples
To get the margins of the text, use "Rechteckige Auswahl" (rectangular selection) and "Werkzeuge/Allgemeine Informationen" (Tools/General Information).
An object of class R6ClassGenerator of length 24.
filename_pdf
a character vector (length 1) providing a filename
first
integer, the first page
last
integer, the last page
page
a specific page number
xml
parsed xml
text
a list of character vectors
jitter
integer, the deviation in lines that will be checked
margin
a named integer vector ("top", "bottom", "left", "right") indicating the margins of the text
deviation
allowed deviation of columns from page center
xmlification
xml to output
no_pages
number of pages of the pdf document (after pdf2xml)
$new(filename_pdf, first = NA, last = NA, jitter = 2, deviation = 10L, margins = integer())
Initialize a new instance of the class PDF to process a pdf document.
$validate()
Check all fields for correct content.
$show_pdf()
Show the pdf document.
$make_box(box = NULL, page)
Generate a box for the data.frame in the field $boxes. Coordinates are assumed to be in points and are recalibrated into pdf units.
$add_box(box = NULL, page = NULL, replace = TRUE)
Add a box that will serve as a crop box.
$drop_unboxed_text_nodes(node, boxes, copy = FALSE)
Remove any nodes that are not within defined boxes.
$drop_page(page)
Drop a page from the XML of the pdf document.
$remove_unboxed_text_from_all_pages()
Remove anything that is printed on pages beyond the defined boxes.
$decolumnize()
Remove columnization, if pages are typeset with two columns. Multi-column layouts with three or more columns are not supported so far. The procedure adjusts the coordinates of text right of the horizontal page center, i.e. the page height is added to the top position, and half of the page width is subtracted from the left position.
$get_pagesizes()
Get page width and height (points/pts and pdf units). The pdf units are extracted from the xmlified pdf document. To get sizes in points (pts), pdf_pagesize (package pdftools) is used. The result is a data.frame in the field pagesizes. The method is called when parsing the pdf document.
$get_text(node, paragraphs = TRUE)
Get the text from the document in the field 'xml'.
$get_number_of_pages()
Get the number of pages of the XML document of the pdf.
$get_text_from_pages(paragraphs = TRUE)
$get_text_from_boxes(paragraphs)
Iterate through pages and extract text as defined by boxes from pages. The result will be assigned to the field pages.
regex
Find matches for regex on pages. The method returns the pages with at least one match for the regex.
$reorder()
Reorder text nodes on a page. Not yet functional!
$cut()
NOT WORKING
$reconstruct_paragraphs
Reconstruct paragraphs based on the following heuristic: If a line ends with a hyphen and is not a stump, lines are concatenated.
$purge()
Remove noise and surplus whitespace from the text.
$xmlify(root = "document", metadata = NULL)
Turn the content of the field 'pages' into an XML document, optionally adding metadata.
$xml2md
Turn xmlified document into markdown (will be stored in field 'markdown').
$md2html()
Turn markdown (field markdown) into an html document that will be stored in the field html.
$xml2html()
Turn xmlification of pdf document into html document to support quality checks.
$browse(viewer = getOption("viewer", utils::browseURL))
Show html document in browser.
$write(filename)
Save xmlified document (available in the field 'xmlification') to a file.
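
The coordinate adjustment described for $decolumnize can be illustrated with a minimal base R sketch. It is not the package's implementation: the function decolumnize_sketch, the data.frame layout (columns top, left, text) and the page dimensions are assumptions made for illustration, and the deviation field is ignored.

```r
# Hypothetical helper, not part of trickypdf: shift text right of the
# horizontal page center so that it follows the left column.
decolumnize_sketch <- function(nodes, page_width, page_height) {
  # nodes whose left edge lies right of the horizontal page center
  right <- nodes$left > page_width / 2
  # add the page height to the top position ...
  nodes$top[right] <- nodes$top[right] + page_height
  # ... and subtract half of the page width from the left position
  nodes$left[right] <- nodes$left[right] - page_width / 2
  # reading order: by adjusted top position, then by left position
  nodes[order(nodes$top, nodes$left), , drop = FALSE]
}

# Assumed toy input: two lines in each of two columns
nodes <- data.frame(
  top = c(100, 100, 200, 200),
  left = c(50, 350, 50, 350),
  text = c("left 1", "right 1", "left 2", "right 2"),
  stringsAsFactors = FALSE
)
flat <- decolumnize_sketch(nodes, page_width = 600, page_height = 800)
flat$text
```

After the shift, sorting by top position alone puts the complete left column before the right column, which is why the page height rather than some smaller offset is added.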
filename_pdf
path to a pdf document
first
first page of the pdf document to be included
last
last page of the pdf document to be included
jitter
points up and down for reconstructing tilted lines
deviation
points that lines may deviate from middle of page
margins
named vector (top, bottom, left, right)
root
name of the root node
metadata
named character vector, attributes of the root node of the output xml document
filename
character vector
viewer
the viewer to use to inspect pdf or html documents
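
The hyphen heuristic described for $reconstruct_paragraphs under Methods can be sketched in base R as follows. This is an illustration under assumptions, not the package's code: join_hyphenated is a hypothetical helper, the check for a "stump" is left out because the documentation does not define it, and dropping the hyphen when joining is likewise an assumption.

```r
# Hypothetical helper, not part of trickypdf: concatenate a line that ends
# with a hyphen with the line that follows it.
join_hyphenated <- function(lines) {
  out <- character()
  carry <- ""
  for (line in lines) {
    line <- paste0(carry, line)
    if (grepl("-$", line)) {
      # drop the trailing hyphen and wait for the next line
      carry <- sub("-$", "", line)
    } else {
      out <- c(out, line)
      carry <- ""
    }
  }
  # a hyphenated last line has nothing to join with: keep it, hyphen restored
  if (nzchar(carry)) out <- c(out, paste0(carry, "-"))
  out
}

lines <- c("The docu-", "ment has two col-", "umns.", "Next line")
joined <- join_hyphenated(lines)
joined
```

Note that this simple version also joins compounds that are legitimately hyphenated at a line break, so a real implementation needs additional checks.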
# Basic scenario: A straightforward pdf without columns;
# only page numbers and text in the margins get in the way
cdu_pdf <- system.file(package = "trickypdf", "extdata", "pdf", "cdu.pdf")
P <- PDF$new(filename_pdf = cdu_pdf, first = 7, last = 119)
# P$show_pdf()
P$add_box(box = c(top = 75, height = 700, left = 44, width = 500))
P$remove_unboxed_text_from_all_pages()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
# P$browse()
output <- tempfile(fileext = ".xml")
P$write(filename = output)
# Advanced scenario I: Get text from pdf with columns, here: define boxes
doc <- system.file(package = "trickypdf", "extdata", "pdf", "UN_GeneralAssembly_2016.pdf")
UN <- PDF$new(filename_pdf = doc)
# UN$show_pdf()
UN$add_box(page = 1, box = c(top = 380, height = 250, left = 52, width = 255))
UN$add_box(page = 1, box = c(top = 232, height = 400, left = 303, width = 255), replace = FALSE)
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 52, width = 255))
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 303, width = 255), replace = FALSE)
UN$get_text_from_boxes(paragraphs = TRUE)
UN$xmlify()
UN$xml2html()
if (interactive()) UN$browse()
# Advanced scenario II: Get text from pdf with columns, long version
plenaryprotocol <- system.file(package = "trickypdf", "extdata", "pdf", "18238.pdf")
P <- PDF$new(filename_pdf = plenaryprotocol, first = 5, last = 73)
# P$show_pdf()
P$add_box(c(left = 58, width = 480, top = 70, height = 705))
P$remove_unboxed_text_from_all_pages()
P$deviation <- 10L
P$decolumnize()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
if (interactive()) P$browse()