To get the margins of the text, use the rectangular selection tool ("Rechteckige Auswahl") and Tools/General Information ("Werkzeuge/Allgemeine Informationen").
An object of class R6ClassGenerator of length 24.
Fields

filename_pdf: a character vector (length 1) providing a filename
first: an integer, the first page
last: an integer, the last page
page: a specific page number
xml: the parsed XML
text: a list of character vectors
jitter: an integer, the deviation in lines that will be checked
margin: a named integer vector ("top", "bottom", "left", "right") indicating the margins of the text
deviation: allowed deviation of columns from the page center
xmlification: the XML to output
no_pages: number of pages of the pdf document (after pdf2xml)
Methods

$new(filename_pdf, first = NA, last = NA, jitter = 2, deviation = 10L, margins = integer()): Initialize a new instance of the class PDF to process a pdf document.
$validate(): Check all fields for correct content.
$show_pdf(): Show the pdf document.
$make_box(box = NULL, page): Generate a box for the data.frame in the field $boxes. Coordinates are assumed to be in points and are recalibrated into pdf units.
$add_box(box = NULL, page = NULL, replace = TRUE): Add a box that will serve as a crop box.
$drop_unboxed_text_nodes(node, boxes, copy = FALSE): Remove any nodes that are not within the defined boxes.
$drop_page(page): Drop a page from the XML of the pdf document.
$remove_unboxed_text_from_all_pages(): Remove anything that is printed on pages outside the defined boxes.
$decolumnize(): Remove columnization if pages are typeset with two columns. Multi-column layouts with three or more columns are not supported so far. The procedure adjusts the coordinates of text to the right of the horizontal page center: the page height is added to the top position, and half of the page width is subtracted from the left position.
$get_pagesizes(): Get page width and height (in points/pts and in pdf units). The pdf units are extracted from the xmlified pdf document. To get sizes in points (pts), pdf_pagesize (package pdftools) is used. The result is a data.frame in the field pagesizes. The method is called when parsing the pdf document.
$get_text(node, paragraphs = TRUE): Get the text from the document in the field 'xml'.
$get_number_of_pages(): Get the number of pages of the XML document of the pdf.
$get_text_from_pages(paragraphs = TRUE), $get_text_from_boxes(paragraphs): Iterate through pages and extract text as defined by boxes from pages. The result will be assigned to the field pages.
regex: Find matches for regex on pages. The method returns the pages with at least one match for the regex.
$reorder(): Reorder text nodes on a page. Not yet functional!
$cut(): Not working.
$reconstruct_paragraphs: Reconstruct paragraphs based on the following heuristic: if a line ends with a hyphen and is not a stump, lines are concatenated.
$purge(): Remove noise and surplus whitespace from the text.
$xmlify(root = "document", metadata = NULL): Turn the content of the field 'pages' into an XML document, optionally adding metadata.
$xml2md: Turn the xmlified document into markdown (will be stored in the field 'markdown').
$md2html(): Turn markdown (field markdown) into an html document that will be stored in the field html.
$xml2html(): Turn the xmlification of the pdf document into an html document to support quality checks.
$browse(viewer = getOption("viewer", utils::browseURL)): Show the html document in the browser.
$write(): Save the xmlified document (available in the field 'xmlification') to a file.
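
The coordinate adjustment described for $decolumnize can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper name decolumnize_sketch, the data.frame layout (columns top and left) and the way deviation is applied are assumptions made for illustration only.

```r
# Hypothetical helper, not part of trickypdf. `nodes` is assumed to be a
# data.frame with numeric columns `top` and `left` (pdf units) describing
# the text nodes of a single two-column page.
decolumnize_sketch <- function(nodes, page_width, page_height, deviation = 10) {
  center <- page_width / 2
  # Assumption: a node counts as right-column text if it starts no more than
  # `deviation` units to the left of the page center.
  right <- nodes$left >= center - deviation
  # Move right-column text below the left column ...
  nodes$top[right] <- nodes$top[right] + page_height
  # ... and shift it left by half of the page width.
  nodes$left[right] <- nodes$left[right] - center
  # Sorting by `top` now yields the left column followed by the right column.
  nodes[order(nodes$top, nodes$left), , drop = FALSE]
}

nodes <- data.frame(
  text = c("left 1", "right 1", "left 2", "right 2"),
  top = c(100, 100, 120, 120),
  left = c(50, 310, 50, 310),
  stringsAsFactors = FALSE
)
decolumnized <- decolumnize_sketch(nodes, page_width = 600, page_height = 800)
decolumnized$text
```

After the shift, reading the nodes from top to bottom gives the whole left column first and the right column second, which is the single-column order that a later text extraction step can rely on.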
Arguments

filename_pdf: path to a pdf document
first: first page of the pdf document to be included
last: last page of the pdf document to be included
jitter: points up and down for reconstructing tilted lines
deviation: points that lines may deviate from the middle of the page
margins: named vector (top, bottom, left, right)
root: name of the root node
metadata: named character vector, attributes of the root node of the output xml document
filename: character vector
viewer: the viewer to use to inspect pdf or html documents
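
To make the role of a tolerance such as jitter more concrete, here is a minimal base R sketch of grouping text fragments into lines when their vertical positions differ only slightly. How the package actually uses jitter is not documented here, so the helper group_lines_sketch and its grouping rule are assumptions.

```r
# Hypothetical helper, not part of trickypdf: assign a line id to each text
# fragment on a page, given the vertical positions (`top`) of the fragments.
group_lines_sketch <- function(top, jitter = 2) {
  ord <- order(top)
  sorted <- top[ord]
  # Assumption: a new line starts whenever the gap to the previous fragment
  # (in sorted order) exceeds `jitter`.
  line_sorted <- cumsum(c(TRUE, diff(sorted) > jitter))
  # Map the line ids back to the original order of the fragments.
  line_id <- integer(length(top))
  line_id[ord] <- line_sorted
  line_id
}

top <- c(100, 101, 99, 120, 122, 150)
line_ids <- group_lines_sketch(top, jitter = 2)
line_ids
```

With a larger jitter, more strongly tilted lines are merged into one line, at the risk of merging lines that are genuinely separate but closely spaced.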
Examples

# Basic scenario: A straightforward pdf without columns;
# only page numbers and text in the margins get in the way
cdu_pdf <- system.file(package = "trickypdf", "extdata", "pdf", "cdu.pdf")
P <- PDF$new(filename_pdf = cdu_pdf, first = 7, last = 119)
# P$show_pdf()
P$add_box(box = c(top = 75, height = 700, left = 44, width = 500))
P$remove_unboxed_text_from_all_pages()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
# P$browse()
output <- tempfile(fileext = ".xml")
P$write(filename = output)
# Advanced scenario I: Get text from pdf with columns, here: define boxes
doc <- system.file(package = "trickypdf", "extdata", "pdf", "UN_GeneralAssembly_2016.pdf")
UN <- PDF$new(filename_pdf = doc)
# UN$show_pdf()
UN$add_box(page = 1, box = c(top = 380, height = 250, left = 52, width = 255))
UN$add_box(page = 1, box = c(top = 232, height = 400, left = 303, width = 255), replace = FALSE)
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 52, width = 255))
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 303, width = 255), replace = FALSE)
UN$get_text_from_boxes(paragraphs = TRUE)
UN$xmlify()
UN$xml2html()
if (interactive()) UN$browse()
# Advanced scenario II: Get text from pdf with columns, long version
plenaryprotocol <- system.file(package = "trickypdf", "extdata", "pdf", "18238.pdf")
P <- PDF$new(filename_pdf = plenaryprotocol, first = 5, last = 73)
# P$show_pdf()
P$add_box(c(left = 58, width = 480, top = 70, height = 705))
P$remove_unboxed_text_from_all_pages()
P$deviation <- 10L
P$decolumnize()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
if (interactive()) P$browse()
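
# The boxes passed to add_box() above are given in points, while the xmlified
# pdf uses pdf units. The following base R sketch illustrates the idea of
# rescaling a box and keeping only the text nodes inside it. It is not the
# package's implementation: box_to_pdf_units, inside_box, width_pts, width_pdf
# and the layout of `nodes` are assumptions made for illustration.

```r
# Hypothetical helper: rescale a box from points to pdf units.
# Assumption: one uniform scale factor applies to both axes.
box_to_pdf_units <- function(box, width_pts, width_pdf) {
  box * (width_pdf / width_pts)
}

# Hypothetical helper: TRUE for nodes whose top-left corner lies in the box.
inside_box <- function(nodes, box) {
  nodes$left >= box[["left"]] &
    nodes$left <= box[["left"]] + box[["width"]] &
    nodes$top >= box[["top"]] &
    nodes$top <= box[["top"]] + box[["height"]]
}

box_pts <- c(top = 75, height = 700, left = 44, width = 500)
box_pdf <- box_to_pdf_units(box_pts, width_pts = 595, width_pdf = 892.5)

# Made-up text nodes in pdf units, for illustration only.
nodes <- data.frame(
  text = c("page header", "body text", "page number"),
  top = c(40, 300, 1200),
  left = c(100, 100, 440),
  stringsAsFactors = FALSE
)
kept <- nodes[inside_box(nodes, box_pdf), "text"]
kept
```

# Only the top-left corner of each node is tested in this sketch; a stricter
# variant could also require that the node's full extent lies within the box.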