xml for corpus preparation

Description Usage Format Fields Methods Arguments Examples

To get the margins of the text, use the rectangular selection tool ("Rechteckige Auswahl" in a German-language PDF viewer) and Tools / General Information ("Werkzeuge/Allgemeine Informationen").

PDF

An object of class R6ClassGenerator of length 24.

filename_pdf: a character vector (length 1) providing a filename
first: integer, the first page
last: integer, the last page
page: a specific page number
xml: parsed xml
text: a list of character vectors
jitter: integer, the deviation in lines that will be checked
margin: a named integer vector ("top", "bottom", "left", "right") indicating the margins of the text
deviation: allowed deviation of columns from page center
xmlification: xml to output
no_pages: number of pages of the pdf document (after pdf2xml)

$new(filename_pdf, first = NA, last = NA, jitter = 2, deviation = 10L, margins = integer()): Initialize a new instance of the class PDF to process a pdf document.
$validate(): Check all fields for correct content.
$show_pdf(): Show the pdf document.
$make_box(box = NULL, page): Generate a box for the data.frame in the field $boxes. Coordinates are assumed to be in points and are recalibrated into pdf units.
$add_box(box = NULL, page = NULL, replace = TRUE): Add a box that will serve as a crop box.
$drop_unboxed_text_nodes(node, boxes, copy = FALSE): Remove any nodes that are not within defined boxes.
$drop_page(page): Drop a page from the XML of the pdf document.
$remove_unboxed_text_from_all_pages(): Remove anything that is printed on pages beyond the defined boxes.
$decolumnize(): Remove columnization if pages are typeset with two columns. Multi-column layouts with three or more columns are not supported so far. The procedure adjusts the coordinates of text to the right of the horizontal page center: the page height is added to the top position, and half of the page width is subtracted from the left position.
$get_pagesizes(): Get page width and height (points/pts and pdf units). The pdf units are extracted from the xmlified pdf document. To get sizes in points (pts), pdf_pagesize (package pdftools) is used. The result is a data.frame in the field pagesizes. The method is called when parsing the pdf document.
$get_text(node, paragraphs = TRUE): Get the text from document in field 'xml'.
$get_number_of_pages(): Get the number of pages of the XML document of the pdf.
$get_text_from_pages(paragraphs = TRUE)
$get_text_from_boxes(paragraphs): Iterate through pages, and extract text as defined by boxes from pages. The result will be assigned to field pages.
regex: Find matches for regex on pages. The method returns the pages with at least one match for the regex.
$reorder(): Reorder text nodes on a page. Not yet functional!
$cut(): NOT WORKING
$reconstruct_paragraphs: Reconstruct paragraphs based on the following heuristic: If a line ends with a hyphen and is not stump, lines are concatenated.
$purge(): Remove noise and surplus whitespace characters from the text.
$xmlify(root = "document", metadata = NULL): Turn content of field 'pages' into an XML document, optionally adding metadata.
$xml2md: Turn xmlified document into markdown (will be stored in field 'markdown').
$md2html(): Turn markdown (field markdown) into html document that will be stored in the field html.
$xml2html(): Turn xmlification of pdf document into html document to support quality checks.
$browse(viewer = getOption("viewer", utils::browseURL)): Show html document in browser.
$write(filename): Save xmlified document (available in the field 'xmlification') to a file.
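
The coordinate adjustment described for $decolumnize can be illustrated with a minimal sketch in base R. This is not the code of the class itself: the function decolumnize_sketch, the data.frame layout and the page dimensions are hypothetical and chosen for illustration only, and the sketch ignores the deviation setting.

```r
# Hypothetical text nodes of one two-column page (coordinates in pdf units)
nodes <- data.frame(
  text = c("left 1", "right 1", "left 2", "right 2"),
  top  = c(100, 100, 120, 120),
  left = c(50, 350, 50, 350),
  stringsAsFactors = FALSE
)
page_width  <- 600  # assumed page width
page_height <- 800  # assumed page height

# Illustrative helper (not part of the package): move the right column
# below the left column, so that sorting by position yields reading order
decolumnize_sketch <- function(nodes, page_width, page_height) {
  center <- page_width / 2
  right <- nodes$left >= center
  # add the page height to the top position of right-column text
  nodes$top[right] <- nodes$top[right] + page_height
  # subtract half of the page width from the left position
  nodes$left[right] <- nodes$left[right] - center
  nodes[order(nodes$top, nodes$left), ]
}

single <- decolumnize_sketch(nodes, page_width, page_height)
```

After the shift, all nodes share the same left position, and the text of the right column follows the text of the left column when nodes are ordered by their top position.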

filename_pdf: path to a pdf document
first: first page of the pdf document to be included
last: last page of the pdf document to be included
jitter: points up and down for reconstructing tilted lines
deviation: points that lines may deviate from middle of page
margins: named vector (top, bottom, left, right)
root: name of the root node
metadata: named character vector, attributes of root node of output xml document
filename: character vector
viewer: the viewer to use to inspect pdf or html documents
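
One way to picture the jitter argument is as a tolerance when text fragments are assigned to lines: fragments whose vertical positions differ only slightly are treated as one (possibly tilted) line. The following base R sketch rests on that assumption. The function group_lines_sketch and the sample positions are hypothetical, and the class may implement this differently.

```r
# Hypothetical top positions (in points) of text fragments on one page
tops <- c(100, 101, 99, 120, 121, 140)
jitter <- 2

# Illustrative helper (not part of the package): assign a line id to
# every fragment, tolerating small vertical deviations
group_lines_sketch <- function(tops, jitter) {
  ord <- order(tops)
  sorted <- tops[ord]
  # start a new line whenever the gap to the previous fragment
  # (in sorted order) is larger than jitter
  id_sorted <- cumsum(c(TRUE, diff(sorted) > jitter))
  line_id <- integer(length(tops))
  line_id[ord] <- id_sorted
  line_id
}

line_id <- group_lines_sketch(tops, jitter)
```

Here the first three fragments end up on the same line although their top positions are not identical, which is the effect a tolerance of this kind is meant to achieve.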

# Basic scenario: a straightforward pdf without columns;
# only page numbers and text in the margins get in the way

cdu_pdf <- system.file(package = "trickypdf", "extdata", "pdf", "cdu.pdf")
P <- PDF$new(filename_pdf = cdu_pdf, first = 7, last = 119)
# P$show_pdf()
P$add_box(box = c(top = 75, height = 700, left = 44, width = 500))
P$remove_unboxed_text_from_all_pages()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
# P$browse()
output <- tempfile(fileext = ".xml")
P$write(filename = output)


# Advanced scenario I: Get text from pdf with columns, here: define boxes

doc <- system.file(package = "trickypdf", "extdata", "pdf", "UN_GeneralAssembly_2016.pdf")
UN <- PDF$new(filename_pdf = doc)
# UN$show_pdf()
UN$add_box(page = 1, box = c(top = 380, height = 250, left = 52, width = 255))
UN$add_box(page = 1, box = c(top = 232, height = 400, left = 303, width = 255), replace = FALSE)
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 52, width = 255))
UN$add_box(page = 2, box = c(top = 80, height = 595, left = 303, width = 255), replace = FALSE)
UN$get_text_from_boxes(paragraphs = TRUE)
UN$xmlify()
UN$xml2html()
if (interactive()) UN$browse()

# Advanced scenario II: Get text from pdf with columns, long version

plenaryprotocol <- system.file(package = "trickypdf", "extdata", "pdf", "18238.pdf")
P <- PDF$new(filename_pdf = plenaryprotocol, first = 5, last = 73)
# P$show_pdf()
P$add_box(c(left = 58, width = 480, top = 70, height = 705))
P$remove_unboxed_text_from_all_pages()
P$deviation <- 10L
P$decolumnize()
P$get_text_from_pages()
P$purge()
P$xmlify()
P$xml2html()
if (interactive()) P$browse()