PDF-to-XML conversion of scientific articles using pdfx

Description

Uses a web service provided by Utopia at http://pdfx.cs.man.ac.uk/. Beware, this can be quite slow. pdfx posts the PDF from your machine to the web service; pdfx_html takes the output of pdfx and returns an HTML version of the extracted text; pdfx_targz writes a tar.gz version of the extracted text to disk. This will not work with PDFs that are scans of text or that consist mostly of images.

Usage

pdfx(file, what = "parsed", ...)

pdfx_html(input, ...)

pdfx_targz(input, write_path, ...)

Arguments

file

(character) Path to one or more PDF files on your machine. Required.
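
Because every file is uploaded to the remote service, it can help to check the paths locally first. The following is a minimal sketch in base R and is not part of fulltext; the helper name is_pdf_file is hypothetical. It only checks that each path exists and starts with the %PDF- signature, so it cannot detect a scanned PDF that lacks a text layer.

```r
# Hypothetical helper (not part of fulltext): returns TRUE for each path that
# exists, is not a directory, and begins with the "%PDF-" signature.
is_pdf_file <- function(paths) {
  vapply(paths, function(p) {
    if (!file.exists(p) || dir.exists(p)) return(FALSE)
    header <- readBin(p, what = "raw", n = 5L)
    identical(header, charToRaw("%PDF-"))
  }, logical(1), USE.NAMES = FALSE)
}

# Demonstration with throwaway files rather than real articles
good <- tempfile(fileext = ".pdf")
writeLines("%PDF-1.4 placeholder", good)
bad <- tempfile(fileext = ".pdf")
writeLines("plain text", bad)

is_pdf_file(c(good, bad, "no-such-file.pdf"))
```

Paths that pass the check could then be handed to pdfx one at a time.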

what

(character) One of "parsed" (the default) or "text".

...

Further args passed to GET. These aren't named, so pass them directly, e.g. verbose() or timeout(3).
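
Since the service can be slow, a request with a timeout may fail partway through a batch. The sketch below is one way to keep going after a failure; try_convert is a hypothetical base R wrapper, not part of fulltext. The commented call assumes that the unnamed arguments are httr config objects such as httr::timeout().

```r
# Hypothetical wrapper: run a conversion function and return NULL (with a
# warning) instead of stopping when the call errors, e.g. on a timeout.
try_convert <- function(convert, ...) {
  tryCatch(
    convert(...),
    error = function(e) {
      warning("conversion failed: ", conditionMessage(e), call. = FALSE)
      NULL
    }
  )
}

# Intended use, assuming fulltext and httr are installed and the service is up:
# out <- try_convert(fulltext::pdfx, file = path, httr::timeout(60))

# Offline demonstration with stand-in functions
ok <- try_convert(function(x) x + 1, 1)
failed <- suppressWarnings(try_convert(function(...) stop("Timeout was reached")))
```

Here ok holds the stand-in's result and failed is NULL, so a loop over many files can skip the failures and continue.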

input

Output from the pdfx function.

write_path

Path to write the tarball to.

Value

pdfx gives XML parsed to an xml_document; pdfx_html gives HTML; pdfx_targz writes a tar.gz file to disk.
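
An xml_document can be queried with the xml2 package. The sketch below shows the general pattern on a small stand-in document. The element names (article-title, section, h1, region) are illustrative assumptions and may not match what the service actually returns, so inspect the real output before relying on any XPath.

```r
library(xml2)

# Stand-in for pdfx() output; the element names here are illustrative only.
doc <- read_xml(paste0(
  "<pdfx><article>",
  "<front><article-title>An example title</article-title></front>",
  "<body><section><h1>Introduction</h1>",
  "<region>First paragraph.</region></section></body>",
  "</article></pdfx>"
))

# Pull out single and repeated nodes with XPath
title <- xml_text(xml_find_first(doc, "//article-title"))
headings <- xml_text(xml_find_all(doc, "//section/h1"))
```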

Examples

## Not run: 
path <- system.file("examples", "example2.pdf", package = "fulltext")

# Convert the PDF to parsed XML (uploads the file to the web service)
out <- pdfx(file = path)

# HTML version of the extracted text
pdfx_html(out)

# Write a tar.gz of the extracted text to disk
tarfile <- tempfile(fileext = ".tar.gz")
pdfx_targz(input = out, write_path = tarfile)

## End(Not run)