
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

`{pdfparser}`

{pdfparser} provides tools for dealing with text extraction from PDFs.

It comes in handy when you want not only to read in the text (a task better served by other tools) but also to work with the PDF coordinates of the words, lines and blocks.

Right now its only function is read_bbox_layout_xhtml(), which parses the XHTML files produced by pdftotext, part of poppler-utils (see the pdftotext manual for details).
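
As a rough sketch of where such an XHTML file comes from: pdftotext writes one when called with its -bbox-layout option. The helper pdf_to_bbox_xhtml() below is hypothetical and not part of {pdfparser}; it assumes pdftotext is installed and on your PATH.

```r
# Hypothetical helper, not part of {pdfparser}: convert a PDF to
# bbox-layout XHTML by calling the pdftotext command-line tool.
pdf_to_bbox_xhtml <- function(pdf_path,
                              html_path = sub("\\.pdf$", ".html", pdf_path,
                                              ignore.case = TRUE)) {
  # Fail early if poppler-utils is not installed or not on the PATH.
  if (!nzchar(Sys.which("pdftotext"))) {
    stop("pdftotext not found; install poppler-utils first.")
  }
  # -bbox-layout writes bounding boxes for blocks, lines and words as XHTML.
  status <- system2(
    "pdftotext",
    c("-bbox-layout", shQuote(pdf_path), shQuote(html_path))
  )
  if (status != 0) {
    stop("pdftotext exited with status ", status)
  }
  html_path
}
```

The returned path could then be passed to read_bbox_layout_xhtml().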

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("balthasars/pdfparser")

Example

This basic example parses an XHTML file that ships with the package:

library(pdfparser)
doc <- system.file("extdata", "edi_2009_frcho43c6mmlx5lyohqy_doc#immrrkosg.html", package = "pdfparser")
read_bbox_layout_xhtml(doc)
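
The return value of read_bbox_layout_xhtml() is not shown above. For orientation only, the following sketch uses {xml2} to pull word coordinates out of a tiny hand-written snippet shaped like pdftotext's bbox-layout output. It is not the package's implementation, the coordinate values are invented, and the column names are assumptions chosen for illustration.

```r
library(xml2)

# Hand-written stand-in for pdftotext -bbox-layout output (values are invented).
xhtml <- '<html xmlns="http://www.w3.org/1999/xhtml"><body><doc>
  <page width="595" height="842">
    <flow><block xMin="56" yMin="70" xMax="130" yMax="82">
      <line xMin="56" yMin="70" xMax="130" yMax="82">
        <word xMin="56" yMin="70" xMax="90" yMax="82">Hello</word>
        <word xMin="94" yMin="70" xMax="130" yMax="82">world</word>
      </line>
    </block></flow>
  </page>
</doc></body></html>'

doc <- read_xml(xhtml)
xml_ns_strip(doc)  # drop the XHTML namespace so plain XPath expressions work

# One row per word, with its bounding box in PDF coordinates.
words <- xml_find_all(doc, "//word")
word_boxes <- data.frame(
  text  = xml_text(words),
  x_min = as.numeric(xml_attr(words, "xMin")),
  y_min = as.numeric(xml_attr(words, "yMin")),
  x_max = as.numeric(xml_attr(words, "xMax")),
  y_max = as.numeric(xml_attr(words, "yMax")),
  stringsAsFactors = FALSE
)
word_boxes
```

The same XPath approach extends to the line and block elements, which carry the same four bounding-box attributes.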