read_pdf: Read a Portable Document Format into R



Description

A wrapper for pdftools::pdf_text() to read PDFs into R.

Usage

read_pdf(file, skip = 0, remove.empty = TRUE, trim = TRUE, ocr = TRUE, ...)

Arguments

file

A path to a PDF file.

skip

Integer; the number of lines of the data file to skip before beginning to read data.

remove.empty

Logical. If TRUE, empty elements in the vector are removed.

trim

Logical. If TRUE, leading/trailing white space is removed.

ocr

Logical. If TRUE, documents for which pdftools::pdf_text() returns no text are re-run using OCR via the tesseract::ocr() function. This creates temporary .png files and requires considerably more compute time.

...

Other arguments passed to pdftools::pdf_text().
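
The effect of remove.empty and trim on a vector of lines can be illustrated with base R alone. This is a minimal sketch on made-up input, assuming "empty" means a line containing only white space; it is not necessarily how read_pdf() implements these options internally.

```r
# Made-up lines standing in for text pulled from one PDF page
lines <- c("  Title  ", "", "   ", "Body text ")

# remove.empty = TRUE: drop elements that contain only white space
non_empty <- lines[!grepl("^\\s*$", lines)]

# trim = TRUE: strip leading/trailing white space from what remains
trimmed <- trimws(non_empty)

print(trimmed)
```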

Value

Returns a base::data.frame() with the page number (page_id), line number (element_id), and the text.
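
Because the result is an ordinary data frame, it can be filtered or aggregated with base R. The sketch below builds a small stand-in data frame by hand, using the documented column names (page_id, element_id, text) with invented values, so it runs without a PDF; with real output from read_pdf() the same calls should apply.

```r
# Hand-built stand-in for the value returned by read_pdf(); values are invented
pdf_dat <- data.frame(
    page_id    = c(1L, 1L, 2L),
    element_id = c(1L, 2L, 1L),
    text       = c("First line", "Second line", "Line on page two"),
    stringsAsFactors = FALSE
)

# All lines that belong to page 2
page_two <- pdf_dat[pdf_dat$page_id == 2, "text"]

# Collapse the lines of each page back into one string per page
page_text <- tapply(pdf_dat$text, pdf_dat$page_id, paste, collapse = "\n")

print(page_two)
print(page_text)
```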

Note

A word of caution from Carl Witthoft: "Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you." If the reader has OCR needs, the tesseract package, available on CRAN (https://CRAN.R-project.org/package=tesseract), is an "OCR engine with Unicode (UTF-8) support" and may be of use.
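
For readers who want to see what an OCR fallback of the kind described for the ocr argument might look like, here is a rough sketch. The helper name pdf_text_with_ocr and its dpi default are hypothetical and not part of textreadr; the sketch assumes the pdftools and tesseract packages are installed and is not a copy of what read_pdf() does internally.

```r
# Hypothetical helper (not part of textreadr): return one string per page,
# falling back to OCR when no page yields any embedded text.
pdf_text_with_ocr <- function(file, dpi = 300) {
    stopifnot(requireNamespace("pdftools", quietly = TRUE))

    # One character string per page; image-only pages come back empty
    pages <- pdftools::pdf_text(file)

    # If any page has non-white-space text, assume a usable text layer
    if (any(!grepl("^\\s*$", pages))) return(pages)

    stopifnot(requireNamespace("tesseract", quietly = TRUE))

    # Render each page to a temporary .png and clean the files up on exit
    png_files <- file.path(tempdir(), sprintf("page_%03d.png", seq_along(pages)))
    on.exit(unlink(png_files), add = TRUE)
    pdftools::pdf_convert(file, format = "png", dpi = dpi,
        filenames = png_files, verbose = FALSE)

    # Run OCR on each rendered page image
    vapply(png_files, tesseract::ocr, character(1), USE.NAMES = FALSE)
}
```

Rendering at a higher dpi generally gives the OCR engine more to work with, at the cost of larger temporary files and longer run time.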

Examples

pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)

pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)

## Not run: 
library(textshape)
system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
    read_pdf(1) %>%
    `[[`('text') %>%
    head(-1) %>%
    textshape::combine() %>%
    gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%
    strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%
    `[[`(1) %>%
    textshape::split_transcript()

## End(Not run)

## Not run: 
## An image based .pdf file returns nothing.  Using the tesseract package as
## a backend for OCR overcomes this problem.

## Without OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = FALSE
)

## With OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = TRUE
)

## End(Not run)

textreadr documentation built on Oct. 9, 2021, 5:06 p.m.