A wrapper for pdftools::pdf_text() to read PDFs into R.
file | A path to a PDF file.
skip | Integer; the number of lines of the data file to skip before beginning to read data.
remove.empty | logical. If
trim | logical. If
ocr | logical. If
... | Other arguments passed to pdftools::pdf_text().
Returns a base::data.frame() with the page number (page_id), line number (element_id), and the text.
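The shape of that return value can be illustrated without a PDF at hand. The following is a minimal base R sketch, not textreadr's actual implementation: the `pages` vector is a hypothetical stand-in for the one-string-per-page output of pdftools::pdf_text(), and it assumes that element_id restarts at 1 on each page.

```r
# Hypothetical stand-in for the character vector pdftools::pdf_text() returns:
# one string per page, with lines separated by newlines.
pages <- c(
    "First line of page one\nSecond line of page one",
    "Only line of page two"
)

# Split each page into its individual lines.
lines_by_page <- strsplit(pages, "\\r\\n|\\n|\\r")

# Build one row per line, numbering the pages and the lines within each page
# (assumption: element_id counts lines within a page, not across the document).
n_lines <- lengths(lines_by_page)
pdf_like <- data.frame(
    page_id    = rep(seq_along(lines_by_page), n_lines),
    element_id = sequence(n_lines),
    text       = unlist(lines_by_page),
    stringsAsFactors = FALSE
)

pdf_like
```

Each row holds one line of text, so the usual data frame tools (subsetting by page_id, splitting on page, and so on) apply directly.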
A word of caution from Carl Witthoft: "Just a warning to others who may be hoping to extract data: PDF is a container, not a format. If the original document does not contain actual text, as opposed to bitmapped images of text or possibly even uglier things than I can imagine, nothing other than OCR can help you." If the reader has OCR needs, the tesseract package, available on CRAN (https://CRAN.R-project.org/package=tesseract), is an "OCR engine with Unicode (UTF-8) support" and may be of use.
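For readers who want to run OCR by hand, the sketch below shows one possible route: render each page to an image with pdftools::pdf_convert() and pass the images to tesseract::ocr(). The helper name `ocr_pdf_pages` and its `dpi` default are hypothetical and not part of textreadr; the sketch assumes the pdftools and tesseract packages (and the English training data) are installed.

```r
# Hypothetical helper (not part of textreadr): OCR every page of a PDF.
# Assumes the pdftools and tesseract packages are installed.
ocr_pdf_pages <- function(file, dpi = 300) {
    # Work in a temporary directory that is removed when the function exits.
    out_dir <- tempfile("ocr_pages_")
    dir.create(out_dir)
    on.exit(unlink(out_dir, recursive = TRUE), add = TRUE)

    # Render each page to its own PNG image.
    n_pages <- pdftools::pdf_info(file)$pages
    png_files <- file.path(out_dir, sprintf("page_%03d.png", seq_len(n_pages)))
    pdftools::pdf_convert(file, format = "png", dpi = dpi, filenames = png_files)

    # Run the OCR engine on each image; the result is one string per page.
    eng <- tesseract::tesseract("eng")
    vapply(
        png_files,
        function(img) tesseract::ocr(img, engine = eng),
        character(1),
        USE.NAMES = FALSE
    )
}
```

A call such as `ocr_pdf_pages("scan.pdf")` (with a path of your own) would return a character vector with one element per page. Higher `dpi` values tend to improve recognition at the cost of longer run times.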
## Read the example PDF shipped with textreadr
pdf_dat <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr")
)
## Same file, skipping the first line
pdf_dat_b <- read_pdf(
    system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr"),
    skip = 1
)
## Not run:
library(magrittr)  # provides %>%
library(textshape)
system.file("docs/rl10075oralhistoryst002.pdf", package = "textreadr") %>%
    read_pdf(1) %>%
    `[[`('text') %>%
    head(-1) %>%
    textshape::combine() %>%
    gsub("([A-Z])( )([A-Z])", "\\1_\\3", .) %>%
    strsplit("(-| )(?=[A-Z_]+:)", perl=TRUE) %>%
    `[[`(1) %>%
    textshape::split_transcript()
## End(Not run)
## Not run:
## An image-based .pdf file returns nothing. Using the tesseract package as
## a backend for OCR overcomes this problem.
## Non-OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = FALSE
)
## OCR
read_pdf(
    system.file("docs/McCune2002Choi2010.pdf", package = "textreadr"),
    ocr = TRUE
)
## End(Not run)