crm_extract: Extract text from a single PDF document
In crminer: Fetch 'Scholarly' Full Text from 'Crossref'


Extract text from a single PDF document.

crm_extract(path = NULL, raw = NULL, try_ocr = FALSE, ...)

`path`	(character) path to a file; the file must exist
`raw`	(raw) raw bytes of a PDF
`try_ocr`	(logical) whether to try extracting text from OCRed pages with `pdftools::pdf_ocr_text()`. Default: `FALSE`, in which case `pdftools::pdf_text()` is used
`...`	arguments passed on to `pdftools::pdf_info()` and `pdftools::pdf_text()` (or `pdftools::pdf_ocr_text()` if `try_ocr = TRUE`). Any arguments supplied are passed to both of those function calls

PDF text extraction is done with pdftools under the hood.

You have to supply either `path` or `raw`, not both.
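
The "either `path` or `raw`, not both" rule can be sketched in base R. This is only an illustration: `check_pdf_input()` is a hypothetical helper that is not part of crminer, and it is not claimed to match how `crm_extract()` validates its arguments internally. The sketch also shows one way to read a file into the raw bytes that the `raw` argument expects, using a placeholder file rather than a real PDF.

```r
# Hypothetical helper (NOT part of crminer): accepts exactly one of
# `path` or `raw` and reports which input was supplied.
check_pdf_input <- function(path = NULL, raw = NULL) {
  has_path <- !is.null(path)
  has_raw <- !is.null(raw)
  # Both supplied, or neither supplied: reject
  if (has_path == has_raw) {
    stop("supply exactly one of 'path' or 'raw'", call. = FALSE)
  }
  if (has_path) {
    stopifnot(is.character(path), length(path) == 1)
    if (!file.exists(path)) {
      stop("file does not exist: ", path, call. = FALSE)
    }
    return("path")
  }
  stopifnot(is.raw(raw))
  "raw"
}

# Write a placeholder file (not a valid PDF) and read it back as raw bytes
tmp <- tempfile(fileext = ".pdf")
writeBin(charToRaw("%PDF-1.4 placeholder"), tmp)
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

check_pdf_input(path = tmp)
check_pdf_input(raw = bytes)
```

With a real PDF on disk, the same `readBin()` call yields a raw vector that could be handed to `crm_extract(raw = ...)` in place of `path`.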

An object of class `crm_pdf` with a slot for `info` (essentially the PDF metadata) and a slot for `text` (the extracted text), plus an attribute `path` holding the path to the PDF on disk.

path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
res$text
# with newlines, pretty print
cat(res$text)

# another example
path <- system.file("examples", "ChamberlainEtal2013Ecosphere.pdf",
   package = "crminer")
(res <- crm_extract(path))
res$info
cat(res$text)

# with raw pdf bytes (stored in an rds file)
path <- system.file("examples", "raw-example.rds", package = "crminer")
rds <- readRDS(path)
class(rds)
crm_extract(raw = rds)