crm_extract | R Documentation |
Extract text from a single pdf document
crm_extract(path = NULL, raw = NULL, try_ocr = FALSE, ...)
path | (character) path to a file; the file must exist |
raw | (raw) raw bytes |
try_ocr | (logical) whether to try extracting OCRed pages with |
... | args passed on to |
We use pdftools under the hood to do PDF text extraction.
You have to supply either path or raw, not both.
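The "either path or raw, not both" rule can be sketched as a small argument check. This is only an illustration in base R: `check_input` is a hypothetical helper, not part of crminer, and the package's own validation may differ.

```r
# Hypothetical helper (not part of crminer): enforce that exactly one of
# `path` or `raw` is supplied, mirroring the rule described above.
check_input <- function(path = NULL, raw = NULL) {
  n_supplied <- sum(!is.null(path), !is.null(raw))
  if (n_supplied != 1) {
    stop("supply exactly one of 'path' or 'raw'", call. = FALSE)
  }
  if (!is.null(path)) {
    # `path` must be a single string pointing at an existing file
    stopifnot(is.character(path), length(path) == 1, file.exists(path))
    return("path")
  }
  # otherwise `raw` must be a raw vector
  stopifnot(is.raw(raw))
  "raw"
}
```

Calling `check_input()` with neither argument, or with both, raises an error; otherwise it returns the name of the argument that was supplied.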
An object of class crm_pdf with a slot for info (PDF metadata, essentially) and text (the extracted text), with an attribute (path) giving the path to the PDF on disk.
path <- system.file("examples", "MairChamberlain2014RJournal.pdf", package = "crminer")
(res <- crm_extract(path))
res$info
res$text
# with newlines, pretty print
cat(res$text)

# another example
path <- system.file("examples", "ChamberlainEtal2013Ecosphere.pdf", package = "crminer")
(res <- crm_extract(path))
res$info
cat(res$text)

# with raw pdf bytes
path <- system.file("examples", "raw-example.rds", package = "crminer")
rds <- readRDS(path)
class(rds)
crm_extract(raw = rds)
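The raw-bytes example above loads a pre-saved object from an .rds file. When the bytes are not already saved that way, one possible way to obtain a raw vector from a PDF on disk is base R's `readBin()`. The helper name `read_pdf_bytes` is hypothetical, and the commented usage assumes that `crm_extract(raw = ...)` accepts any raw vector holding a complete PDF.

```r
# Hypothetical helper (not part of crminer): read a whole file into a
# raw vector using base R only.
read_pdf_bytes <- function(path) {
  stopifnot(is.character(path), length(path) == 1, file.exists(path))
  readBin(path, what = "raw", n = file.size(path))
}

# Possible usage, assuming crminer is installed:
# pdf_path <- system.file("examples", "MairChamberlain2014RJournal.pdf",
#                         package = "crminer")
# bytes <- read_pdf_bytes(pdf_path)
# crm_extract(raw = bytes)
```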