R/get_pdf_text.R

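# Extract the text of a PDF as one row per non-empty line.
# Returns a data.frame with columns page_id, element_id (line index within
# the page, counted after blank lines are dropped) and text.
#' @importFrom pdftools pdf_text
get_pdf_text <- function(file, ...) {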
  text <- pdftools::pdf_text(file, ...)
  Encoding(text) <- "UTF-8"
  # Split each page into lines; accept both "\r\n" and "\n" line endings
  text <- strsplit(text, split = "\\r?\\n")

  # Drop empty and whitespace-only lines on every page
  text <- lapply(text, function(x) x[!grepl("^\\s*$", x)])
  out <- data.frame(
    page_id = rep(seq_along(text), lengths(text)),
    element_id = unlist(lapply(text, seq_along)),
    text = unlist(text),
    stringsAsFactors = FALSE
  )
  out[["text"]] <- trimws(out[["text"]])
  out
}

revise documentation built on April 3, 2025, 11:47 p.m.