PDE_pdfs2table

R Documentation

Extracting all tables from a PDF (Portable Document Format) file

Description

PDE_pdfs2table extracts all tables from a single PDF file and writes the output to the corresponding folder.

Usage

PDE_pdfs2table(
  pdfs,
  out = ".",
  table.heading.words = "",
  ignore.case.th = FALSE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  delete = TRUE,
  verbose = TRUE
)

Arguments

`pdfs`	String. A list of paths to the PDF files to be analyzed.
`out`	String. Directory chosen to save tables in. Default: `"."`.
`table.heading.words`	List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `""`.
`ignore.case.th`	Logical. Should capitalization be ignored when matching the additional table headings (see `table.heading.words`)? Default: `FALSE`.
`out.table.format`	String. Output file format. Either comma-separated file `.csv` or tab-separated file `.tsv`. The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: `(WINDOWS-1252)`; Mac: `(macintosh)`; Linux: `(UTF-8)`. Default: `".csv"` and encoding depending on the operating system.
`dev_x`	Numeric. For a table, the horizontal deviation (indentation) within which words are still considered to be in the same column. Default: `20`.
`dev_y`	Numeric. For a table, the vertical distance within which words are still considered to be in the same row. Can be either a number or set to dynamic detection (`9999`), in which case the font size is used to detect which words are in the same row. Default: `9999`.
`write.table.locations`	Logical. If `TRUE`, a separate file will be generated with the headings of all tables, their relative location in the generated html and txt files, and whether search words were found. Default: `FALSE`.
`exp.nondetc.tabs`	Logical. If `TRUE` and a table was detected in a PDF file but is an image or cannot be read, the page containing the table will be exported as a png. Default: `TRUE`.
`delete`	Logical. If `TRUE`, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: `TRUE`.
`verbose`	Logical. Indicates whether messages will be printed in the console. Default: `TRUE`.
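
The encoding part of `out.table.format` depends on the operating system in which the tables will be opened. The following is a minimal sketch of building that string at run time. It assumes the value has the form extension, space, encoding in parentheses, as in the default shown under Usage. `pick_table_format` is a hypothetical helper and not part of PDE.

```r
# Hypothetical helper (not part of PDE): build an out.table.format string
# such as ".csv (WINDOWS-1252)" from the operating system name.
pick_table_format <- function(ext = ".csv",
                              sysname = Sys.info()[["sysname"]]) {
  encoding <- switch(sysname,
    Windows = "WINDOWS-1252",  # Windows
    Darwin  = "macintosh",     # macOS reports "Darwin"
    "UTF-8"                    # Linux and any other system
  )
  paste0(ext, " (", encoding, ")")
}

# Format string for the machine this code runs on
table_format <- pick_table_format(".csv")
```

The result could then be passed on as `out.table.format = table_format`. If the tables are opened on a different machine than the one running the extraction, pass that machine's system name as `sysname` instead.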

Examples

## Running a simple table extraction
if (PDE_check_Xpdf_install() == TRUE) {
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"))
}

## Running the same table extraction as above with all parameters shown
if (PDE_check_Xpdf_install() == TRUE) {
  outputtables <- PDE_pdfs2table(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20,
    dev_y = 9999,
    table.heading.words = "",
    ignore.case.th = FALSE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = FALSE,
    exp.nondetc.tabs = FALSE,
    delete = TRUE)
}
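
A further sketch of how `table.heading.words` and `ignore.case.th` might be used together. The heading patterns and sample lines are made up for illustration, and the base R `grepl` check only approximates how such patterns behave; how PDE applies them internally is not described here. The call also assumes that a character vector is accepted for `table.heading.words`.

```r
# Hypothetical additional headings (regex); not taken from the PDE examples.
extra_headings <- c("Supplementary Table", "Suppl\\. Tab")

# Illustration only: base R regex matching on made-up heading lines.
sample_lines <- c("SUPPLEMENTARY TABLE 2. Dosing", "Suppl. Tab 1", "Figure 3")
pattern <- paste(extra_headings, collapse = "|")
hits_sensitive   <- grepl(pattern, sample_lines)                      # case matters
hits_insensitive <- grepl(pattern, sample_lines, ignore.case = TRUE)  # case ignored

# Run the extraction only if PDE and its Xpdf dependency are available.
if (requireNamespace("PDE", quietly = TRUE) &&
    isTRUE(PDE::PDE_check_Xpdf_install())) {
  pdf_path <- system.file("examples", "Methotrexate", "29973177_!.pdf",
                          package = "PDE")
  out_dir <- file.path(tempdir(), "29973177_tables")
  dir.create(out_dir, showWarnings = FALSE)
  outputtables <- PDE::PDE_pdfs2table(
    pdfs = pdf_path,
    out = out_dir,
    table.heading.words = extra_headings,
    ignore.case.th = TRUE
  )
}
```

Writing to a folder under `tempdir()` keeps the installed package directory unchanged; any writable directory can be used for `out`.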

PDE documentation built on June 22, 2024, 10:44 a.m.