PDE_pdfs2table_searchandfilter: Extracting tables from a PDF (Portable Document Format) file
In erikstricker/PDE: Extract Tables and Sentences from PDFs with User Interface

Description

PDE_pdfs2table_searchandfilter extracts tables from a single PDF file according to filter and search words and writes output in the corresponding folder.

Usage

PDE_pdfs2table_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  table.heading.words = "",
  ignore.case.th = FALSE,
  search.words,
  search.word.categories = NULL,
  save.tab.by.category = FALSE,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  write.tab.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

`pdfs`	String. A list of paths to the PDF files to be analyzed.
`out`	String. Directory chosen to save analysis results in. Default: `"."`.
`filter.words`	List of strings. The list of filter words. If not `NA` or `""` a hit will be counted every time a word from the list is detected in the article. Default: `""`.
`regex.fw`	Logical. If `TRUE`, filter words are interpreted as regular expressions (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `TRUE`.
`ignore.case.fw`	Logical. Should capitalization be ignored when matching filter words? If `FALSE`, matching is case-sensitive. Default: `FALSE`.
`filter.word.times`	Numeric or string. The minimum number of `filter.words` hits required for a paper to be analyzed further. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding "%"). Default: `"0.2%"`.
`table.heading.words`	List of strings. Table headings to be detected in addition to the standard ones (TABLE, TAB or table plus number). Regex rules apply (see also https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `""`.
`ignore.case.th`	Logical. Should capitalization be ignored when matching the additional table headings (see `table.heading.words`)? If `FALSE`, matching is case-sensitive. Default: `FALSE`.
`search.words`	List of strings. List of search words. To extract all tables from the PDF file leave `search.words = ""`.
`search.word.categories`	List of strings. List of categories with the same length as the list of search words. Each search word can thus be assigned to a category, whose word counts are summarized in the `PDE_analyzer_word_stats.csv` file. If `search.word.categories` has a different length than `search.words`, the parameter is ignored. Default: `NULL`.
`save.tab.by.category`	Logical. Can only be used with `search.word.categories`. If set to `TRUE`, tables that carry search words will be saved in sub-folders according to the search word category of the detected search word. Default: `FALSE`.
`regex.sw`	Logical. If `TRUE`, search words are interpreted as regular expressions (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `TRUE`.
`ignore.case.sw`	Logical. Should capitalization be ignored when matching search words? If `FALSE`, matching is case-sensitive. Default: `FALSE`.
`eval.abbrevs`	Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: `TRUE`.
`out.table.format`	String. Output file format, either comma-separated (`.csv`) or tab-separated (`.tsv`). The encoding indicated in parentheses should be chosen according to the operating system in which the exported tables will be opened, i.e., Windows: `(WINDOWS-1252)`; Mac: `(macintosh)`; Linux: `(UTF-8)`. Default: `".csv"` with an encoding that depends on the operating system.
`dev_x`	Numeric. For a table, the size of indentation that is still considered the same column. Default: `20`.
`dev_y`	Numeric. For a table, the vertical distance that is still considered the same row. Can be either a number or `9999` for dynamic detection, in which case the font size is used to detect which words are in the same row. Default: `9999`.
`write.table.locations`	Logical. If `TRUE`, a separate file with the headings of all tables, their relative location in the generated html and txt files, as well as information if search words were found will be generated. Default: `FALSE`.
`exp.nondetc.tabs`	Logical. If `TRUE` and a table was detected in a PDF file but is an image or cannot be read, the page with the table will be exported as a png. Default: `TRUE`.
`write.tab.doc.file`	Logical. If `TRUE`, search words are used for table detection, and no search words were found in the tables of a PDF file, a no.table.w.search.words documentation file is written. Default: `TRUE`.
`delete`	Logical. If `TRUE`, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: `TRUE`.
`cpy_mv`	String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: `"nocpymv"`.
`verbose`	Logical. Indicates whether messages will be printed in the console. Default: `TRUE`.
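
The two forms of `filter.word.times` can be illustrated with a small base R sketch. The helper `min_filter_hits()` below is hypothetical and not part of PDE; it only shows one plausible reading of the argument, assuming a percentage is taken relative to the total word count of the document. How PDE rounds or compares the result internally is not covered here.

```r
# Hypothetical helper (not part of PDE): translate a filter.word.times value
# into a minimum number of filter-word hits for a document with n_words words.
min_filter_hits <- function(filter.word.times, n_words) {
  x <- as.character(filter.word.times)
  if (grepl("%$", x)) {
    # Percentage form, e.g. "0.2%": a share of the total number of words
    pct <- as.numeric(sub("%$", "", x))
    pct / 100 * n_words
  } else {
    # Absolute form, e.g. 3 or "3": used as given
    as.numeric(x)
  }
}

# A document with 5000 words
min_filter_hits("0.2%", 5000)  # percentage of the word count
min_filter_hits(3, 5000)       # absolute number of hits
```

Under this reading, a longer document needs proportionally more filter-word hits when a percentage is given, whereas an absolute number sets the same bar for every document.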

Value

If tables were extracted from the PDF file, the function returns a list of the following items: 1) htmltablelines, 2) txttablelines, 3) keeplayouttxttablelines, 4) id, 5) out_msg. The tablelines items are tables that give the heading and position of each detected table. The id gives the name of the PDF file. The out_msg contains all messages printed to the console, or the suppressed messages if verbose=FALSE.
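
A minimal sketch of how such a return value might be inspected. The list `res` is a hand-built stand-in, not real output: it assumes the list elements are named exactly as listed above, and its contents are invented for illustration.

```r
# Stand-in for a returned list. Element names follow the Value section;
# the contents are placeholders, not real PDE output.
res <- list(
  htmltablelines = NULL,
  txttablelines = NULL,
  keeplayouttxttablelines = NULL,
  id = "29973177_!",
  out_msg = c("first message", "second message")
)

# Check that all documented items are present before using them
expected <- c("htmltablelines", "txttablelines",
              "keeplayouttxttablelines", "id", "out_msg")
missing_items <- setdiff(expected, names(res))

if (length(missing_items) == 0) {
  cat("PDF:", res$id, "\n")
  # out_msg holds the console messages, also when verbose = FALSE
  writeLines(res$out_msg)
}
```

Checking the names first is a defensive choice, since the list is only described for the case in which tables were extracted.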

Examples


## Running a simple analysis with filter and search words to extract tables
if (isTRUE(PDE_check_Xpdf_install())) {
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE
  )
}
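
The search words in the example above are regular expressions such as `(M|m)ethotrexate`. The following sketch uses base R's `grepl()` on invented sentences to show what such a pattern matches, and how case-insensitive and literal matching differ. It illustrates general regex behaviour only; mapping `ignore.case = TRUE` to `ignore.case.sw = TRUE` and `fixed = TRUE` to `regex.sw = FALSE` is an assumption about intent, not a statement about PDE's internal calls.

```r
# Invented sentences for illustration
text <- c(
  "Methotrexate was given weekly.",
  "Patients received methotrexate.",
  "METHOTREXATE dose was reduced.",
  "No drug is mentioned here."
)

# Regex, case-sensitive: matches only a capital or lower-case first letter
regex_hits <- grepl("(M|m)ethotrexate", text)

# Plain word, case-insensitive: also matches the all-caps spelling
nocase_hits <- grepl("methotrexate", text, ignore.case = TRUE)

# Literal matching: the parentheses and "|" are treated as ordinary characters
literal_hits <- grepl("(M|m)ethotrexate", text, fixed = TRUE)

regex_hits    # TRUE TRUE FALSE FALSE
nocase_hits   # TRUE TRUE TRUE FALSE
literal_hits  # FALSE FALSE FALSE FALSE
```

The last result shows why a pattern like `(M|m)ethotrexate` would find nothing if it were treated as a literal string instead of a regular expression.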

## Running an advanced analysis with filter and search words to
## extract tables and obtain documentation files
if (isTRUE(PDE_check_Xpdf_install())) {
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE
  )
}
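
Neither example above sets `search.word.categories`. Because that argument is ignored when its length differs from `search.words`, one way to keep the two in sync is to define them together as a named vector. The sketch below is base R only; the category labels are hypothetical, and only the search words are taken from the examples.

```r
# Names are the search words from the examples; values are hypothetical
# category labels chosen for illustration.
term_categories <- c(
  "(M|m)ethotrexate" = "generic name",
  "(T|t)rexal"       = "brand name",
  "(R|r)heumatrex"   = "brand name",
  "(O|o)trexup"      = "brand name"
)

search.words <- names(term_categories)
search.word.categories <- unname(term_categories)

# Both vectors have the same length by construction
stopifnot(length(search.words) == length(search.word.categories))

# The search words are the same vector the examples build with strsplit()
same_as_example <- identical(
  search.words,
  strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]]
)
```

The two vectors could then be passed as `search.words = search.words` and `search.word.categories = search.word.categories`, optionally together with `save.tab.by.category = TRUE`.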

erikstricker/PDE documentation built on June 14, 2024, 1:34 p.m.