PDE_pdfs2table_searchandfilter | R Documentation |
PDE_pdfs2table_searchandfilter
extracts tables from a single PDF file
according to filter and search words and writes output in the corresponding
folder.
PDE_pdfs2table_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  table.heading.words = "",
  ignore.case.th = FALSE,
  search.words,
  search.word.categories = NULL,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  dev_x = 20,
  dev_y = 9999,
  write.table.locations = FALSE,
  exp.nondetc.tabs = TRUE,
  write.tab.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)
pdfs |
String. A list of paths to the PDF files to be analyzed. |
out |
String. Directory in which the analysis results are saved. Default: "." |
filter.words |
List of strings. The list of filter words. If not
|
regex.fw |
Logical. If TRUE filter words will follow the regex rules
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default = TRUE |
ignore.case.fw |
Logical. If TRUE, capitalization is ignored when matching the filter
words. Default: FALSE |
filter.word.times |
Numeric or string. Can either be expressed as absolute number or percentage
of the total number of words (by adding the "
|
table.heading.words |
List of strings. Table headings to be detected in addition to the standard
ones (TABLE, TAB or table plus number). Regex rules apply (see
also
https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default = "" |
ignore.case.th |
Logical. Are the additional table headings (see
|
search.words |
List of strings. List of search words. To extract all
tables from the PDF file leave |
search.word.categories |
List of strings. List of categories with the
same length as the list of search words. Accordingly, each search word can be
assigned to a category, of which the word counts will be summarized in the
|
regex.sw |
Logical. If TRUE search words will follow the regex rules
(see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf).
Default = TRUE |
ignore.case.sw |
Logical. If TRUE, capitalization is ignored when matching the search
words. Default: FALSE |
eval.abbrevs |
Logical. Should abbreviations for the search words be
automatically detected and then replaced with the search word + "$*"?
Default: TRUE |
out.table.format |
String. Output file format. Either comma separated
file |
dev_x |
Numeric. For a table, the size of the indentation within which words are
considered to be in the same column. Default: 20 |
dev_y |
Numeric. For a table, the vertical distance within which words are
considered to be in the same row. Can be either a number or set to dynamic
detection [9999], in which case the font size is used to detect which words
are in the same row.
Default: 9999 |
write.table.locations |
Logical. If |
exp.nondetc.tabs |
Logical. If |
write.tab.doc.file |
Logical. If |
delete |
Logical. If |
cpy_mv |
String. Either "nocpymv", "cpy", or "mv". If filter words are used in the
analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the
/pdf/ subfolder of the output folder. Default: "nocpymv" |
verbose |
Logical. Indicates whether messages will be printed in the console. Default: TRUE |
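The argument descriptions above can be illustrated with a small base R sketch that does not call PDE itself. It shows search words paired with equal-length categories, the difference between regex and literal matching (analogous to regex.sw = TRUE and FALSE), and one plausible reading of a percentage value for filter.word.times. The category labels, the sample sentences, and the helper filter_threshold() are assumptions made for illustration; the package's internal matching and rounding may differ.

```r
# Search words and categories must have the same length
# (category labels here are hypothetical).
search.words <- c("(M|m)ethotrexate", "(T|t)rexal",
                  "(R|r)heumatrex", "(O|o)trexup")
search.word.categories <- c("generic", "brand", "brand", "brand")
stopifnot(length(search.words) == length(search.word.categories))

# Made-up sentences standing in for text extracted from a PDF.
sample_text <- c("Methotrexate was given weekly.",
                 "Patients received trexal.",
                 "No drug mentioned.")

# Regex matching, analogous to regex.sw = TRUE:
# "(M|m)ethotrexate" matches both capitalizations.
regex_hits <- sapply(search.words, function(p) grepl(p, sample_text))

# Literal matching, analogous to regex.sw = FALSE:
# the parentheses and "|" are treated as ordinary characters.
literal_hits <- sapply(search.words,
                       function(p) grepl(p, sample_text, fixed = TRUE))

# Hypothetical helper: turn a filter.word.times value into a word count.
# A string ending in "%" is read as a share of the total word count,
# anything else as an absolute number.
filter_threshold <- function(filter.word.times, total.words) {
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    pct <- as.numeric(sub("%$", "", filter.word.times))
    return(pct / 100 * total.words)
  }
  as.numeric(filter.word.times)
}

print(regex_hits)
print(filter_threshold("0.2%", total.words = 5000))
```

Under this reading, the default "0.2%" applied to a 5000-word document corresponds to roughly 10 filter word occurrences, while a plain number such as 3 is used as is.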
If tables were extracted from the PDF file, the function returns a list of
the following tables/items: 1) htmltablelines, 2)
txttablelines, 3) keeplayouttxttablelines, 4) id,
5) out_msg.
The tablelines are tables that provide the heading and position of
the detected tables. The id provides the name of the PDF file. The
out_msg includes all messages printed to the console, or the suppressed
messages if verbose=FALSE.
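As a minimal sketch of how the returned list might be inspected, the following base R code checks which of the five documented items are present. The helper check_result_items() and the mock result are assumptions for illustration only; the mock merely stands in for real output, whose element types are not specified here.

```r
# Item names as listed in the return value description.
expected_items <- c("htmltablelines", "txttablelines",
                    "keeplayouttxttablelines", "id", "out_msg")

# Hypothetical helper: report which documented items a result contains.
check_result_items <- function(result) {
  present <- expected_items %in% names(result)
  names(present) <- expected_items
  present
}

# Mock stand-in for a real result (contents are placeholders, not real output).
mock_result <- list(
  htmltablelines = data.frame(),
  txttablelines = data.frame(),
  keeplayouttxttablelines = data.frame(),
  id = "29973177_!",
  out_msg = character(0)
)

print(check_result_items(mock_result))
```

With a real analysis, the object assigned from PDE_pdfs2table_searchandfilter() would take the place of mock_result.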
PDE_extr_data_from_pdfs, PDE_pdfs2table
## Running a simple analysis with filter and search words to extract tables
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE)
}

## Running an advanced analysis with filter and search words to
## extract tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  outputtables <- PDE_pdfs2table_searchandfilter(
    pdf = paste0(system.file(package = "PDE"),
                 "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/29973177_tables/"),
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    regex.fw = FALSE,
    ignore.case.fw = TRUE,
    filter.word.times = "0.2%",
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    regex.sw = TRUE,
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    cpy_mv = "nocpymv",
    delete = TRUE)
}