PDE_extr_data_from_pdfs
extracts sentences or tables from a single PDF
file and writes output in the corresponding folder.
PDE_extr_data_from_pdfs(
pdfs,
whattoextr,
out = ".",
filter.words = "",
ignore.case.fw = FALSE,
filter.word.times = 20,
table.heading.words = "",
ignore.case.th = FALSE,
search.words,
ignore.case.sw = FALSE,
eval.abbrevs = TRUE,
out.table.format = ".csv (WINDOWS-1252)",
dev_x = 20,
dev_y = 9999,
context = 0,
write.table.locations = FALSE,
exp.nondetc.tabs = TRUE,
write.tab.doc.file = TRUE,
write.txt.doc.file = TRUE,
delete = TRUE,
verbose = TRUE
)
pdfs |
String. A list of paths to the PDF files to be analyzed. |
whattoextr |
String. One of txt, tab, or tabandtxt, selecting PDFS2TXT extraction (sentences from a PDF file), PDFS2TABLE extraction (tables from a PDF file to a Microsoft Excel file), or both. The tab mode extracts tables with or without search words, whereas txt and tabandtxt require search words. |
out |
String. Directory in which to save the analysis results. Default: "." |
filter.words |
List of strings. The list of filter words. If not
|
ignore.case.fw |
Logical. Should the capitalization of the filter words be
ignored (FALSE means matching is case-sensitive)? Default: FALSE |
filter.word.times |
Numeric. The minimum number of hits described for
|
table.heading.words |
List of strings. Additional table headings to detect, beyond the
standard ones (TABLE, TAB or table plus number). Regex rules apply (see
also
https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).
Default: "" |
ignore.case.th |
Logical. Are the additional table headings (see
|
search.words |
List of strings. List of search words. To extract all
tables from the PDF files leave |
ignore.case.sw |
Logical. Should the capitalization of the search words be
ignored (FALSE means matching is case-sensitive)? Default: FALSE |
eval.abbrevs |
Logical. Should abbreviations for the search words be
automatically detected and then replaced with the search word + "$*"?
Default: TRUE |
out.table.format |
String. Output file format. Either comma separated
file |
dev_x |
Numeric. For a table, the maximum difference in indentation that is
still considered the same column. Default: 20 |
dev_y |
Numeric. For a table, the maximum vertical distance that is still
considered the same row. Can be either a number or set to dynamic detection
[9999], in which case the font size is used to detect which words are in the
same row.
Default: 9999 |
context |
Numeric. Number of sentences extracted before and after the
sentence with the detected search word. If |
write.table.locations |
Logical. If |
exp.nondetc.tabs |
Logical. If |
write.tab.doc.file |
Logical. If |
write.txt.doc.file |
Logical. If |
delete |
Logical. If |
verbose |
Logical. Indicates whether messages will be printed in the console. Default: TRUE |
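The interaction between regex search words and the ignore.case.sw setting can be illustrated with base R alone. The sketch below reuses the search words from the Examples section and applies them with grepl(); the sentences are made up for illustration, and the matching shown is ordinary R regex behavior, assumed (not confirmed) to mirror how the package applies search.words internally.

```r
# Search words as used in the Examples section, split from one string.
search_words <- strsplit(
  "(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]]

# Combine the words into a single alternation pattern.
pattern <- paste(search_words, collapse = "|")

# Made-up sentences, for illustration only.
sentences <- c("Patients received methotrexate weekly.",
               "METHOTREXATE was discontinued.",
               "No treatment was given.")

# Case-sensitive matching: only the capitalizations spelled out in the
# regex, e.g. (M|m), are accepted.
case_sensitive <- grepl(pattern, sentences, ignore.case = FALSE)

# Case-insensitive matching: any capitalization is accepted.
case_insensitive <- grepl(pattern, sentences, ignore.case = TRUE)
```

With case-sensitive matching, alternations such as (M|m) are what allow both "Methotrexate" and "methotrexate" to be found; a fully upper-case spelling is only matched when case is ignored.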
If tables were extracted from the PDF file, the function returns a list of
the following tables/items: 1) htmltablelines, 2)
txttablelines, 3) keeplayouttxttablelines, 4) id,
5) out_msg.
The tablelines are tables that provide the heading and position of
the detected tables. The id provides the name of the PDF file. The
out_msg includes all messages printed to the console, or the suppressed
messages if verbose=FALSE.
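Because the list is only returned when tables were extracted, it can help to check a result before using it. The following is a minimal sketch, assuming the list elements are named exactly as the items listed above; check_pde_result is a hypothetical helper and the placeholder object is a stand-in for illustration, neither is part of the package.

```r
# Item names as listed in the return value description
# (assumed to be the names of the list elements).
expected_items <- c("htmltablelines", "txttablelines",
                    "keeplayouttxttablelines", "id", "out_msg")

# Hypothetical helper (not part of PDE): report which of the
# documented items are present in a result object.
check_pde_result <- function(res) {
  present <- if (is.list(res)) expected_items %in% names(res) else
    rep(FALSE, length(expected_items))
  setNames(present, expected_items)
}

# Stand-in object for illustration only; a real result would come
# from a call to PDE_extr_data_from_pdfs().
placeholder <- list(id = "example", out_msg = character(0))
found <- check_pde_result(placeholder)
```

Here found is a named logical vector, so found[["id"]] can be used to decide whether it is safe to access that item.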
PDE_pdfs2table, PDE_pdfs2table_searchandfilter, PDE_pdfs2txt_searchandfilter
## Running a simple analysis with filter and search words to extract sentences and tables
if(PDE_check_Xpdf_install() == TRUE){
  ## system.file() returns the package path without a trailing slash,
  ## so the relative paths must start with "/"
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"),"/examples/MTX_all_files+-0_test/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE)
}
## Running an advanced analysis with filter and search words to
## extract sentences and tables and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
  ## system.file() returns the package path without a trailing slash,
  ## so the relative paths must start with "/"
  outputtables <- PDE_extr_data_from_pdfs(
    pdfs = c(paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/29973177_!.pdf"),
             paste0(system.file(package = "PDE"),
                    "/examples/Methotrexate/31083238_!.pdf")),
    whattoextr = "tabandtxt",
    out = paste0(system.file(package = "PDE"),"/examples/MTX_all_files+-1_test/"),
    context = 1,
    dev_x = 20,
    dev_y = 9999,
    filter.words = strsplit("cohort;case-control;group;study population;study participants",";")[[1]],
    ignore.case.fw = TRUE,
    filter.word.times = 20,
    table.heading.words = "",
    ignore.case.th = FALSE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.table.locations = TRUE,
    write.tab.doc.file = TRUE,
    write.txt.doc.file = TRUE,
    exp.nondetc.tabs = TRUE,
    delete = TRUE)
}
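The examples above write their output into the installed package directory, which may not be writable on every system. The variant below is a sketch, not taken from the package documentation: it locates the bundled example PDFs with system.file() and writes to a folder under tempdir(). The folder name MTX_tempdir_test is made up, and the trailing slash on out simply follows the convention used in the examples above.

```r
## Output folder under the session's temporary directory
## (the folder name is an arbitrary choice for this sketch).
out_dir <- file.path(tempdir(), "MTX_tempdir_test")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

## Locate the bundled example PDFs; system.file() returns "" when the
## package or the files are not found, so empty entries are dropped.
pdf_paths <- system.file("examples", "Methotrexate",
                         c("29973177_!.pdf", "31083238_!.pdf"),
                         package = "PDE")
pdf_paths <- pdf_paths[nzchar(pdf_paths)]

## Run the extraction only if PDE, the example files, and Xpdf are available.
if (requireNamespace("PDE", quietly = TRUE) &&
    length(pdf_paths) > 0 &&
    isTRUE(PDE::PDE_check_Xpdf_install())) {
  outputtables <- PDE::PDE_extr_data_from_pdfs(
    pdfs = pdf_paths,
    whattoextr = "tabandtxt",
    out = paste0(out_dir, "/"),
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal", ";")[[1]],
    ignore.case.sw = FALSE)
}
```

Passing several file names to system.file() returns only the paths that exist, so the guard skips the call cleanly when the example files are missing.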