PDE_pdfs2txt_searchandfilter: Extracting sentences from a PDF (Portable Document Format)...

View source: R/PDE.R

Description

PDE_pdfs2txt_searchandfilter extracts sentences from one or more PDF files according to search and filter words and writes the output to the corresponding folder.

Usage

PDE_pdfs2txt_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  ignore.case.fw = FALSE,
  filter.word.times = 20,
  search.words,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  context = 0,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  verbose = TRUE
)

Arguments

pdfs

String. A list of paths to the PDF files to be analyzed.

out

String. Directory chosen to save analysis results in. Default: ".".

filter.words

List of strings. The list of filter words. If not NA or "", a hit is counted every time a word from the list is detected in the article. Regex rules apply (see also https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf). Default: "".

ignore.case.fw

Logical. Should capitalization be ignored when matching the filter words? If FALSE, matching is case-sensitive. Default: FALSE.

filter.word.times

Numeric. The minimum number of filter.words hits required for a paper to be analyzed further. Default: 20.
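The interplay of filter.words and filter.word.times can be illustrated with a minimal base R sketch. The helper count_filter_hits is hypothetical and is not part of PDE; it only assumes that every regex match of a filter word counts as one hit and that the total is compared against the threshold.

```r
# Hypothetical helper (not part of PDE): count all regex matches of the
# filter words in a single text string.
count_filter_hits <- function(text, filter.words, ignore.case = FALSE) {
  hits <- vapply(filter.words, function(word) {
    m <- gregexpr(word, text, ignore.case = ignore.case)[[1]]
    # gregexpr() returns -1 when there is no match
    if (m[1] == -1L) 0L else length(m)
  }, integer(1))
  sum(hits)
}

text <- "The cohort was split into one study group and one control group."
filter.words <- c("cohort", "group")

total_hits <- count_filter_hits(text, filter.words)

# Assumed threshold logic: the text passes if the hit count reaches the minimum
filter.word.times <- 3
passes_filter <- total_hits >= filter.word.times
```

With the default threshold of 20, a short text like the one above would not pass; the threshold of 3 is chosen only to keep the illustration small.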

search.words

List of strings. List of search words. Regex rules apply (see also https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf).

ignore.case.sw

Logical. Should capitalization be ignored when matching the search words? If FALSE, matching is case-sensitive. Default: FALSE.
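As an illustration of how a regex search word interacts with case sensitivity, the following sketch uses base R's grepl. It assumes that ignore.case.sw behaves like the ignore.case argument of base R's regex functions; the example sentences are made up.

```r
# Made-up example sentences
sentences <- c("Methotrexate was given weekly.",
               "Patients on methotrexate improved.",
               "METHOTREXATE dosing varied.")

# Search word written as a regex that accepts an upper- or lowercase first letter
pattern <- "(M|m)ethotrexate"

# Case-sensitive matching: the all-caps spelling is not matched
case_sensitive <- grepl(pattern, sentences, ignore.case = FALSE)

# Case-insensitive matching: every spelling is matched
case_insensitive <- grepl(pattern, sentences, ignore.case = TRUE)
```

Writing the alternation (M|m) into the search word keeps matching case-sensitive while still accepting the two common spellings.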

eval.abbrevs

Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: TRUE.

out.table.format

String. Output file format. Either a comma-separated file (.csv) or a tab-separated file (.tsv). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: "(WINDOWS-1252)"; Mac: "(macintosh)"; Linux: "(UTF-8)". Default: ".csv" with an encoding that depends on the operating system.
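A format string such as ".csv (WINDOWS-1252)" combines a file extension with an encoding. The sketch below shows one way such a string could be split into its parts; parse_table_format is a hypothetical helper written for illustration and does not reflect how PDE handles the argument internally.

```r
# Hypothetical helper (not part of PDE): split a format string into
# file extension, field separator and encoding.
parse_table_format <- function(out.table.format) {
  # Extract ".csv" or ".tsv"
  ext <- regmatches(out.table.format,
                    regexpr("\\.(csv|tsv)", out.table.format))
  # Extract the text between parentheses, if any
  has_enc <- grepl("\\(.+\\)", out.table.format)
  enc <- if (has_enc) {
    sub(".*\\((.+)\\).*", "\\1", out.table.format)
  } else {
    NA_character_
  }
  list(extension = ext,
       sep = if (identical(ext, ".tsv")) "\t" else ",",
       encoding = enc)
}

fmt <- parse_table_format(".csv (WINDOWS-1252)")
fmt_tsv <- parse_table_format(".tsv (UTF-8)")
```

The resulting separator and encoding could then be handed to base R's write.table via its sep and fileEncoding arguments.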

context

Numeric. Number of sentences extracted before and after the sentence with the detected search word. If 0 only the sentence with the search word is extracted. Default: 0.
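The effect of context can be sketched in a few lines of base R. The helper extract_with_context is hypothetical and only illustrates the idea of a sentence window around each hit; it assumes the text has already been split into sentences and is not PDE's implementation.

```r
# Hypothetical helper (not part of PDE): return each matching sentence
# together with `context` sentences before and after it.
extract_with_context <- function(sentences, pattern, context = 0) {
  hits <- grep(pattern, sentences)
  lapply(hits, function(i) {
    from <- max(1, i - context)                 # do not run past the start
    to <- min(length(sentences), i + context)   # do not run past the end
    sentences[from:to]
  })
}

sentences <- c("Patients were enrolled in 2015.",
               "All received methotrexate weekly.",
               "Follow-up lasted two years.",
               "No serious adverse events occurred.")

only_hit <- extract_with_context(sentences, "methotrexate", context = 0)
with_neighbors <- extract_with_context(sentences, "methotrexate", context = 1)
```

Here only_hit holds just the matching sentence, while with_neighbors holds the first three sentences.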

write.txt.doc.file

Logical. If TRUE, a file is created in each of the following cases: if no search words were found in the sentences of a PDF file, the file is named with the PDF filename followed by no.txt.w.search.words; if the PDF file is empty, the PDF filename is followed by no.content.detected; if the filter word threshold is not met, the PDF filename is followed by no.txt.w.filter.words. Default: TRUE.

delete

Logical. If TRUE, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: TRUE.

verbose

Logical. Indicates whether messages will be printed in the console. Default: TRUE.

See Also

PDE_extr_data_from_pdfs

Examples

## Running a simple analysis with filter and search words to extract sentences
library(PDE)
## Only run if the Xpdf command line tools are installed
if (isTRUE(PDE_check_Xpdf_install())) {
  outputtables <- PDE_pdfs2txt_searchandfilter(
    ## Note the leading "/" when appending to the package directory
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-0/"),
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE
  )
}

## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if (isTRUE(PDE_check_Xpdf_install())) {
  outputtables <- PDE_pdfs2txt_searchandfilter(
    ## Note the leading "/" when appending to the package directory
    pdfs = paste0(system.file(package = "PDE"),
                  "/examples/Methotrexate/29973177_!.pdf"),
    out = paste0(system.file(package = "PDE"), "/examples/MTX_txt+-1/"),
    ## Also extract one sentence before and after each hit
    context = 1,
    filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
    ignore.case.fw = TRUE,
    filter.word.times = 20,
    search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
    ignore.case.sw = FALSE,
    eval.abbrevs = TRUE,
    out.table.format = ".csv (WINDOWS-1252)",
    write.txt.doc.file = TRUE,
    delete = TRUE
  )
}

PDE documentation built on April 1, 2021, 5:06 p.m.