PDE_pdfs2txt_searchandfilter: Extracting sentences from a PDF (Portable Document Format)...
In PDE: Extract Tables and Sentences from PDFs with User Interface

PDE_pdfs2txt_searchandfilter

R Documentation

Extracting sentences from a PDF (Portable Document Format) file

Description

PDE_pdfs2txt_searchandfilter extracts sentences from a single PDF file according to search and filter words and writes the output to the corresponding folder.

Usage

PDE_pdfs2txt_searchandfilter(
  pdfs,
  out = ".",
  filter.words = "",
  regex.fw = TRUE,
  ignore.case.fw = FALSE,
  filter.word.times = "0.2%",
  search.words,
  search.word.categories = NULL,
  regex.sw = TRUE,
  ignore.case.sw = FALSE,
  eval.abbrevs = TRUE,
  out.table.format = ".csv (WINDOWS-1252)",
  context = 0,
  write.txt.doc.file = TRUE,
  delete = TRUE,
  cpy_mv = "nocpymv",
  verbose = TRUE
)

Arguments

`pdfs`	String. A list of paths to the PDF files to be analyzed.
`out`	String. Directory chosen to save analysis results in. Default: `"."`.
`filter.words`	List of strings. The list of filter words. If not `NA` or `""`, a hit is counted every time a word from the list is detected in the article. Default: `""`.
`regex.fw`	Logical. If `TRUE`, filter words are interpreted according to the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `TRUE`.
`ignore.case.fw`	Logical. Should capitalization be ignored when matching the filter words? If `FALSE`, matching is case-sensitive. Default: `FALSE`.
`filter.word.times`	Numeric or string. Can be expressed either as an absolute number or as a percentage of the total number of words (by adding the "%" sign). The minimum number of hits for `filter.words` required for a paper to be further analyzed. Default: `"0.2%"`.
`search.words`	List of strings. List of search words.
`search.word.categories`	List of strings. List of categories with the same length as the list of search words. Accordingly, each search word can be assigned to a category, for which the word counts will be summarized in the `PDE_analyzer_word_stats.csv` file. If `search.word.categories` has a different length than `search.words`, the parameter will be ignored. Default: `NULL`.
`regex.sw`	Logical. If `TRUE`, search words are interpreted according to the regex rules (see https://github.com/erikstricker/PDE/blob/master/inst/examples/cheatsheets/regex.pdf). Default: `TRUE`.
`ignore.case.sw`	Logical. Should capitalization be ignored when matching the search words? If `FALSE`, matching is case-sensitive. Default: `FALSE`.
`eval.abbrevs`	Logical. Should abbreviations for the search words be automatically detected and then replaced with the search word + "$*"? Default: `TRUE`.
`out.table.format`	String. Output file format. Either a comma-separated file (`.csv`) or a tab-separated file (`.tsv`). The encoding indicated in parentheses should be selected according to the operating system in which the exported tables are opened, i.e., Windows: `(WINDOWS-1252)`; Mac: `(macintosh)`; Linux: `(UTF-8)`. Default: `".csv"` with the encoding depending on the operating system.
`context`	Numeric. Number of sentences extracted before and after the sentence with the detected search word. If `0` only the sentence with the search word is extracted. Default: `0`.
`write.txt.doc.file`	Logical. If `TRUE`, a file is created in each of the following cases: if no search words were found in the sentences of a PDF file, the file is named with the PDF filename followed by `no.txt.w.search.words`; if the PDF file is empty, the PDF filename followed by `no.content.detected`; if the filter word threshold is not met, the PDF filename followed by `no.txt.w.filter.words`. Default: `TRUE`.
`delete`	Logical. If `TRUE`, the intermediate txt, keeplayouttxt and html copies of the PDF file will be deleted. Default: `TRUE`.
`cpy_mv`	String. Either "nocpymv", "cpy", or "mv". If filter words are used in the analyses, the processed PDF files will either be copied ("cpy") or moved ("mv") into the /pdf/ subfolder of the output folder. Default: `"nocpymv"`.
`verbose`	Logical. Indicates whether messages will be printed in the console. Default: `TRUE`.
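
How `filter.words`, `regex.fw`, `ignore.case.fw` and `filter.word.times` interact can be pictured with a minimal base-R sketch. This is not the package's implementation: the helpers `count_filter_hits()` and `passes_filter()` are hypothetical names, and splitting words on whitespace, counting one hit per match, and comparing with `>=` are assumptions made for illustration only.

```r
# Hypothetical helper: count how often any filter word occurs in a text.
count_filter_hits <- function(text, filter.words,
                              regex.fw = TRUE, ignore.case.fw = FALSE) {
  total <- 0
  for (fw in filter.words) {
    if (regex.fw) {
      # Treat the filter word as a regular expression
      m <- gregexpr(fw, text, ignore.case = ignore.case.fw)[[1]]
    } else if (ignore.case.fw) {
      # fixed = TRUE cannot be combined with ignore.case, so lower-case both
      m <- gregexpr(tolower(fw), tolower(text), fixed = TRUE)[[1]]
    } else {
      m <- gregexpr(fw, text, fixed = TRUE)[[1]]
    }
    # gregexpr() returns -1 when there is no match
    total <- total + sum(m > 0)
  }
  total
}

# Hypothetical helper: decide whether a text meets the filter threshold.
# Assumption: a "%" threshold is relative to the whitespace-separated word count.
passes_filter <- function(text, filter.words, filter.word.times = "0.2%", ...) {
  hits <- count_filter_hits(text, filter.words, ...)
  n_words <- length(strsplit(trimws(text), "\\s+")[[1]])
  if (is.character(filter.word.times) && grepl("%$", filter.word.times)) {
    threshold <- n_words * as.numeric(sub("%$", "", filter.word.times)) / 100
  } else {
    threshold <- as.numeric(filter.word.times)
  }
  hits >= threshold
}

text <- "The cohort was split into a treatment Group and a control group."
fw <- c("cohort", "group")
count_filter_hits(text, fw)                          # case-sensitive matching
count_filter_hits(text, fw, ignore.case.fw = TRUE)   # capitalization ignored
passes_filter(text, fw, filter.word.times = "0.2%")  # percentage threshold
passes_filter(text, fw, filter.word.times = 3)       # absolute threshold
```

In this sketch a percentage threshold scales with the length of the document, whereas an absolute threshold does not, which is why a percentage is often the more convenient choice when papers differ widely in length.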

Examples

## Running a simple analysis with filter and search words to extract sentences
if(PDE_check_Xpdf_install() == TRUE){
 outputtables <- PDE_pdfs2txt_searchandfilter(pdfs = paste0(system.file(package = "PDE"),
                                       "/examples/Methotrexate/29973177_!.pdf"),
 out = paste0(system.file(package = "PDE"),"/examples/MTX_txt+-0/"),
 filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
 regex.fw = FALSE,
 ignore.case.fw = TRUE,
 search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
 regex.sw = TRUE,
 ignore.case.sw = FALSE)
}

## Running an advanced analysis with filter and search words to
## extract sentences and obtain documentation files
if(PDE_check_Xpdf_install() == TRUE){
 outputtables <- PDE_pdfs2txt_searchandfilter(pdfs = paste0(system.file(package = "PDE"),
                                       "/examples/Methotrexate/29973177_!.pdf"),
 out = paste0(system.file(package = "PDE"),"/examples/MTX_txt+-1/"),
 context = 1,
 filter.words = strsplit("cohort;case-control;group;study population;study participants", ";")[[1]],
 regex.fw = FALSE,
 ignore.case.fw = TRUE,
 filter.word.times = "0.2%",
 search.words = strsplit("(M|m)ethotrexate;(T|t)rexal;(R|r)heumatrex;(O|o)trexup", ";")[[1]],
 regex.sw = TRUE,
 ignore.case.sw = FALSE,
 eval.abbrevs = TRUE,
 out.table.format = ".csv (WINDOWS-1252)",
 write.txt.doc.file = TRUE,
 cpy_mv = "nocpymv",
 delete = TRUE)
}
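
## Illustrating the effect of the context argument
The two examples above differ mainly in `context` (0 versus 1). The following base-R sketch shows what that setting means for a vector of sentences. It is an illustration under stated assumptions, not the package's code: `sentences_with_context()` is a hypothetical helper, the input is assumed to be already split into sentences, and the search words are combined into a single regular expression.

```r
# Hypothetical helper: return the sentences that contain a search word,
# plus `context` sentences before and after each of them.
sentences_with_context <- function(sentences, search.words, context = 0,
                                   ignore.case.sw = FALSE) {
  # Combine all search words into one regular expression (assumption)
  pattern <- paste(search.words, collapse = "|")
  hits <- which(grepl(pattern, sentences, ignore.case = ignore.case.sw))
  if (length(hits) == 0) return(character(0))
  # Expand each hit to its neighbourhood, clipped to the valid index range
  idx <- unlist(lapply(hits, function(i) {
    seq(max(1, i - context), min(length(sentences), i + context))
  }))
  sentences[sort(unique(idx))]
}

s <- c("Patients were enrolled in 2015.",
       "All received methotrexate weekly.",
       "Doses were adjusted.",
       "Follow-up lasted one year.")
sw <- c("(M|m)ethotrexate", "(T|t)rexal")

sentences_with_context(s, sw, context = 0)  # only the sentence with the hit
sentences_with_context(s, sw, context = 1)  # the hit plus one sentence on each side
```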

PDE documentation built on June 22, 2024, 10:44 a.m.