loop_utilities: Utilities for Looping to Read In Documents

Description

loop_counter - A simple loop counter for tracking the progress of reading in a batch of files.

base_name - Like base::basename but doesn't choke on long paths.

try_limit - Limits the amount of time that an expression can run for. This works to limit how long an attempted read-in of a document may take. It is most useful in a loop with a few very long-running document read-ins (e.g., .pdf files that require the tesseract package). Note that max.time cannot stop a system call (which many read-in functions essentially rely on), but it can limit how many system calls are made. This means a .pdf with multiple tesseract pages will only allow the first page to be read in before returning an error result. Note also that this approach does not distinguish between errors from running expr and time-out errors.
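
The behavior described for try_limit can be approximated in base R by wrapping try() in setTimeLimit(). The sketch below is not textreadr's source: with_time_limit is a hypothetical name, and it only mirrors the max.time, timeout.return, and zero.length.return arguments listed under Usage. It illustrates why a failed read and a timed-out read look the same to the caller, since both surface as a try-error and are mapped to the same return value.

```r
# Hypothetical helper for illustration; not textreadr's implementation.
with_time_limit <- function(expr, max.time = Inf, timeout.return = NULL,
                            zero.length.return = "") {
    # Arm an elapsed-time limit for this call and always clear it on exit
    setTimeLimit(elapsed = max.time, transient = TRUE)
    on.exit(setTimeLimit(elapsed = Inf, transient = FALSE))

    # expr is evaluated lazily here, inside try(), so errors are captured
    out <- try(expr, silent = TRUE)

    # A time-out and an ordinary error both arrive as a "try-error"
    if (inherits(out, "try-error")) return(timeout.return)
    if (length(out) == 0) return(zero.length.return)
    out
}

# The expression errors, so timeout.return comes back
with_time_limit(stop("boom"), timeout.return = "failed")

# The sleep is expected to exceed the limit, so timeout.return comes back
with_time_limit({Sys.sleep(2); "done"}, max.time = 0.2, timeout.return = "failed")

# A zero-length result is replaced by zero.length.return
with_time_limit(character(0), zero.length.return = NA)
```

R checks time limits only at points where a user interrupt could occur, so a single long-running external call is not cut short. This is consistent with the caveat above about system calls.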

Usage

loop_counter(i, total, file, ...)

base_name(path)

try_limit(
  expr,
  max.time = Inf,
  timeout.return = NULL,
  zero.length.return = "",
  silent = TRUE,
  ...
)

Arguments

i

Iteration of the loop.

total

Total number of iterations.

file

The file name of that iteration to print out.

...

ignored

path

A character vector, containing path names.

expr

An expression to run.

max.time

Max allotted elapsed run time in seconds.

timeout.return

Value to return for timeouts.

zero.length.return

Value to return for length zero expression evaluations.

silent

logical. If TRUE, the report of error messages is suppressed.

Value

loop_counter - Prints loop information.

base_name - Returns just the basename of the path.
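
As a rough illustration of what these two helpers do, the base R sketch below formats a progress line and extracts the last path component with a regular expression. The names progress_line and last_component, the "[i/total] file" message layout, and the forward-slash assumption are all hypothetical; textreadr's own functions may format and parse differently.

```r
# Hypothetical stand-ins written in base R; not textreadr's code.

# Format a one-line progress message (the layout is an assumption)
progress_line <- function(i, total, file) {
    sprintf("[%d/%d] %s", i, total, file)
}

# Drop everything up to the last forward slash; a pure string operation.
# Assumes "/" is the path separator.
last_component <- function(path) {
    sub("^.*/", "", path)
}

paths <- c("docs/reports/summary.pdf", "notes.txt")

for (i in seq_along(paths)) {
    message(progress_line(i, length(paths), last_component(paths[i])))
}
```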

Examples

## Not run: 
files <- dir(
    system.file("docs", package = "textreadr"),
    full.names = TRUE, 
    recursive = TRUE, 
    pattern = '\\.(R?md|Rd?$|txt|sql|html|pdf|doc|ppt|tex)'
)

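# Allow each read at most 30 seconds
max_wait <- 30
total <- length(files)
# Pre-allocate one slot per file
content <- vector(mode = "list", total)

for (i in seq_along(files)) {

    # Print progress for the current file
    loop_counter(i, total, base_name(files[i]))

    # Wrap the result in list() and assign with `[` so that a NULL result
    # (the default timeout.return) keeps its slot; `content[[i]] <- NULL`
    # would delete the element instead
    content[i] <- list(try_limit(
        textreadr::read_document(files[i]),
        max.time = max_wait,
        zero.length.return = NA
    ))
}

# Which reads errored or timed out (NULL)?
sapply(content, is.null)
# Which reads returned a zero-length result (NA)?
sapply(content, function(x) length(x) == 1 && is.na(x))
content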

## End(Not run)
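
The checks at the end of the example can be folded into a single status label per file. The sketch below runs on a small mock list rather than real documents. The helper read_status and its labels are hypothetical, and it assumes the settings used above: NULL for errors or time-outs (the default timeout.return) and NA for zero-length reads.

```r
# Mock results standing in for the `content` list built in the loop above
# (made-up values, for illustration only)
content <- list("some text", NULL, NA, c("line 1", "line 2"))

# Hypothetical helper: label each result by how the read ended
read_status <- function(x) {
    if (is.null(x)) {
        "error_or_timeout"
    } else if (length(x) == 1 && is.na(x)) {
        "empty"
    } else {
        "ok"
    }
}

status <- vapply(content, read_status, character(1))

# Count how many reads fell into each category
table(status)
```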

textreadr documentation built on Oct. 9, 2021, 5:06 p.m.