explode_dir: Helper functions for file input and output
In opm: Analysing Phenotype Microarray and Growth Curve Data


Batch-collect information from a series of input files or batch-convert data from input files to data in output files. Alternatively, turn a mixed file/directory list into a list of files or create a regular expression matching certain file extensions, or convert a wildcard pattern to a regular expression, or split files. These functions are not normally directly called by an opm user but by the other IO functions of the package such as collect_template or batch_opm. One can use their demo argument directly for testing the results of the applied file name patterns.

  explode_dir(names, include = NULL, exclude = NULL,
    ignore.case = TRUE, wildcard = TRUE, recursive = TRUE,
    missing.error = TRUE, remove.dups = TRUE)

  batch_collect(names, fun, fun.args = list(), proc = 1L,
    ..., use.names = TRUE, simplify = FALSE, demo = FALSE)

  batch_process(names, out.ext, io.fun, fun.args = list(),
    proc = 1L, outdir = NULL,
    overwrite = c("yes", "older", "no"), in.ext = "any",
    compressed = TRUE,
    literally = inherits(in.ext, "AsIs"), ...,
    verbose = TRUE, demo = FALSE)

  file_pattern(type = c("both", "csv", "yaml", "json", "yorj", "lims", "nolims",
    "any", "empty"),
    compressed = TRUE, literally = inherits(type, "AsIs"))

  split_files(files, pattern, outdir = "", demo = FALSE,
    single = TRUE, wildcard = FALSE, invert = FALSE,
    include = TRUE, format = opm_opt("file.split.tmpl"),
    compressed = TRUE, ...)

  glob_to_regex(object)

  ## S3 method for class 'character'
  glob_to_regex(object)

  ## S3 method for class 'factor'
  glob_to_regex(object)

`names`	Character vector containing file names or directories, or convertible to such.
`object`	Character vector or factor.
`include`	If a character scalar, used as regular expression or wildcard (see the `wildcard` argument) for selecting from the input files. If `NULL`, ignored. If a list, used as arguments of `file_pattern` and its result used as regular expression. Note that selection is done after expanding the directory names to file names. For `split_files` a logical scalar. `TRUE` means to also include the separator lines in the output files.
`exclude`	Like `include`, but for excluding matching input files. Note that exclusion is done after applying `include`.
`ignore.case`	Logical scalar. Ignore differences between uppercase and lowercase when using `include` and `exclude`? Has no effect for `NULL` values for `include` or `exclude`, respectively.
`wildcard`	Logical scalar. Are `include`, `exclude` or `pattern` wildcards (as used by UNIX shells) that first need to be converted to regular expressions? Has no effect if lists are used for `include` or `exclude`, respectively. See below for details on such wildcards (a.k.a. globbing patterns).
`recursive`	Logical scalar. Traverse directories recursively and also consider all subdirectories? See `list.files` from the base package for details.
`missing.error`	Logical scalar. If a file/directory does not exist, raise an error or only a warning?
`remove.dups`	Logical scalar. Remove duplicates from `names`? Note that if requested this is done before expanding the names of directories, if any.
`fun`	Collecting function. Should use the file name as first argument.
`fun.args`	Optional list of arguments to `fun` or `io.fun`.
`...`	Optional further arguments passed from `batch_process` or `batch_collect` to `explode_dir`. For `split_files`, optional arguments passed to `grepl`, which is used for matching the separator lines. See also `invert` listed above.
`proc`	Integer scalar. The number of processes to spawn. Cannot be set to more than 1 core if running under Windows. See the `cores` argument of `do_aggr` for details.
`simplify`	Logical scalar. Should the resulting list be simplified to a vector or matrix if possible?
`use.names`	Logical scalar. Should `names` be used for naming the elements of the result?
`out.ext`	Character scalar. The extension of the output file names (without the dot).
`outdir`	Character vector. Directories in which to place the output files. If empty or only containing empty strings, the directory of each input file is used.
`in.ext`	Character scalar. Passed through `file_pattern`, then used for the replacement of old file extensions with new ones.
`type`	Character scalar indicating the file types to be matched by extension. For historical reasons, `both` means either CSV or YAML or JSON, whereas `yorj` means either YAML or JSON. CSV also includes the LIMS CSV format introduced in 2014, which can be specifically selected using `lims` or excluded using `nolims`. Alternatively, directly the extension or extensions, or a list of file names (not `NA`).
`compressed`	Logical scalar. Shall compressed files also be matched? This affects the returned pattern as well as the pattern used for extracting file extensions from complete file names (if `literally` is `TRUE`). `split_files` passes this argument to `file_pattern`, but here it only affects the way file names are split in extensions and base names. Should only be set to `FALSE` if input files are not compressed (and have according file extensions).
`literally`	Logical scalar. Interpret `type` literally? This also allows for vectors with more than a single element, as well as the extraction of file extensions from file names.
`demo`	Logical scalar. In the case of `batch_process`, if `TRUE` do not convert files, but print the attempted input file-output file conversions and invisibly return a matrix with input files in the first and output files in the second column? For the other functions, the effect is equivalent. For `split_files`, do not create files, just return the usual list containing all potentially created files. Note that in contrast to the `demo` arguments of other IO functions, this requires the input files to be read.
`files`	Character vector or convertible to such. Names of the files to be split. In contrast to functions such as `read_opm`, names of directories are not supported (will not be expanded to lists of files).
`pattern`	Regular expression or shell globbing pattern for matching the separator lines if `invert` is `FALSE` (the default) or matching the non-separator lines if otherwise. Conceptually each of the sections into which a file is split comprises a separator line followed by non-separator lines. That is, separator lines followed by another separator line are ignored. Non-separator lines not preceded by a separator line are treated as a section of their own, however.
`single`	Logical scalar. If there is only one group per file, i.e. only one output file would result from the splitting, create this file anyway? Such cases could be recognised by empty character vectors as values of the returned list (see below).
`invert`	Logical scalar. Invert pattern matching, i.e. treat all lines that do not match `pattern` as separators?
`format`	Character scalar determining the output file name format. It is passed to `sprintf` and expects three placeholders: the base name of the file; the index of the section; the file extension. Getting `format` wrong might result in non-unique file names and thus probably in overwritten files; accordingly, it should be used with care.
`io.fun`	Conversion function. Should accept `infile` and `outfile` as the first two arguments.
`overwrite`	Character scalar. If ‘yes’, conversion is always tried if `infile` exists and is not empty. If ‘no’, conversion is not tried if `outfile` exists and is not empty. If ‘older’, conversion is tried if `outfile` does not exist or is empty or is older than `infile` (with respect to the modification time).
`verbose`	Logical scalar. Print conversion and success/failure information?
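
The three `overwrite` settings described above can be illustrated with a small base-R sketch. `should_convert` is a hypothetical helper written for this illustration and is not part of opm; it also assumes that an empty or missing `infile` prevents conversion under all three settings, which the description states explicitly only for ‘yes’.

```r
# Hypothetical helper (not an opm function): decide whether a conversion
# from 'infile' to 'outfile' would be tried under a given 'overwrite' mode.
should_convert <- function(infile, outfile,
    overwrite = c("yes", "older", "no")) {
  overwrite <- match.arg(overwrite)
  # TRUE if the file exists and contains at least one byte
  non_empty <- function(f) file.exists(f) && isTRUE(file.size(f) > 0)
  if (!non_empty(infile)) # assumption: nothing to convert in that case
    return(FALSE)
  switch(overwrite,
    yes = TRUE,
    no = !non_empty(outfile),
    older = !non_empty(outfile) ||
      file.mtime(outfile) < file.mtime(infile)
  )
}

infile <- tempfile()
outfile <- tempfile()
writeLines("input", infile)
before <- should_convert(infile, outfile, "no") # outfile does not exist yet
writeLines("output", outfile)
after <- should_convert(infile, outfile, "no") # outfile exists, is non-empty
unlink(c(infile, outfile)) # tidy up
```

With ‘no’, the decision flips from trying the conversion to skipping it as soon as a non-empty output file is present.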

Other functions that call explode_dir have a demo argument which, if set to TRUE, causes the respective function to do no real work but print the names of the files that it would process in normal running mode.

glob_to_regex changes a shell globbing wildcard into a regular expression. This is just a slightly extended version of glob2rx from the utils package, but more conversion steps might need to be added here in the future. In particular, explode_dir and the IO functions that call it internally use glob_to_regex. Some hints on using globbing patterns follow.

The globbing search patterns used here contain only two special characters, ‘?’ and ‘*’, and are thus easier to master than regular expressions. ‘?’ matches a single arbitrary character, whereas ‘*’ matches zero to an arbitrary number of arbitrary characters. Some examples:

a?c: Matches abc, axc, a c etc. but not abbc, abbbc, ac etc.
a*c: Matches abc, abbc, ac etc. but not abd etc.
ab*: Matches abc, abcdefg, abXYZ etc. but not acdefg etc.
?bc: Matches abc, Xbc, ‘ bc’ (with a leading space) etc. but not aabc, abbc, bc etc.

Despite their simplicity, globbing patterns are often sufficient for selecting file names.
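
The matches listed above can be checked in base R. The following sketch uses glob2rx from the utils package as a stand-in for glob_to_regex, on the assumption that both behave identically for these simple patterns; the candidate strings are taken from the list above.

```r
patterns <- c("a?c", "a*c", "ab*", "?bc")
candidates <- c("abc", "axc", "a c", "abbc", "ac", "abd", "abXYZ",
  "Xbc", " bc", "bc", "aabc")

# Convert each globbing pattern to an anchored regular expression
regexes <- utils::glob2rx(patterns)

# Logical matrix: one row per candidate string, one column per pattern
hits <- sapply(regexes, grepl, x = candidates)
dimnames(hits) <- list(candidates, patterns)
print(hits)
```

Each column of `hits` shows which strings a pattern selects. For instance, the row for "bc" is FALSE in the `?bc` column because ‘?’ requires exactly one character.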

split_files subdivides each file into sections which are written individually to newly generated files. Sections are determined with patterns that match the start of a section. This function might be useful for splitting OmniLog(R) multiple-plate CSV files before inputting them with read_opm, even though that function could also input such files directly. It is used in one of the running modes of batch_opm for splitting files. See also the ‘Examples’. The newly generated files are numbered accordingly; they are not named after any csv_data entry because there is no guarantee that it is present.
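
The idea of sectioning by separator lines can be sketched in base R as follows. This is a simplified illustration, not the implementation used by split_files: the input lines are made up, and unlike the behaviour described for the `pattern` argument, a separator line directly followed by another separator line is not dropped here.

```r
# Made-up content of a file with two plates and a leading line
lines <- c("preamble", "Data File,A", "1,2", "3,4", "Data File,B", "5,6")

# Mark the separator lines, i.e. those that start a new section
is_sep <- grepl("^Data File", lines)

# Each separator increments the section index; lines before the first
# separator get index 0 and thus form a section of their own
section_id <- cumsum(is_sep)
sections <- split(lines, section_id)
print(sections)
```

Each element of `sections` corresponds to the content of one output file; writing them out would then be a matter of calling `writeLines` once per element.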

explode_dir returns a character vector (which would be empty if all existing files, if any, had been unselected).

batch_collect returns a list, potentially simplified to a vector, depending on the output of fun and the value of simplify. See also demo.

In normal mode, batch_process creates an invisibly returned character matrix in which each row corresponds to a named character vector with the keys infile, outfile, before and after. The latter two describe the result of the action(s) before and after attempting to convert infile to outfile. The after entry is the empty string if no conversion was tried (see overwrite), ok if conversion was successful and a message describing the problems otherwise. For the results of the demo mode see above.

file_pattern yields a character scalar, holding a regular expression. glob_to_regex yields a vector of regular expressions.

split_files creates a list of character vectors, each vector containing the names of the newly generated files. The names of the list are the input file names. The list is returned invisibly.
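
How such output file names arise from a `format` template with three placeholders (base name, section index, file extension) can be sketched with `sprintf`. The template below is a hypothetical one chosen for illustration; the actual default is whatever `opm_opt("file.split.tmpl")` returns and may differ.

```r
# Hypothetical template: base name, zero-padded section index, extension
template <- "%s-%05i.%s"

infile <- "plates.csv" # made-up input file name
base <- tools::file_path_sans_ext(infile)
ext <- tools::file_ext(infile)

# One output file name per section, here for three sections
outfiles <- sprintf(template, base, 1:3, ext)
print(outfiles)

# A template lacking the index placeholder yields non-unique names
bad <- sprintf("%s.%s", rep(base, 3), ext)
stopifnot(anyDuplicated(bad) > 0)
```

The second call shows why `format` should be used with care: without the section index, all sections map to the same file name and would overwrite each other.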

base::list.files, base::Sys.glob, utils::glob2rx, base::regex, base::split, base::strsplit, base::file.rename

Other io-functions: batch_opm, collect_template, read_opm, read_single_opm, to_metadata

# explode_dir()
# Example with temporary directory
td <- tempdir()
tf <- tempfile()
(x <- explode_dir(td))
write(letters, tf)
(y <- explode_dir(td))
stopifnot(length(y) > length(x))
unlink(tf)
(y <- explode_dir(td))
stopifnot(length(y) == length(x))

# Example with R installation directory
(x <- explode_dir(R.home(), include = "*/doc/html/*"))
(y <- explode_dir(R.home(), include = "*/doc/html/*", exclude = "*.html"))
stopifnot(length(x) == 0L || length(x) > length(y))

# batch_collect()
# Read the first line from each of the OPM test data set files
f <- opm_files("testdata")
if (length(f) > 0) { # if the files are found
  x <- batch_collect(f, fun = readLines, fun.args = list(n = 1L))
  # yields a list with the input files as names and the result from each
  # file as values (exactly one line)
  stopifnot(is.list(x), identical(names(x), f))
  stopifnot(sapply(x, is.character), sapply(x, length) == 1L)
} else {
  warning("test files not found")
}
# For serious tasks, consider first trying the function in 'demo' mode.

# batch_process()
# Read the first line from each of the OPM test data set files and store it
# in temporary files
pf <- function(infile, outfile) write(readLines(infile, n = 1), outfile)
infiles <- opm_files("testdata")
if (length(infiles) > 0) { # if the files are found
  x <- batch_process(infiles, out.ext = "tmp", io.fun = pf,
    outdir = tempdir())
  stopifnot(is.matrix(x), identical(x[, 1], infiles))
  stopifnot(file.exists(x[, 2]))
  unlink(x[, 2])
} else {
  warning("test files not found")
}
# For serious tasks, consider first trying the function in 'demo' mode.

# file_pattern()
(x <- file_pattern())
(y <- file_pattern(type = "csv", compressed = FALSE))
stopifnot(nchar(x) > nchar(y))
# constructing pattern from existing files
(files <- list.files(pattern = "[.]"))
(x <- file_pattern(I(files))) # I() causes 'literally' to be TRUE
stopifnot(grepl(x, files, ignore.case = TRUE))

# glob_to_regex()
x <- "*what glob2rx() can't handle because a '+' is included*"
(y <- glob_to_regex(x))
(z <- glob2rx(x))
stopifnot(!identical(y, z))
# Factor method
(z <- glob_to_regex(as.factor(x)))
stopifnot(identical(as.factor(y), z))

## split_files()

# Splitting an old-style CSV file containing several plates
(x <- opm_files("multiple"))
if (length(x) > 0) {
  outdir <- tempdir()
  # For old-style CSV, use either "^Data File" as pattern or "Data File*"
  # with 'wildcard' set to TRUE:
  (result <- split_files(x, pattern = "^Data File", outdir = outdir))
  stopifnot(is.list(result), length(result) == length(x))
  stopifnot(sapply(result, length) == 3)
  result <- unlist(result)
  stopifnot(file.exists(result))
  unlink(result) # tidy up
} else {
  warning("opm example files not found")
}
## One could split new-style CSV as follows (if x is a vector of file names):
# split_files(x, pattern = '^"Data File",')
## note the correct setting of the quotes
## A pattern that covers both old and new-style CSV is:
# split_files(x, pattern = '^("Data File",|Data File)')
## This is used by batch_opm() in 'split' mode and by the 'run_opm.R' script