Recommended variable argument syntax

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This is the second vignette -- we assume you have already read the "three argument syntax" vignette which covers the most basic namedCapture functions, str_match_named and str_match_all_named. Here we introduce the syntax used in the namedCapture::*_variable functions, which is motivated by the desire to avoid repetitive/boilerplate code.

Extract the first match from each subject

In the previous vignette we used the following code to extract the first match from each subject,

subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
## Single line pattern, not so easy to read.
single.line.pattern <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
## Same pattern defined over multiple lines, easier to read.
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
identical(single.line.pattern, chr.pos.pattern)
namedCapture::str_match_named(subject.vec, chr.pos.pattern)

Note that the pattern above is defined using the paste0 boilerplate, which is used to break the pattern over several lines for clarity. Using the variable argument syntax, we can omit paste0, and simply supply the pattern strings to str_match_variable directly,

namedCapture::str_match_variable(
  subject.vec, 
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")

We can further simplify by removing the named capture groups from the strings, and adding names to the corresponding arguments. For name1="pattern1", namedCapture internally generates/uses the regex (?P<name1>pattern1).

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+",
  "(?:",
    "-",
    chromEnd="[0-9,]+",
  ")?")

We can add type conversion functions on the same line as the definition of the named group:

keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(match.df <- namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits,
  "(?:",
    "-",
    chromEnd="[0-9,]+", keep.digits,
  ")?"))

Note the repetition in the chromStart/End lines -- the same pattern and type conversion function is used for each group. This repetition can be avoided by creating and using a sub-pattern list variable,

pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  "(?:",
    "-",
    chromEnd=pos.pattern,
  ")?")

Finally, the non-capturing group can be replaced by an un-named list:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")

In summary, the str_match_variable function takes a variable number of arguments, and allows for a shorter, less repetitive, and thus more user-friendly syntax:

View generated regex

To see the regular expression pattern string generated by the namedCapture::*_variable functions, call variable_args_list with the variable number of arguments that specify the pattern:

(L <- namedCapture::variable_args_list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?"))
identical(L$pattern, single.line.pattern)

The generated regex is the pattern element of the resulting list above (which is internally passed to namedCapture::*_named). Note how the generated regex is identical to the regex we defined above using a character string literal; the advantage of namedCapture::*_variable functions is that the regex is much easier to read/understand/edit.

Error if any subjects do not match

Sometimes you want to stop with an error (instead of reporting a row of NA) when a subject does not match. In that case, use nomatch.error=TRUE:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?",
  nomatch.error=TRUE)

Extract all matches from a multi-line text file subject

The variable argument syntax can also be used with str_match_all_variable, which is for the common case of extracting each match from a multi-line text file. In this section we demonstrate how to use str_match_all_variable to extract data.frames from a loosely structured text file.

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="namedCapture")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")

Each block of text begins with "track" and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

fields.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")

Note that this function assumes that its first argument is a character vector with one element for each line in a file. Therefore the result contains no information about which subject element each match comes from (to get that, use str_match_all_named). The code above creates a data frame with one row for each track block, with rownames given by the track line (because of the capture group named name), and one fields column which is a string with the rest of the data in that block.

head(fields.df)

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

fields.list <- namedCapture::str_match_all_named(
  fields.df[, "fields"], paste0(
    "\\s+",
    "(?P<name>.*?)",
    " ",
    "(?P<value>[^\n]+)"))

Note that we used str_match_all_named which outputs a list in order to keep info about which match came from which subject. The result is a list of data frames.

fields.list[12:14]

There is a list element for each block, named by track. Each list element is a data frame with one row per field defined in that block (rownames are field names). The names/rownames make it easy to write R code that selects individual elements by name, e.g.

fields.list$bcell_McGill0091Coverage["bigDataUrl",]
fields.list$monocyte_McGill0001Peaks["color",]
has.bigDataUrl <- sapply(fields.list, function(m)"bigDataUrl" %in% rownames(m))
bigDataUrl.list <- fields.list[has.bigDataUrl]
length(bigDataUrl.list)
length(fields.list)

So there are 78 tracks which define the bigDataUrl field, out of 123 total tracks.

In the example above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the bigDataUrl field for each track, and split sample names into separate columns (using a single regex for the track). It also demonstrates how to use nested named capture groups (via named lists which contain named regex strings).

name.pattern <- list(
  cellType=".*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks",
  "|",
  "[^\n]+")
match.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name=name.pattern,
  "(?:\n[^\n]+)*",
  "\\s+bigDataUrl ",
  bigDataUrl="[^\n]+")
head(match.df)

Exercise for the reader: modify the above regex in order to capture three additional columns (red, green, blue) from the color field.

Extract several columns of a data frame

We also provide namedCapture::df_match_variable which extracts text from several columns of a data.frame, using a different named capture regular expression for each column.

(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))

Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:

## Define some sub-patterns separately for clarity.
range.pattern <- list(
  "[[]",
  task1="[0-9]+", as.integer,
  "(?:-",#begin optional end of range.
  taskN="[0-9]+", as.integer,
  ")?", #end is optional.
  "[]]")
task.pattern <- list(
  "(?:",#begin alternate
  task="[0-9]+", as.integer,
  "|",#either one task(above) or range(below)
  range.pattern,
  ")")#end alternate
(task.df <- namedCapture::df_match_variable(
  sacct.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))

The result is another data frame with an additional column for each named capture group. Note that this also works with data.table:

library(data.table)
sacct.dt <- data.table(sacct.df)
(task.dt <- namedCapture::df_match_variable(
  sacct.dt,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))


tdhock/namedCapture documentation built on Jan. 27, 2024, 9:02 p.m.