In tdhock/namedCapture: Named Capture Regular Expressions

Recommended variable argument syntax

This is the second vignette. We assume you have already read the "three argument syntax" vignette, which covers the most basic namedCapture functions, str_match_named and str_match_all_named. Here we introduce the syntax used in the namedCapture::*_variable functions, which is designed to avoid repetitive boilerplate code.

Extract the first match from each subject

In the previous vignette we used the following code to extract the first match from each subject:

subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
## Single line pattern, not so easy to read.
single.line.pattern <-
  "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
## Same pattern defined over multiple lines, easier to read.
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
identical(single.line.pattern, chr.pos.pattern)
namedCapture::str_match_named(subject.vec, chr.pos.pattern)

Note that the pattern above is built with paste0, which serves only to break the pattern over several lines for clarity. Using the variable argument syntax, we can omit paste0 and simply supply the pattern strings to str_match_variable directly:

namedCapture::str_match_variable(
  subject.vec, 
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")

We can further simplify by removing the named capture groups from the strings, and adding names to the corresponding arguments. For name1="pattern1", namedCapture internally generates/uses the regex (?P<name1>pattern1).
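
As a rough illustration of this naming rule, the base R sketch below builds the group string by hand and uses it with regexpr. The helper wrap_named is hypothetical (it is not part of namedCapture) and only mimics the behavior described above.

```r
# Hypothetical helper, not a namedCapture function: wrap a pattern
# in a named capture group using the (?P<name>pattern) syntax.
wrap_named <- function(name, pattern) {
  paste0("(?P<", name, ">", pattern, ")")
}
chrom.group <- wrap_named("chrom", "chr.*?")

# Base R (perl=TRUE) reports the position of each named group.
subject <- "chr10:213,054,000"
m <- regexpr(paste0(chrom.group, ":"), subject, perl=TRUE)
first <- attr(m, "capture.start")[1, "chrom"]
len <- attr(m, "capture.length")[1, "chrom"]
chrom.text <- substr(subject, first, first + len - 1)
print(chrom.group)
print(chrom.text)
```

The namedCapture call below applies the same rule to each named argument.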

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+",
  "(?:",
    "-",
    chromEnd="[0-9,]+",
  ")?")

We can add type conversion functions on the same line as the definition of the named group:

keep.digits <- function(x)as.integer(gsub("[^0-9]", "", x))
(match.df <- namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart="[0-9,]+", keep.digits,
  "(?:",
    "-",
    chromEnd="[0-9,]+", keep.digits,
  ")?"))

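To see what the conversion function does in isolation, it can be called directly on a few captured strings. This is a small self-contained check; the input values are taken from the subjects above, plus a missing value.

```r
# Same definition as above: drop every non-digit, then convert.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
converted <- keep.digits(c("213,054,000", "111,000", NA))
print(converted)
```

Commas are removed before conversion, and a missing input stays missing.
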
Note the repetition in the chromStart/End lines: the same pattern and type conversion function are used for each group. This repetition can be avoided by creating and using a sub-pattern list variable,

pos.pattern <- list("[0-9,]+", keep.digits)
namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  "(?:",
    "-",
    chromEnd=pos.pattern,
  ")?")

Finally, the non-capturing group can be replaced by an un-named list:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?")

In summary, the str_match_variable function takes a variable number of arguments, and allows for a shorter, less repetitive, and thus more user-friendly syntax:

The first argument is the subject character vector.
The other arguments specify the pattern, via character strings, functions, and/or lists.
If a pattern (character/list) is named, we use the argument name in R for the capture group name in the regex.
Each function is used to convert the text extracted by the previous named pattern argument. Type conversion can only be used with named R arguments, not with named groups written explicitly in regex strings.
Lists may be used to avoid repetition in the definition of the pattern and type conversion functions.
Each list generates a group in the regex (named list => named capture group, un-named list => non-capturing group).
All patterns are pasted together in the order that they appear in the argument list.
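
The rules listed above can be sketched as a short recursive function. This is a simplified, hypothetical re-implementation for illustration only, not the namedCapture source code: the name build_pattern is an assumption, and conversion functions are skipped rather than applied.

```r
# Hypothetical sketch of the pattern-building rules; NOT namedCapture code.
build_pattern <- function(...) {
  args <- list(...)
  arg.names <- names(args)
  if (is.null(arg.names)) arg.names <- rep("", length(args))
  pieces <- character(0)
  for (i in seq_along(args)) {
    arg <- args[[i]]
    if (is.function(arg)) next  # conversion functions add no regex text
    piece <- if (is.list(arg)) {
      do.call(build_pattern, arg)  # lists are processed recursively
    } else {
      arg
    }
    piece <- if (arg.names[i] != "") {
      paste0("(?P<", arg.names[i], ">", piece, ")")  # named => capture group
    } else if (is.list(arg)) {
      paste0("(?:", piece, ")")  # un-named list => non-capturing group
    } else {
      piece
    }
    pieces <- c(pieces, piece)
  }
  paste(pieces, collapse="")  # paste in argument order
}

pos.pattern <- list("[0-9,]+", as.integer)
generated <- build_pattern(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list("-", chromEnd=pos.pattern),
  "?")
print(generated)
```

For these arguments the sketch reproduces the single-line pattern defined at the start of this vignette.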

View generated regex

To see the regular expression pattern string generated by the namedCapture::*_variable functions, call variable_args_list with the variable number of arguments that specify the pattern:

(L <- namedCapture::variable_args_list(
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?"))
identical(L$pattern, single.line.pattern)

The generated regex is the pattern element of the resulting list above (which is internally passed to namedCapture::*_named). Note how the generated regex is identical to the regex we defined above using a character string literal; the advantage of namedCapture::*_variable functions is that the regex is much easier to read/understand/edit.

Error if any subjects do not match

Sometimes you want to stop with an error (instead of reporting a row of NA) when a subject does not match. In that case, use nomatch.error=TRUE:

namedCapture::str_match_variable(
  subject.vec, 
  chrom="chr.*?",
  ":",
  chromStart=pos.pattern,
  list(
    "-",
    chromEnd=pos.pattern
  ), "?",
  nomatch.error=TRUE)
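
The idea of stopping on a non-matching subject can be sketched in base R as follows. This only illustrates the concept, using a simplified pattern; it makes no claim about how namedCapture implements nomatch.error, and treating NA subjects as non-matching is a choice of the sketch.

```r
subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "this will not match",
  NA)
match.pos <- regexpr("chr.*?:[0-9,]+", subject.vec, perl=TRUE)
# In this sketch, NA subjects are counted as non-matching.
no.match <- is.na(match.pos) | as.vector(match.pos) == -1
result <- tryCatch({
  if (any(no.match)) {
    stop("subjects with no match: ", paste(which(no.match), collapse=", "))
  }
  "all subjects matched"
}, error=function(e) conditionMessage(e))
print(result)
```

Wrapping the call in tryCatch, as here, is one way to handle such an error without halting a longer script.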

Extract all matches from a multi-line text file subject

The variable argument syntax can also be used with str_match_all_variable, which is for the common case of extracting each match from a multi-line text file. In this section we demonstrate how to use str_match_all_variable to extract data.frames from a loosely structured text file.

trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="namedCapture")
trackDb.vec <- readLines(trackDb.txt.gz)

Some representative lines from that file are shown below.

cat(trackDb.vec[78:107], sep="\n")

Each block of text begins with "track" and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:

fields.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")

Note that this function assumes that its first argument is a character vector with one element for each line in a file. Therefore the result contains no information about which subject element each match comes from (to get that, use str_match_all_named). The code above creates a data frame with one row for each track block, with rownames given by the track line (because of the capture group named name), and one fields column which is a string with the rest of the data in that block.

head(fields.df)
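
To illustrate why the result carries no information about individual subject elements, the base R sketch below joins a few lines into one string before matching. The toy lines are hypothetical, and joining with newline separators is an assumption of this sketch rather than a statement about namedCapture internals.

```r
# Hypothetical lines in the style of a trackDb file.
lines.vec <- c(
  "track trackA",
  "type bigWig",
  "color 1,2,3",
  "",
  "track trackB",
  "type bigBed",
  "")
# Join the lines so that one match can span several of them.
one.string <- paste(lines.vec, collapse="\n")
block.pattern <- "track \\S+(?:\n[^\n]+)*\n"
block.vec <- regmatches(
  one.string, gregexpr(block.pattern, one.string, perl=TRUE))[[1]]
print(length(block.vec))
```

Each element of block.vec is one whole track block, spanning several of the original lines.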

Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:

fields.list <- namedCapture::str_match_all_named(
  fields.df[, "fields"], paste0(
    "\\s+",
    "(?P<name>.*?)",
    " ",
    "(?P<value>[^\n]+)"))

Note that we used str_match_all_named, which outputs a list, in order to keep track of which match came from which subject. The result is a list of data frames.

fields.list[12:14]

There is a list element for each block, named by track. Each list element is a data frame with one row per field defined in that block (rownames are field names). The names/rownames make it easy to write R code that selects individual elements by name, e.g.

fields.list$bcell_McGill0091Coverage["bigDataUrl",]
fields.list$monocyte_McGill0001Peaks["color",]
has.bigDataUrl <- sapply(fields.list, function(m)"bigDataUrl" %in% rownames(m))
bigDataUrl.list <- fields.list[has.bigDataUrl]
length(bigDataUrl.list)
length(fields.list)

So there are 78 tracks which define the bigDataUrl field, out of 123 total tracks.

In the example above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the bigDataUrl field for each track, and split sample names into separate columns (using a single regex for the track). The example also demonstrates how to use nested named capture groups (via named lists which contain named regex strings).

name.pattern <- list(
  cellType=".*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks",
  "|",
  "[^\n]+")
match.df <- namedCapture::str_match_all_variable(
  trackDb.vec,
  "track ",
  name=name.pattern,
  "(?:\n[^\n]+)*",
  "\\s+bigDataUrl ",
  bigDataUrl="[^\n]+")
head(match.df)

Exercise for the reader: modify the above regex in order to capture three additional columns (red, green, blue) from the color field.

Extract several columns of a data frame

We also provide namedCapture::df_match_variable which extracts text from several columns of a data.frame, using a different named capture regular expression for each column.

It requires a data.frame as the first argument.
It takes a variable number of other arguments, all of which must be named. For each other argument we call str_match_variable on one column of the input data.frame.
Each argument name specifies a column of the data.frame which will be used as the subject in str_match_variable.
Each argument value specifies a pattern to be used with str_match_variable, in list/character/function format as explained in the sections above.
The return value is a data.frame with the same number of rows as the input, but with an additional column for each named capture group. New columns are named using the convention subjectColumnName.groupName.
This is a "tidy" function that can be used in a pipe.

This function can greatly simplify the code required to create numeric data columns from character data columns. For example, consider the following data, which was output by the sacct program.

(sacct.df <- data.frame(
  Elapsed = c(
    "07:04:42", "07:04:42", "07:04:49",
    "00:00:00", "00:00:00"),
  JobID=c(
    "13937810_25",
    "13937810_25.batch",
    "13937810_25.extern",
    "14022192_[1-3]",
    "14022204_[4]"),
  stringsAsFactors=FALSE))

Say we want to filter by the total Elapsed time (which is reported as hours:minutes:seconds), and base job id (which is the number before the underscore in the JobID column). We could start by converting those character columns to integers via:

## Define some sub-patterns separately for clarity.
range.pattern <- list(
  "[[]",
  task1="[0-9]+", as.integer,
  "(?:-",#begin optional end of range.
  taskN="[0-9]+", as.integer,
  ")?", #end is optional.
  "[]]")
task.pattern <- list(
  "(?:",#begin alternate
  task="[0-9]+", as.integer,
  "|",#either one task(above) or range(below)
  range.pattern,
  ")")#end alternate
(task.df <- namedCapture::df_match_variable(
  sacct.df,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))

The result is another data frame with an additional column for each named capture group. Note that this also works with data.table:

library(data.table)
sacct.dt <- data.table(sacct.df)
(task.dt <- namedCapture::df_match_variable(
  sacct.dt,
  JobID=list(
    job="[0-9]+", as.integer,
    "_",
    task.pattern,
    "(?:[.]",
    type=".*",
    ")?"),
  Elapsed=list(
    hours="[0-9]+", as.integer,
    ":",
    minutes="[0-9]+", as.integer,
    ":",
    seconds="[0-9]+", as.integer)))
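
To return to the filtering goal stated earlier: once the Elapsed components are integers, the total elapsed time is simple arithmetic. The sketch below is self-contained base R. It uses strsplit as a stand-in for df_match_variable, and it assumes column names that follow the subjectColumnName.groupName convention described above.

```r
elapsed <- c("07:04:42", "00:00:00")
# Stand-in for df_match_variable: split hours:minutes:seconds by hand.
parts <- do.call(rbind, strsplit(elapsed, ":", fixed=TRUE))
elapsed.df <- data.frame(
  Elapsed.hours=as.integer(parts[, 1]),
  Elapsed.minutes=as.integer(parts[, 2]),
  Elapsed.seconds=as.integer(parts[, 3]))
# Total elapsed time in seconds, suitable for numeric filtering.
total.seconds <- with(
  elapsed.df,
  Elapsed.hours * 3600L + Elapsed.minutes * 60L + Elapsed.seconds)
print(total.seconds)
```

A filter such as total.seconds > 0 then keeps only the rows for jobs that actually ran.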