knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
There are several "helper" functions which can simplify the definition of complex patterns. First we define some functions that will help us display the patterns:
one.pattern <- function(pat){
  if(is.character(pat)){
    pat
  }else{
    nc::var_args_list(pat)[["pattern"]]
  }
}
show.patterns <- function(...){
  L <- list(...)
  str(lapply(L, one.pattern))
}
nc::field for reducing repetition
The nc::field function can be used to avoid repetition when defining patterns of the form variable: value. The example below shows three (mostly) equivalent ways to write a regex that captures the text after the colon and space; the captured text is stored in the variable group or output column:
show.patterns(
  "variable: (?<variable>.*)",       #repetitive regex string
  list("variable: ", variable=".*"), #repetitive nc R code
  nc::field("variable", ": ", ".*")) #helper function avoids repetition
Note that the first version above has a named capture group, whereas the second and third patterns generated by nc have an un-named capture group and some non-capturing groups (but they all match the same pattern).
Another example:
show.patterns(
  "Alignment (?<Alignment>[0-9]+)",
  list("Alignment ", Alignment="[0-9]+"),
  nc::field("Alignment", " ", "[0-9]+"))
Another example:
show.patterns(
  "Chromosome:\t+(?<Chromosome>.*)",
  list("Chromosome:\t+", Chromosome=".*"),
  nc::field("Chromosome", ":\t+", ".*"))
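To see a nc::field pattern applied to data rather than just displayed, the sketch below passes the Chromosome pattern to nc::capture_first_vec. It assumes the nc package is installed; the two subject strings are made up for illustration and do not come from the original text.

```r
# Hypothetical subjects: a field name, a colon, one or more tabs, then a value.
field.subject <- c("Chromosome:\tchr1", "Chromosome:\t\tchrX")

# The field name serves both as literal text to match and as the output column name.
field.dt <- nc::capture_first_vec(
  field.subject,
  nc::field("Chromosome", ":\t+", ".*"))
print(field.dt)
```

The result should be a data table with a single Chromosome column holding the text after the tabs.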
nc::quantifier for fewer parentheses
Another helper function is nc::quantifier, which makes patterns easier to read by reducing the number of parentheses required to define sub-patterns with quantifiers. For example, all three patterns below create an optional non-capturing group which contains a named capture group:
show.patterns(
  "(?:-(?<chromEnd>[0-9]+))?",                 #regex string
  list(list("-", chromEnd="[0-9]+"), "?"),     #nc pattern using lists
  nc::quantifier("-", chromEnd="[0-9]+", "?")) #quantifier helper function
Another example with a named capture group inside an optional non-capturing group:
show.patterns(
  "(?: (?<name>[^,}]+))?",
  list(list(" ", name="[^,}]+"), "?"),
  nc::quantifier(" ", name="[^,}]+", "?"))
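As a rough illustration of how an optional group built with nc::quantifier behaves during matching, the sketch below embeds the chromEnd pattern in a larger pattern. It assumes the nc package is installed; the subject strings and the chrom and chromStart groups are assumptions added for this example.

```r
# Hypothetical subjects: the second one has no "-chromEnd" part.
range.subject <- c("chr10:213054000-213055000", "chrM:111000")

range.dt <- nc::capture_first_vec(
  range.subject,
  chrom="chr.*?",
  ":",
  chromStart="[0-9]+",
  # Optional non-capturing group containing the named group chromEnd.
  nc::quantifier("-", chromEnd="[0-9]+", "?"))
print(range.dt)
```

Both subjects should match because the part after the dash is optional; only the first subject supplies text for chromEnd.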
nc::alternatives for simplified alternation
We also provide a helper function for defining regex patterns with alternation. The following three patterns are equivalent:
show.patterns(
  "(?:(?<first>bar+)|(?<second>fo+))",
  list(first="bar+", "|", second="fo+"),
  nc::alternatives(first="bar+", second="fo+"))
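A minimal sketch of the alternation pattern used for matching is shown below. It assumes the nc package is installed, and the two subject strings are invented for illustration.

```r
# Hypothetical subjects: one for each alternative.
alt.subject <- c("barrr", "foo")

alt.dt <- nc::capture_first_vec(
  alt.subject,
  nc::alternatives(first="bar+", second="fo+"))
print(alt.dt)
```

The result should have one column per named group (first and second); for each subject, only the column of the alternative that matched should contain the captured text.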
nc::alternatives_with_shared_groups for alternatives with identical named sub-pattern groups
Sometimes each alternative is just a re-arrangement of the same sub-patterns. For example, consider the following subjects, each of which is a date in one of two formats.
subject.vec <- c("mar 17, 1983", "26 sep 2017", "17 mar 1984")
In each of the two formats, the month consists of three lower-case letters, the day consists of two digits, and the year consists of four digits. Is there a single pattern that can match each of these subjects? Yes, such a pattern can be defined using the code below:
pattern <- nc::alternatives_with_shared_groups(
  month="[a-z]{3}",
  day=list("[0-9]{2}", as.integer),
  year=list("[0-9]{4}", as.integer),
  list(american=list(month, " ", day, ", ", year)),
  list(european=list(day, " ", month, " ", year)))
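In the code above, we used nc::alternatives_with_shared_groups, which requires two kinds of arguments: named arguments (month, day, year), which define the sub-pattern groups shared by the alternatives, and un-named arguments (the last two), which define the alternative patterns, each of which can refer to the sub-pattern groups by name.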
The pattern can be used for matching, and the result is a data table with one column for each unique name:
(match.dt <- nc::capture_first_vec(subject.vec, pattern))
After having parsed the dates into these three columns, we can add a date column:
Sys.setlocale(locale="C") #to recognize months in English.
match.dt[, date := data.table::as.IDate(
  paste(month, day, year), format="%b %d %Y")]
print(match.dt, class=TRUE)
Another example is parsing given and family names, in two different formats:
nc::capture_first_vec(
  c("Toby Dylan Hocking","Hocking, Toby Dylan"),
  nc::alternatives_with_shared_groups(
    family="[A-Z][a-z]+",
    given="[^,]+",
    list(given_first=list(given, " ", family)),
    list(family_first=list(family, ", ", given))
  )
)