This vignette covers the old three-argument syntax used in the *_named functions; see the other vignette, "recommended variable argument syntax", for documentation about the *_variable functions.
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
The str_match_named and str_match_all_named functions take three arguments as input (the third is optional):
- subject is the character vector from which we want to extract tabular data.
- pattern is the (character scalar) regular expression with named capture groups used for extraction. The named capture groups should be specified literally, e.g. (?P<group_name>subpattern).
- fun.list is a list whose names correspond to capture groups, and whose values are functions used to convert the extracted character data to other (typically numeric) types.
For example, consider the character vector of subject strings below:
subject.vec <- c(
  "chr10:213,054,000-213,055,000",
  "chrM:111,000",
  "this will not match",
  NA, # neither will this.
  "chr1:110-111 chr2:220-222") # two possible matches.
With these genomic range subjects, the goal is to extract the chromosome name (before the colon), along with the numeric start and end locations (separated by a dash). That pattern is coded below:
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
Note that it is often preferable (as above) to code the pattern using paste0, in order to put different parts of the pattern on separate lines. In particular, I recommend putting each named capture group on its own line; this is one of the ideas used in the variable argument syntax (see the other vignette for more info).
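As a quick check of the paste0 approach, the sketch below confirms that the multi-line definition yields the same single string as a one-line pattern. It uses only base R; regexpr with perl=TRUE is used here as a stand-in to list the group names, and one.line is a hypothetical variable introduced for the comparison.

```r
# Build the pattern piece by piece, one named group per line.
chr.pos.pattern <- paste0(
  "(?P<chrom>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")

# The same pattern written as one string (hypothetical name, for comparison only).
one.line <- "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
identical(chr.pos.pattern, one.line)

# Base R's PCRE interface reports the capture group names found in the pattern.
m <- regexpr(chr.pos.pattern, "chr10:213,054,000-213,055,000", perl=TRUE)
attr(m, "capture.names")
```

The non-capturing group (?:...) does not appear among the capture names, which is why only three columns are expected in the extracted table.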
Using the pattern on the subjects above results in:
(match.mat <- namedCapture::str_match_named(subject.vec, chr.pos.pattern))
str(match.mat)
Note that the third argument (list of conversion functions) is omitted in the code above. In that case, the return value is a character matrix, in which missing values indicate missing subjects or no match. The empty string is used for optional groups which are not used in the match (e.g. chromEnd group/column for second subject).
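To see where the empty string for an unused optional group can come from, here is a minimal base R sketch that extracts the groups for the second subject by hand. It relies on the capture.start and capture.length attributes returned by regexpr with perl=TRUE; it is an illustration of the idea, not a statement about how the package is implemented, and the variable names are hypothetical.

```r
pattern <- "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"
subject <- "chrM:111,000"

# perl=TRUE makes regexpr record the position of every capture group.
m <- regexpr(pattern, subject, perl=TRUE)
start <- attr(m, "capture.start")
len <- attr(m, "capture.length")

# Cut each group out of the subject; a group that did not take part
# in the match yields an empty substring.
groups <- substring(subject, start, start + len - 1)
names(groups) <- attr(m, "capture.names")
groups
```

Here chromEnd comes out as the empty string because the optional "-end" part is absent from the subject.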
However, we often want to extract numeric data; in this case we want to convert chromStart and chromEnd to integers. You can do that by supplying a named list of conversion functions as the third argument. Each function should take exactly one argument, a character vector (the data in the matched column/group), and return a vector of the same size. The code below specifies the keep.digits function for both chromStart and chromEnd.
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))
conversion.list <- list(chromStart=keep.digits, chromEnd=keep.digits)
(match.df <- namedCapture::str_match_named(
  subject.vec, chr.pos.pattern, conversion.list))
str(match.df)
Note that a data.frame is returned when the third argument is specified, in order to handle non-character data types returned by the conversion functions.
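The contract for a conversion function (one character vector in, a vector of the same size out) can be checked in isolation. The sketch below applies keep.digits to three example values chosen for illustration: a matched group, the empty string of an unused optional group, and a missing value.

```r
keep.digits <- function(x) as.integer(gsub("[^0-9]", "", x))

# Example inputs (assumed for illustration): matched text, unused group, missing subject.
group.values <- c("213,054,000", "", NA)
converted <- keep.digits(group.values)
converted

# The output has one element per input element, as required.
length(converted) == length(group.values)
```

Both the empty string and NA convert to a missing integer, so unused optional groups and non-matching subjects both show up as NA after conversion.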
Note in the examples above that the last subject has two possible matches, but only the first is returned by str_match_named. Use str_match_all_named to get ALL matches in each subject (not just the first match).
namedCapture::str_match_all_named(
  subject.vec, chr.pos.pattern, conversion.list)
As shown above, the result is a list with one element for each subject. Each list element is a data.frame with one row for each match.
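The difference between first-match and all-matches extraction can also be seen with base R alone. The sketch below is an analogy using regexpr versus gregexpr (with regmatches) on the last subject; it returns whole matched strings rather than a table of groups, and it is not the package's own code.

```r
subject <- "chr1:110-111 chr2:220-222"
pattern <- "(?P<chrom>chr.*?):(?P<chromStart>[0-9,]+)(?:-(?P<chromEnd>[0-9,]+))?"

# regexpr finds only the first match in the subject.
first.match <- regmatches(subject, regexpr(pattern, subject, perl=TRUE))
first.match

# gregexpr finds every match; regmatches returns a list with one element per subject.
all.matches <- regmatches(subject, gregexpr(pattern, subject, perl=TRUE))[[1]]
all.matches
```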
If the pattern specifies the name group, then it will be used for the rownames of the output, and it will not be included as a column. For example, the pattern below uses name for the first group:
name.pattern <- paste0(
  "(?P<name>chr.*?)",
  ":",
  "(?P<chromStart>[0-9,]+)",
  "(?:",
    "-",
    "(?P<chromEnd>[0-9,]+)",
  ")?")
try(named.mat <- namedCapture::str_match_named(
  subject.vec, name.pattern, conversion.list))
(named.mat <- namedCapture::str_match_named(
  subject.vec[-(3:4)], name.pattern, conversion.list))
(named.list <- namedCapture::str_match_all_named(
  subject.vec, name.pattern, conversion.list))
Note that the code above uses try because it is an error if any name groups are missing (as they are for subjects 3 and 4).
The named output feature makes it easy to select particular elements of the extracted data by name, e.g.
named.mat["chr1", "chromStart"]
named.list[[5]]["chr2", "chromStart"]
Note that if the subject is named, its names will be used to name the output (rownames or list names).
named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
namedCapture::str_match_all_named(
  named.subject.vec, chr.pos.pattern, conversion.list)
If the subject has names, and the name group is specified, then the subject names are used to name the output (and the name column is included in the output).
named.subject.vec <- c(
  ten="chr10:213,054,000-213,055,000",
  M="chrM:111,000",
  nomatch="this will not match",
  missing=NA, # neither will this.
  two="chr1:110-111 chr2:220-222") # two possible matches.
namedCapture::str_match_named(
  named.subject.vec, name.pattern, conversion.list)
namedCapture::str_match_all_named(
  named.subject.vec, name.pattern, conversion.list)
Next, read the "recommended variable argument syntax" vignette for information about how to use the *_variable functions.