The nc::capture_all_str function is for the common case of extracting each match from a multi-line text file (a single large subject string). In this section we demonstrate how to extract data tables from such loosely structured text data. For example, we consider the following track hub meta-data file:
trackDb.txt.gz <- system.file(
  "extdata", "trackDb.txt.gz", package="nc")
trackDb.vec <- readLines(trackDb.txt.gz)
Some representative lines from that file are shown below.
cat(trackDb.vec[78:107], sep="\n")
Each block of text begins with "track" and includes several lines of data before the block ends with two consecutive newlines. That pattern is coded below using a regex:
tracks.dt <- nc::capture_all_str(
  trackDb.vec,
  "track ",
  track="\\S+",
  fields="(?:\n[^\n]+)*",
  "\n")
str(tracks.dt)
The result is a data.table with one row for each track block that matches the regex. There are two character columns: track is a unique name, and fields is a string with the rest of the data in that block:
tracks.dt[, .(track, fields.start=substr(fields, 1, 30))]
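The block pattern can also be tried out on a small subject string. The following is a minimal base R sketch, not a description of how nc works internally: toy.lines is made-up data in the style of the file, and the two capture groups are written by hand as parentheses instead of named arguments.

```r
# Toy input in the style of trackDb.txt (hypothetical data, not from the file).
toy.lines <- c(
  "track bcell_Coverage",
  "type bigWig",
  "color 0,0,255",
  "",
  "track tcell_Peaks",
  "type bigBed",
  "")
# Collapse the lines into one subject string, separated by newlines.
subject <- paste(toy.lines, collapse="\n")
# Same pattern as above, with both capture groups written as parentheses.
block.pattern <- "track (\\S+)((?:\n[^\n]+)*)\n"
block.vec <- regmatches(
  subject, gregexpr(block.pattern, subject, perl=TRUE))[[1]]
# Pull each capture group out of every matched block.
toy.tracks <- data.frame(
  track=sub(block.pattern, "\\1", block.vec, perl=TRUE),
  fields=sub(block.pattern, "\\2", block.vec, perl=TRUE),
  stringsAsFactors=FALSE)
print(toy.tracks)
```

Each matched block stops at the first empty line, because the repeated group requires at least one non-newline character after every newline.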
Each block has a variable number of lines/fields. Each line starts with a field name, followed by a space, followed by the field value. That regex is coded below:
(fields.dt <- tracks.dt[, nc::capture_all_str(
  fields,
  "\\s+",
  variable=".*?",
  " ",
  value="[^\n]+"),
  by=track])
str(fields.dt)
Note that because by=track was specified, nc::capture_all_str is called for each unique value of track (i.e. each row). The results are combined into a single data.table with one row for each field. This data.table can be easily queried, for example:
fields.dt[
  J("tcell_McGill0107Coverage", "bigDataUrl"),
  value,
  on=.(track, variable)]
fields.dt[, .(count=.N), by=variable][order(count)]
For more information about data.table syntax, read vignette("datatable-intro", package="data.table").
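The field pattern used earlier (whitespace, field name, space, value) can be checked in the same way on a single block. This is a base R sketch under the assumption that toy.fields looks like one value of the fields column; the values are invented for illustration.

```r
# One block's worth of fields (hypothetical values).
toy.fields <- "\ntype bigWig\ncolor 0,0,255\nshortLabel B cell coverage"
# Whitespace, a lazily matched field name, one space, then the rest of the line.
field.pattern <- "\\s+(.*?) ([^\n]+)"
field.vec <- regmatches(
  toy.fields, gregexpr(field.pattern, toy.fields, perl=TRUE))[[1]]
# One row per field: the name and the value are the two capture groups.
toy.field.df <- data.frame(
  variable=sub(field.pattern, "\\1", field.vec, perl=TRUE),
  value=sub(field.pattern, "\\2", field.vec, perl=TRUE),
  stringsAsFactors=FALSE)
print(toy.field.df)
```

Because the field name is matched lazily, it stops at the first space, so a value that itself contains spaces stays in one piece.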
In the examples above we extracted all fields from all tracks (using two regexes, one for the track, one for the field). In the example below we extract only the track name, split into separate columns (using a single regex for the track).
cell.sample.type <- list(
  cellType="[^ ]*?",
  "_",
  sampleName=list(
    "McGill",
    sampleID="[0-9]+", as.integer),
  dataType="Coverage|Peaks")
nc::capture_all_str(trackDb.vec, cell.sample.type)
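For comparison, the nesting in cell.sample.type can be written as a single hand-made regex. The base R sketch below is only roughly equivalent to what the named lists express (the regex that nc actually generates may differ), and the group names are attached manually.

```r
toy.name <- "tcell_McGill0107Coverage"
# Hand-written regex: group 3 (sampleID) sits inside group 2 (sampleName).
nested.pattern <- "([^ ]*?)_(McGill([0-9]+))(Coverage|Peaks)"
match.vec <- regmatches(
  toy.name, regexec(nested.pattern, toy.name, perl=TRUE))[[1]]
# The first element is the whole match, followed by one element per group.
names(match.vec) <- c(
  "match", "cellType", "sampleName", "sampleID", "dataType")
print(match.vec)
# Convert the inner group, mirroring the as.integer conversion in the nc pattern.
sampleID.int <- as.integer(match.vec[["sampleID"]])
print(sampleID.int)
```

Groups are numbered by the position of their opening parenthesis, which is why the inner sampleID group comes right after the outer sampleName group.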
Note that the pattern above defines nested capture groups via named lists (e.g. sampleID is nested inside sampleName). The pattern below matches either the previously specified track pattern, or any other type of track name:
sample.or.anything <- list(
  cell.sample.type,
  "|",
  "[^\n]+")
track.pattern.old <- list(
  "track ",
  track=sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern.old)
Notice the repetition of track in the pattern above. This can be avoided by using the nc::field helper function, which takes three arguments that are pasted together to form a pattern:
- field.name is used as a pattern, and as the capture group (column) name for the pattern specified in the third argument.
- between.pattern is a pattern that matches between the other two patterns.
- field.pattern is the pattern that matches the text to be extracted in a capture group.
The example above can thus be re-written as below, avoiding the repetition of track which was present above:
track.pattern <- nc::field("track", " ", sample.or.anything)
nc::capture_all_str(trackDb.vec, track.pattern)
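The idea behind the three arguments can be imitated with plain strings. In the sketch below, field_regex is a hypothetical helper written for illustration (it is not part of nc and does not reproduce its implementation); it only shows how one name can serve as both the literal text to match and the label of the captured value.

```r
# Hypothetical helper imitating the idea of nc::field with plain strings:
# the field name is both a literal pattern and the name of the captured value.
field_regex <- function(field.name, between.pattern, field.pattern) {
  list(
    name=field.name,
    regex=paste0(field.name, between.pattern, "(", field.pattern, ")"))
}
type.field <- field_regex("type", " ", "[^\n]+")
# Toy block in the style of the track hub file (hypothetical data).
toy.block <- "track bcell_Coverage\ntype bigWig\ncolor 0,0,255\n"
type.match <- regmatches(
  toy.block, regexec(type.field$regex, toy.block, perl=TRUE))[[1]]
# Label the captured value with the field name.
type.value <- setNames(type.match[2], type.field$name)
print(type.value)
```

Writing the name once removes the risk of the literal text and the column name drifting apart when the pattern is edited.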
Finally we use field again to match the type column:
any.lines.pattern <- "(?:\n[^\n]+)*"
nc::capture_all_str(
  trackDb.vec,
  track.pattern,
  any.lines.pattern,
  "\\s+",
  nc::field("type", " ", "[^\n]+"))
Exercise for the reader (easy): modify the above regex in order to capture the bigDataUrl field, and three additional columns (red, green, blue) from the color field. Assume that bigDataUrl occurs before color in each track. Note that this ordering assumption is a limitation of the single regex approach; using two regexes, as described in previous sections, could extract any/all fields, even if they appear in different orders in different tracks.
Exercise for the reader (hard): note that the last code block only matches tracks which define the type field. How would you optionally match the type field? Hint: the current any.lines.pattern can match the type field.
Thanks to Marc Tollis for providing the example data used in this section (from the SweeD bioinformatics program). Some representative lines from one output file are shown below.
info.txt.gz <- system.file(
  "extdata", "SweeD_Info.txt.gz", package="nc")
info.vec <- readLines(info.txt.gz)
info.vec[20:50]
The Alignment numbers must be matched with the numbers that follow the double slashes in the other file:
report.txt.gz <- system.file(
  "extdata", "SweeD_Report.txt.gz", package="nc")
report.vec <- readLines(report.txt.gz)
cat(report.vec[1:10], sep="\n")
cat(report.vec[1000:1010], sep="\n")
The goal is to produce a bed file, which has tab-separated values with four columns: chrom, chromStart, chromEnd, Likelihood. The chrom values appear in the info file (Chromosome) so we will need to join the two files based on alignment ID. First we capture all alignments in the info file:
(info.dt <- nc::capture_all_str(
  info.vec,
  "Alignment ",
  alignment="[0-9]+",
  "\n\n\t\tChromosome:\t\t",
  chrom=".*",
  "\n"))
Then we capture all alignment/csv blocks in the report file:
(report.dt <- nc::capture_all_str(
  report.vec,
  "//",
  alignment="[0-9]+",
  "\n",
  csv="[^/]+"
)[, {
  data.table::fread(text=csv)
}, by=alignment])
Note that because by=alignment was specified, fread is called for each unique value of alignment (i.e. each row). The results are combined into a single data.table with all of the csv data from the original file, plus the additional alignment column. Next, we join this table to the previous table in order to get the chrom column:
(join.dt <- report.dt[info.dt, on=.(alignment)])
Finally, the desired bed table can be created via:
join.dt[, .(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood)]
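The join and the bed columns can be rehearsed on a few rows without either package. The sketch below uses base R merge in place of the data.table join, and toy.info and toy.report are hypothetical stand-ins for info.dt and report.dt with invented values. Note one difference: merge performs an inner join by default, whereas report.dt[info.dt, on=.(alignment)] keeps every row of info.dt.

```r
# Hypothetical toy tables standing in for info.dt and report.dt.
toy.info <- data.frame(
  alignment=c("1", "2"),
  chrom=c("scaffold_0", "scaffold_1"),
  stringsAsFactors=FALSE)
toy.report <- data.frame(
  alignment=c("1", "1", "2"),
  Position=c(700, 1400, 900),
  Likelihood=c(0.01, 2.5, 0.3),
  stringsAsFactors=FALSE)
# Attach the chrom name to every report row with the same alignment ID.
joined <- merge(toy.report, toy.info, by="alignment")
# Each position becomes a one-base interval: start = Position-1, end = Position.
toy.bed <- with(joined, data.frame(
  chrom,
  chromStart=as.integer(Position-1),
  chromEnd=as.integer(Position),
  Likelihood,
  stringsAsFactors=FALSE))
# Print as tab-separated values without a header or row names.
write.table(
  toy.bed, sep="\t", quote=FALSE, row.names=FALSE, col.names=FALSE)
```

Passing a file name as the file argument of write.table would save the same tab-separated rows to disk instead of printing them.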
Exercise for the reader (easy): notice that the code above for creating info.dt involves repetition in the pattern and group names (alignment, Alignment, chrom, Chromosome). Re-write the pattern using nc::field in order to eliminate that repetition.
Exercise for the reader (hard): notice that Chromosome is only the first field. How could you extract the other fields as well? Hint: use nc::field in a helper function in order to avoid repetition.