```r
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```
The goal of this package is to simplify the creation of transparent parsers for
structured text files generated by machines like laboratory instruments. For
example, we use the package to construct parsers for files generated by plate
readers. The data generated by these instruments can usually be exported to
text or spreadsheet files. Such files consist of lines of text organized in
higher-order structures like headers with metadata, blocks of measured values,
etc. It is often convenient to analyze the data in a program like R. To do
that you need a parser that processes these files and creates R objects as
output. The `parcr` package simplifies the task of creating such parsers.
The parsers that are created with this package make extensive use of functional programming. If this topic is new to you, please read about functional programming, in particular the chapters *Function factories* and *Function operators*, in Hadley Wickham's book *Advanced R*.
The `parcr` package contains a set of functions that allow you to create simple
parsers with higher-order functions: functions that can take functions as input
and produce functions as output. These are sometimes called *combinators*. The
ideas behind the package are described in a paper by @Hutton1992. A number of
the functions described in this paper are implemented, with modifications, in
the current package. The package was also heavily inspired by the Ramble
package, which is written in the same vein but without the explicit parsing of
structured text files in mind.
## The output of a parser: a `list`

The parsers constructed with the functions from this package generate a `list`
as output. The parsers read the input vector from left to right. When a parser
fails, the output is the empty list `list()`. However, if the parser is
successful, it produces a `list` with two elements. An element called `L` (the
left part) contains the output generated from the part of the input vector
that was successfully parsed, and an element called `R` (the right part)
contains the remainder that was not parsed. When an entire character vector is
parsed, the content of the `R` element equals `character(0)`. The content of
the `L` part can be shaped to your desire. This is demonstrated in the example
of the fasta file parser later in this document.
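As a minimal illustration of this `L`/`R` structure (using the `literal()` parser introduced below, and assuming `parcr` is installed):

```r
library(parcr)

# Parse only the first element; the second element stays unconsumed
out <- literal("a")(c("a", "b"))
out[["L"]]  # the successfully parsed part
out[["R"]]  # the unconsumed remainder: "b"
```

Accessing the two parts with `[["L"]]` and `[["R"]]` is how the fasta example later in this document extracts its results.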
Please realize that every function described below is a higher-order function:
its output is a function. In its turn, this function can take a character
vector as its input. For example, `literal("a")` yields a function. To use
that function as a parser you have to provide it with a character vector, the
object that needs to be parsed, as input:
```r
literal("a")(c("a","att"))
```
This parser tests whether the next element in its input is literally the string
`"a"`. It will succeed in the example above, but will only consume the first
element of its input and then stop. However, you can also use a higher-order
function like `literal("a")` as input to other higher-order functions to create
more complex parsers. For example, the function `then` takes two parsers `p1`
and `p2` as arguments, `then(p1, p2)`, and applies them in sequence to the
input. In the `parcr` package the function is implemented in the infix form
`%then%`, which makes parser constructs more readable. The composite parser:
```r
literal("a") %then% literal("att")
```
looks for an element with string "a" followed by an element with string "att". Its application to the same vector:
```r
(literal("a") %then% literal("att"))(c("a","att"))
```
will completely consume the input. In this way, using a number of standard parsers defined in the package, you can quickly construct flexible parsers for complex input. Furthermore, the functions also allow you to construct a desired R object as output while parsing.
## The parser functions in the `parcr` package

We will now discuss all of the parser combinator functions present in the package. You should also study their help pages. In particular, the *Pseudocode* listed for each of them should help you to understand their properties.
### The six fundamental parsers

The six fundamental parsers allow you to construct a parser that will completely consume input, or to fail when the input does not satisfy the specifications of the parser.
```r
library(parcr)
```
- `succeed(o)`: where `o` is any kind of R-object.
- `fail()`
The `succeed` and `fail` parsers are the nuts and bolts of parser
construction. The `succeed` parser always succeeds, without consuming any
input, whereas the `fail` parser always fails.

The `succeed` parser constructs a `list` object with a 'left' or `L`-element
that contains the parsed result of the consumed part of the input vector and a
'right' or `R`-element that contains the unconsumed part of the vector. The
`L`-element can contain any R-object that is constructed during parsing.

While `succeed` never fails, `fail` always does, regardless of the input
vector. To signal failure it returns a special form of the empty list
`list()`, namely a `marker` object printed as the icon `[]`.
**Important**: It is unlikely that you will ever use these two functions to construct parsers.
Examples:
```r
succeed("A")("abc")
succeed(data.frame(title = "Keisri hull", author = "Jaan Kross"))(c("Unconsumed", "text"))
```
```r
fail()("abc")
```
The basic functions for recognizing the content of the current element, i.e. the left-most element of the input vector, are:
- `literal(c)`: tests whether the current element equals the string `c`.
- `satisfy(b)`: tests whether the current element satisfies the function `b()`, where `b()` is a logical function: it takes a string as input and returns `TRUE` or `FALSE`.
- `eof()`: tests whether the input is at its end.

`eof()` is a special function that detects the end of a character vector or,
if that character vector represents the lines of a text file, the end of the
file (EOF). In fact, it detects `character(0)` in the input vector and, when
successful, it turns the `R`-side of the output into an empty list (`list()`)
to signal that the end of the vector was detected.
Examples:
```r
literal('abc')(c('abc','def'))
```
```r
starts_with_a <- function(x) { grepl("^a", x) }
satisfy(starts_with_a)(c('abc','def'))
```
And here is an example of an unsuccessful parser:
```r
literal('a')(c('ab','a'))
```
An application of `eof()` to detect that we parsed the input completely:
```r
(literal("a") %then% literal("att") %then% eof())(c("a","att"))
```
Notice how the `R`-element differs from just

```r
(literal("a") %then% literal("att"))(c("a","att"))
```
- `p1 %or% p2`: applies the alternative parsers `p1` and `p2` to the current element and returns the result of the first successful parser, or failure when both fail.
- `p1 %then% p2`: applies parser `p1` to the current element and `p2` to the next element.

The `%or%` combinator enables us to try alternative parsers on the current
element, whereas the `%then%` combinator enables us to test sequences of
elements in a character vector.

Note that `%or%` uses lazy evaluation, which means that the output of `%or%`
depends on the order of `p1` and `p2`: if both would in principle succeed then
only the result of `p1` is returned.
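This order dependence can be made visible by tagging each alternative with `%ret%` (introduced below). A minimal sketch, assuming `parcr` is loaded:

```r
library(parcr)

first  <- literal("a") %ret% "matched by first"
second <- literal("a") %ret% "matched by second"

# Both alternatives would succeed on "a", but because of lazy evaluation
# %or% only evaluates and returns the first one's result
out <- (first %or% second)("a")
out[["L"]]
```

Swapping the two operands would yield the other tag, so with `%or%` you should place the more specific alternative first.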
We also have two variations of the `%then%` combinator, `%xthen%` and
`%thenx%`, which do test but then discard the result from the second or the
first argument, respectively:

- `p1 %xthen% p2`: where `p1` and `p2` are parsers; discards the result from `p2`.
- `p1 %thenx% p2`: where `p1` and `p2` are parsers; discards the result from `p1`.
Examples:
```r
(literal('A') %or% satisfy(starts_with_a))(c('abc','def'))
(literal('A') %then% satisfy(starts_with_a))(c('A', 'abc'))
(literal('>') %thenx% satisfy(starts_with_a))(c('>', 'abc'))
```
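For completeness, a sketch of `%xthen%` on the same input, the mirror image of the `%thenx%` example above (assuming `parcr` is loaded):

```r
library(parcr)
starts_with_a <- function(x) grepl("^a", x)

# %xthen% keeps the result of the left parser and discards the right one,
# so only ">" ends up in the L element
out <- (literal('>') %xthen% satisfy(starts_with_a))(c('>', 'abc'))
out[["L"]]
```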
As said, the six fundamental parsers allow you to construct a parser that will completely consume input. However, when this parser succeeds its output will, apart from the fact that every element is put in a list, be equal to the input. In general, this is not very useful if you want to use the output in other code. Therefore, we have two functions that allow you to modify the output of a successful parser. The basic functions for modifying the output of a parser are:
- `p %ret% c`: when parser `p` is successful it returns the object `c` (a string or `NULL`).
- `p %using% f`: when parser `p` is successful, function `f()` is applied to the input and its output is stored as the result.

Examples:
```r
(literal('a') %ret% "We have an 'a'")(c('a','b'))
(satisfy(starts_with_a) %using% toupper)(c('abc','d'))
```
### Derived parsers

Derived parsers are constructed from the six fundamental parsers.
- `zero_or_one(p)`: where `p` is a parser.
- `zero_or_more(p)`: where `p` is a parser.
- `one_or_more(p)`: where `p` is a parser.
- `exactly(n, p)`: where `n` is an integer and `p` is a parser.
- `match_n(n, p)`: where `n` is an integer and `p` is a parser.

`zero_or_one`, `zero_or_more` and `one_or_more` do exactly what their names
suggest. You should realize that these are greedy parsers: they consume as
many strings as can be successfully parsed by `p`. Similarly, `exactly` is a
greedy parser, and it fails when there are fewer or more than `n` consecutive
strings that can be successfully parsed by `p`. On the other hand, `match_n`
is not greedy: it consumes `n`, but no more, strings that can be successfully
parsed with `p`.
Examples:
This parser will fail on its input because there are too many strings starting with `"a"`:
```r
zero_or_one(satisfy(starts_with_a))(c('acc','aat','cgg'))
```
The following is a successful parse. Note that its result is not merely `[]`,
which would have indicated failure, but an `L,R`-list with an empty list in
the `L`-element.
```r
zero_or_more(satisfy(starts_with_a))(c('cat','gac','cct'))
```
```r
one_or_more(satisfy(starts_with_a))(c('att','aac','cct'))
exactly(2, satisfy(starts_with_a))(c('att','aac','cct'))
match_n(1, satisfy(starts_with_a))(c('att','aac','cct'))
```
### `match_s`
When constructing a parser you will often need to recognize as well as process
strings. For example, you may want to recognize multiple integers in a line,
extract these, and then return them as a numeric vector, ignoring other
elements like comments in these strings. This could be achieved by combining
`satisfy()` and subsequently `%using%`, like:

```r
satisfy(has_integers) %using% process_integers
```

where `has_integers` is a boolean function and `process_integers` is a
function that recognizes, extracts and rearranges numbers into a numeric
vector. You will often find that `has_integers` and `process_integers` use the
same regular expressions. Then it may be more efficient to combine these,
which is what the `match_s()` function does.
The `match_s()` parser takes a simple (not higher-order) function `s` to
process the string from the current element and returns the result from that
function. The function `s` has to be constructed in such a way that it returns
the empty `list()` when the string does not satisfy the criteria that the user
sets.
Example:
Here `numbers` is a function that recognizes and returns numbers (in fact,
positive integers) in a string:
```r
numbers <- function(x) {
  m <- gregexpr("[[:digit:]]+", x)
  matches <- as.numeric(regmatches(x, m)[[1]])
  if (length(matches) == 0) {
    return(list()) # we signal parser failure when no numbers were found
  } else {
    return(matches)
  }
}

match_s(numbers)(" 101 12 187 # a comment on these numbers")
```
- `by_split(p, split, finish = TRUE, fixed = FALSE, perl = FALSE)`: where `p` is a parser.
- `by_symbol(p, finish = TRUE)`: where `p` is a parser.

Although you can use the string processing functions from base or the stringr
package to parse and process individual elements of a character vector, it is
also possible to parse substrings by first splitting a string. `by_split` uses
a `split` pattern to first split the incoming string and then applies the
parser `p` to it. `by_symbol` splits the incoming string into individual
symbols and then applies the parser `p`. The `finish` boolean indicates
whether the parser should completely consume the split string. Under the hood
these functions use the function `strsplit()`, and its `split`, `fixed` and
`perl` arguments are passed on.
Examples:
```r
starts_with_a <- function(x) grepl("^a", x[1])
# don't forget to use satisfy(); it turns starts_with_a into a parser
by_split(one_or_more(satisfy(starts_with_a)), ",", fixed = TRUE)("atggc,acggg,acttg")
```
```r
by_symbol(literal(">") %thenx% one_or_more(literal("b")), finish = FALSE)(">bb")
```
**Note**: Parsers become slow when using these two functions extensively. If
that bothers you then you should use the `match_s()` or `satisfy()` and
`%using%` parsers together with string processing functions like `grepl()` and
`grep()`, or the ones from stringr, to process strings. Those parsers will be
much faster.
- `EmptyLine()`
- `Spacer()`
- `MaybeEmpty()`
The function `EmptyLine()` detects and returns an empty line. Empty lines are
either the string `""` or strings consisting entirely of space-like characters
as identified by the regular expression `\\s`. `Spacer()` detects one or more
consecutive empty lines and discards these, whereas `MaybeEmpty()` detects
zero or more empty lines and discards these.

An additional function `Ignore()` ignores all lines, whether empty or not,
until the end of the file. This is sometimes useful when the interesting part
of a file has been parsed and all else can be ignored until the end of the
file.
Note that I write these functions with capital letters. I use this convention here and in the example below to indicate that these functions parse higher-order structures (higher than the individual strings) in the input.
Examples:
```r
EmptyLine()("")
Spacer()(c(" ", "\t\t\t", "atgcc"))
MaybeEmpty()(c("ggacc", "gatccg", "atgcc"))
(literal("Interesting") %then% Ignore() %then% eof())(c("Interesting", LETTERS))
```
## An example: parsing fasta files

As an example of a somewhat realistic application, let's try to write a parser for fasta-formatted files with mixed nucleotide and protein sequences. Such a fasta file could look like the example below:
```r
data("fastafile")
cat(paste0(fastafile, collapse = "\n"))
```
where the first two are nucleotide sequences and the last is a protein sequence[^1].
[^1]: It is not clear to me whether mixing of sequence types is allowed in the fasta format. I guess not, because a protein sequence consisting entirely of glycine (G), alanine (A), threonine (T) and cysteine (C) would not be distinguishable from a nucleotide sequence. Such protein sequences would be extremely rare. Anyway, I demonstrate here that apart from this ambiguous case it is easy to parse them from a single file.
Since fasta files are text files, we could read such a file using
`readLines()`. Below we simulate the result of reading the file above by
loading the `fastafile` data set present in the package. It consists of a
character vector.
```r
data("fastafile")
```
We can distinguish the following higher-order components in a fasta file:

- A header line: the symbol `>` followed by a title[^2].
- A nucleotide sequence: one or more lines with letters from the set `{G,A,T,C}`.
- A protein sequence: one or more lines with letters from the set `{A,R,N,D,B,C,E,Q,Z,G,H,I,L,K,M,F,P,S,T,W,Y,V}`.

[^2]: Note that real fasta headers and sequences can have more complicated formats than I pretend here.
It now becomes clear what I mean when I say that the package allows us to
write transparent parsers: the description above of the structure of fasta
files can be translated straight into code for a `Fasta()` parser:
```r
Fasta <- function() {
  one_or_more(SequenceBlock()) %then% eof()
}

SequenceBlock <- function() {
  MaybeEmpty() %then% Header() %then% (NuclSequence() %or% ProtSequence())
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString())
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString())
}
```
Notice that these elements are functions taking no input, hence the empty
argument brackets `()` behind their names. They can take input when needed,
for example to change their behavior (like `match_n()`, or see the other
example below).
Now we need to define the string-parsers `Header()`, `NuclSequenceString()`
and `ProtSequenceString()` that recognize and process these elements in the
character vector `fastafile`. We use functions from stringr to do this in
three helper functions, and we use `match_s()` to create parsers from these.
```r
# returns the title after the ">" in the sequence header
parse_header <- function(line) {
  # Study stringr::str_match() to understand what we do here
  m <- stringr::str_match(line, "^>(\\w+)")
  if (is.na(m[1])) {
    return(list()) # signal failure: no title found
  } else {
    return(m[2])
  }
}

# returns a nucleotide sequence string
parse_nucl_sequence_line <- function(line) {
  # The line must consist of GATC from the start (^) until the end ($)
  m <- stringr::str_match(line, "^([GATC]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid nucleotide sequence string
  } else {
    return(m[2])
  }
}

# returns a protein sequence string
parse_prot_sequence_line <- function(line) {
  # The line must consist of ARNDBCEQZGHILKMFPSTWYV from the start (^) until
  # the end ($)
  m <- stringr::str_match(line, "^([ARNDBCEQZGHILKMFPSTWYV]+)$")
  if (is.na(m[1])) {
    return(list()) # signal failure: not a valid protein sequence string
  } else {
    return(m[2])
  }
}
```
Then we define the parsers.
```r
Header <- function() {
  match_s(parse_header)
}

NuclSequenceString <- function() {
  match_s(parse_nucl_sequence_line)
}

ProtSequenceString <- function() {
  match_s(parse_prot_sequence_line)
}
```
Now we have all the elements that we need to apply the `Fasta()` parser.
```r
Fasta()(fastafile)
```
Apart from `match_s()`, we have used only the six fundamental parsers.
Therefore, the output is almost the same as the parsed input. This is not very
useful because it is difficult to extract the individual sequences and titles
from it; we would have to write a sort of parser again to process this output.
To mend this, we have to modify the output of the parsers. The first thing
that we will do is to let every sequence block be returned as an element of a
list. To achieve this we extend the `SequenceBlock` parser by changing its
output with the `%using%` operator:
```r
SequenceBlock <- function() {
  MaybeEmpty() %then% Header() %then%
    (NuclSequence() %or% ProtSequence()) %using%
    function(x) list(x)
}
```
Now the result is a list of three lists, one for each sequence block.
```r
Fasta()(fastafile)[["L"]]
```
In principle, this output is easier to extract information from, but we can
improve on it. First, we want the sequences to appear as one long string, not
as separate character vectors corresponding to the lines in the sequence
block. Therefore, we extend the `NuclSequence` and `ProtSequence` parsers by
collapsing their output:
```r
NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) paste0(x, collapse = "")
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) paste0(x, collapse = "")
}
```
Then we get
```r
Fasta()(fastafile)[["L"]]
```
This looks much better: we know that the first element in each of these lists
is the title and the second element is the complete sequence. Then why not
just attach a name to these elements? This would make extracting the
information even easier. Furthermore, we also report whether the sequence is a
nucleotide or a protein sequence by adding a `type` tag.
```r
Header <- function() {
  match_s(parse_header) %using%
    function(x) list(title = unlist(x))
}

NuclSequence <- function() {
  one_or_more(NuclSequenceString()) %using%
    function(x) list(type = "Nucl", sequence = paste0(x, collapse = ""))
}

ProtSequence <- function() {
  one_or_more(ProtSequenceString()) %using%
    function(x) list(type = "Prot", sequence = paste0(x, collapse = ""))
}
```
Finally, we have our desired output.
```r
d <- Fasta()(fastafile)[["L"]]
d
```
Let's present the result more concisely using the names of these elements:
```r
invisible(lapply(d, function(x) { cat(x$type, x$title, x$sequence, "\n") }))
```
## Parsers with parameters

In the examples above we showed how to create parsers without parameters. It is sometimes easy and useful to create parsers with parameters. The parameters are used to change the behavior of the parsers. For example, when writing online course material I use a simple structured question template that is converted to html when the syllabus is generated. It consists mostly of markdown content. Its parser makes use of parametrized parsers. The structure of such a question template document is as follows[^3]:
[^3]: I simplified the template and code for this example. In fact the content is processed differently depending on the type of element, meaning that `Content()` is a function of `type`. Furthermore, questions are automatically numbered.
```r
qtemp <- c(
  "#### INTRO",
  "## Title about a set of questions",
  "",
  "This is optional introductory text to a set of questions.",
  "Titles preceded by four hashes are not allowed in a question template.",
  "",
  "#### QUESTION",
  "This is the first question",
  "",
  "#### TIP",
  "This would be a tip. tips are optional, and multiple tips can be given. Tips are",
  "wrapped in hide-reveal style html elements.",
  "",
  "#### TIP",
  "This would be a second tip.",
  "",
  "#### ANSWER",
  "The answer to the question is optional and is wrapped in a hide-reveal html element.",
  "",
  "#### QUESTION",
  "This is the second question. No tips for this one",
  "",
  "#### ANSWER",
  "Answer to the second question"
)
```
```r
cat(paste0(c(qtemp, "", "<optionally more questions>"), collapse = "\n"))
```
I stored this example content in a vector `qtemp` to parse it later.
You will notice the recurring structure of a header with four hashes (`####`)
and some text following it. These headers represent four types of elements:
intro, question, tip and answer. Instead of writing separate parsers, we can
create a generic parser for such elements:
```r
HeaderAndContent <- function(type) {
  (Header(type) %then% Content()) %using%
    function(x) list(list(type = type, content = unlist(x)))
}
```
Then we define each of the four parsers as:
```r
Intro    <- function() HeaderAndContent("intro")
Question <- function() HeaderAndContent("question")
Tip      <- function() HeaderAndContent("tip")
Answer   <- function() HeaderAndContent("answer")
```
The function `Header(type)` is defined as
```r
Header <- function(type) satisfy(header(type)) %ret% NULL

# This must also be a generic function: a function that generates a function
# to recognize a header of type 'type'
header <- function(type) {
  function(x) grepl(paste0("^####\\s+", toupper(type), "\\s*"), x)
}
```
The content consists of one or more lines not starting with `####`, which
includes empty lines. We discard trailing empty lines.
```r
Content <- function() {
  one_or_more(match_s(content)) %using%
    function(x) stringr::str_trim(paste0(x, collapse = "\n"), "right")
}

content <- function(x) {
  if (grepl("^####", x)) list() else x
}
```
The complete template is defined as follows:
```r
Template <- function() {
  zero_or_more(Intro()) %then%
    one_or_more(QuestionBlock()) %then%
    eof()
}
```
where `QuestionBlock()` is defined using the previously defined elements as
```r
QuestionBlock <- function() {
  Question() %then%
    zero_or_more(Tip()) %then%
    zero_or_one(Answer()) %using%
    function(x) list(x)
}
```
We can now parse the input. We wrap the `Template()` parser in the
`reporter()` function to get proper error messages and warnings, if
applicable. Furthermore, only the `L`-element, the parsed input, is returned.
```r
reporter(Template())(qtemp)
```