I want to describe some thoughts about parser construction, design decisions and error messaging, mostly for reference, but also to understand how these parsers work.
%or%, and constructing parsersThe parser p1 %or% p2 performs lazy evaluation, meaning
that if p1 succeeds then p2 is never tested.
This is in contrast to the parser that Hutton
(1992) describes1. The choice for lazy evaluation is
motivated by the realization (laziness?) that a parser that evaluates
all possible alternatives may quickly become extremely complex: it
splits into two alternative paths with every %or%. I
wondered whether this decision would cause practical problems, because,
in principle if both p1 and p2 would be
successful the result of this lazy evaluating parser will depend on the
order in which the two alternatives are put. However, I do not believe
that this should lead to problems in practical applications, for two
reasons:
There is one situation in which you could get into problems, namely when the differentiation between two document formats depends on their full consumption by a parser. For example:
Format-1
XXXX
YYYY
ZZZZand Format-2
XXXX
YYYY
QQQQcan only be distinguished when the last line is read. If two
alternative parsers do not read beyond the YYYY, then both
will be successful, and we have an ambiguous interpretation. This is why
we urge you to put an eof() parser at the end of any parser
that you construct. It is also the reason why a warning is issued when a
parser does not completely consume the input2.
When you just get a fail [] as the output when a parser
fails on a long input then it can become challenging to find out what
went wrong. It would help a lot if you would know the index number
(“line” number) of the input on which the parser fails. This means that
the parser has to track where it is and where it fails. Since the only
function that shifts to a next index is the %then%
combinator, and since any other function that shifts to the right, like
the repeater functions (zero_or_more(), etc.) is
based on this function, we just have to count the number of times the
%then% combinator was called up to the failure to know
where the parser failed. A complication is that we may be testing
alternative parsers implemented by the %or% combinator, and
the parser may fail on different lines in the alternative “arms”. In
that case we will say that the parser fails in the element with the
highest index when both arms of an %or% combinator fail.
The package returns a marker object (printed as
[]) with that element index number as an attribute. The
principle is illustrated in the figure below.
Illustration of marker positioning when a parser fails on an
input vector of length 6. Each filled dot represents a
successful parse. A %then% combinator shifts to the next
element whereas an %or% combinator applies alternative
parsers to the input. The parser fails at the red crosses and the red
square. A marker (red square) will be put at the parser that fails at
the input element with the highest index, which is element #5 here.
In the current implementation we chose to track the current element index number with a state variable. The principle of using state variables is described in section 7.4 of R Packages (2e).
Note that proper tracking and error messaging only takes place when
the parser is wrapped in the reporter() function. Below I
illustrate the example above by parsing the first 6 letters of the
alphabet by a parser that fails on the indicated positions.
p <- function() {
  literal("A") %then%
    literal("B") %then%
    (arm1() %or% arm2())
}
arm1 <- function() {
  literal("D") %then%
    literal("E") %then%
    literal("F") %then%
    literal("G")
}
arm2 <- function() {
  literal("C") %then% 
    (arm21() %or% arm22())
}
arm21 <- function() {
  literal("D") %then%
    literal("F") %then%
    literal("G")
}
arm22 <- function() {
  literal("E") %then%
    literal("F") %then%
    literal("G")
}
try(reporter((p() %then% eof()))(LETTERS[1:6]))
#> Error : Parser failed on line 5 of input.
#> Line content: "E"An now successfully parsing on arm22
rexAfter making the first public version of this package I discovered
that there exists a package called rex (“Friendly
regular expressions”) that uses a very similar approach and in a number
of cases even the same commands as this package. The examples in the
package show that it is used to construct parsers for strings,
i.e. single-element character vectors, not multi-element
character vectors. I haven’t tested it, but if it is fast enough I can
imagine that it can be used in parcr’s function
match_s() to create parsers for strings that have a complex
structure.