A PEG parser takes a string as input and attempts to parse it as a given nonterminal. The key idea of the PEG implementation is that every nonterminal is just a function that takes a string as an argument and attempts to parse that string as its nonterminal. The functions always start from the beginning, but a parse is considered successful if there is material left over at the end.
This makes it easy to model different PEG parsing operations. For
instance, consider the PEG grammar "ab"
, which could also be
written (and "a" "b")
. It matches the string “ab”. Here’s how
that might be implemented in the PEG style:
(define (match-and-a-b str) (match-a str) (match-b str))
As you can see, the use of functions provides an easy way to model
sequencing. In a similar way, one could model (or a b)
with
something like the following:
(define (match-or-a-b str) (or (match-a str) (match-b str)))
Here the semantics of a PEG or
expression map naturally onto
Scheme’s or
operator. This function will attempt to run
(match-a str)
, and return its result if it succeeds. Otherwise it
will run (match-b str)
.
Of course, the code above wouldn’t quite work. We need some way for the parsing functions to communicate. The actual interface used is below.
A parsing function takes three arguments - a string, the length of that string, and the position in that string it should start parsing at. In effect, the parsing functions pass around substrings in pieces - the first argument is a buffer of characters, and the second two give a range within that buffer that the parsing function should look at.
Parsing functions return either #f, if they failed to match their nonterminal, or a list whose first element must be an integer representing the final position in the string they matched and whose cdr can be any other data the function wishes to return, or ’() if it doesn’t have any more data.
The one caveat is that if the extra data it returns is a list, any
adjacent strings in that list will be appended by match-pattern
. For
instance, if a parsing function returns (13 ("a" "b" "c"))
,
match-pattern
will take (13 ("abc"))
as its value.
For example, here is a function to match “ab” using the actual interface.
(define (match-a-b str len pos) (and (<= (+ pos 2) len) (string= str "ab" pos (+ pos 2)) (list (+ pos 2) '()))) ; we return no extra information
The above function can be used to match a string by running
(match-pattern match-a-b "ab")
.
PEG expressions, such as those in a define-peg-pattern
form, are
interpreted internally in two steps.
First, any string PEG is expanded into an s-expression PEG by the code
in the (ice-9 peg string-peg)
module.
Then, the s-expression PEG that results is compiled into a parsing
function by the (ice-9 peg codegen)
module. In particular, the
function compile-peg-pattern
is called on the s-expression. It then
decides what to do based on the form it is passed.
The PEG syntax can be expanded by providing compile-peg-pattern
more
options for what to do with its forms. The extended syntax will be
associated with a symbol, for instance my-parsing-form
, and will
be called on all PEG expressions of the form
(my-parsing-form ...)
The parsing function should take two arguments. The first will be a
syntax object containing a list with all of the arguments to the form
(but not the form’s name), and the second will be the
capture-type
argument that is passed to define-peg-pattern
.
New functions can be registered by calling (add-peg-compiler!
symbol function)
, where symbol
is the symbol that will indicate
a form of this type and function
is the code generating function
described above. The function add-peg-compiler!
is exported from
the (ice-9 peg codegen)
module.