1 Introduction

PARSEQ

Description

Parseq (pronounced parsec) is a parsing library for common lisp.
It can be used for parsing lisp's sequences types: strings, vectors (e.g. binary data) and lists.
Furthermore, parseq is able to parse nested structures such as trees (e.g. lists of lists, lists of vectors, vectors of strings).

Parseq uses parsing expression grammars (PEG)
that can be defined through a simple interface.
Extensions to the standard parsing expressions are available.
Parsing expressions can be parameterised and made context aware.
Additionally, the definition of each parsing expression allows the arbitrary transformation of the parsing tree.

The library is inspired by Esrap and uses a very similar interface.
No code is shared between the two projects, however.
The features of Esrap are are mostly included in parseq and complemented with additional, orthogonal features.
Any resemblance to esrap-liquid is merely coincidental.

The library is still under development.
This means that some features are not yet implemented and that the interface and behaviour may change in the future.
See the warnings below.

Basic usage

The first argument to (defrule ...) is the nonterminal symbol that will represent the rule.
These symbols use a different namespace from everything else.
The second argument is a list of parameters that the rule takes (none in this example).
The third argument specifies the definition of the nonterminal symbol.
After the third argument, multiple processing options can be listed.

In this example, the nonterminal foo requires the parsing of the string "foo".
The rule foobar combines the two rules foo and bar to match the string "foobar", the list ("foo" "bar") or the vector #("foo" "bar").
The above example could alternatively be stated as

(defrule foobar () (and "foo" "bar"))

thus not requiring the rules foo and bar.

Parsing is initiated by calling

(parseq 'foobar sequence)

which will return the list ("foo" "bar") as well as T if sequence is one of "foobar", ("foo" "bar") and #("foo" "bar").
If parsing is not successful, NIL is returned.
The first argument to (parseq ...) is a nonterminal symbol defined through defrule.
Note that the symbol must be quoted.
The second argument is the sequence that should be parsed.
There are optional keyword parameters to (parseq ...):

:start: the position in the sequence at which parsing should start

:end: the position in the sequence at which parsing should stop

:junk-allowed: when set to t, will avoid a parsing error, if the end is not reached

:parse-error: when set to t, a parsing error will signal a error

This concludes the basic usage of the library. Almost everything is done through defrule and parseq.
There are some extra arguments, however, that are explained below.

Installation

in the REPL to download, install and load it.
To access the symbols from the package without the parseq: prefix you can type

(use-package :parseq)

Alternatively, the system is can be downloaded/cloned from GitHub.
The master branch will always point to the latest release.
If the system can be found through ASDF, the system can be loaded by typing the following expressions in the REPL:

Terminals

Terminals (tokens) are the objects that actually appear in the parsed sequence.
The following types are item classes:

symbol matches any lisp symbol

keyword matches any lisp keyword

form matches literally everything

char matches any character

stdchar matches any standard character, see (standard-char-p ...)

alpha matches any alphabetic standard character

digit matches any numeric standard character

alphanumeric matches any alphanumeric standard character

byte matches any unsigned byte

number matches any number

integer matches any integer

atom matches any atom

list matches any list

cons matches any cons

vector matches any vector

string matches any string

t matches anything not nil

nil matches nil or an empty list

Literal values can be specified to match specific items or subsequences in the sequence being parsed:

'foo matches the symbol foo

#\f matches the character 'f'

"foo" matches the subsequence "foo" in a string or the item "foo" in a list or vector

#(1 2 3) matches the the subsequence #(1 2 3) in a vector or the item #(1 2 3) in a list

5 matches the number 5

Terminal expressions allow for mor elaborate specifications:

(char "a-zA-D7-9.,;<=>-") matches any character from the characters or character ranges in the
given string

More terminals may be available in later versions of parseq.

Nonterminals

Nonterminal symbols represent parsing expressions that consist of terminals and/or other nonterminals.
They can be defined using the defrule macro.

Standard expressions

These are the standard combinations of a parsing expression grammar.

Sequence

(and subexpression ...)

The expression succeeds for a sequence if all subexpressions succeed in the given order.
It produces a list of the subexpression results.
On success, the amount of input consumed is determined by the subexpressions.

Ordered choice

(or subexpression ...)

The subexpressions are tried in the given order and the result of the first one that succeeds is accepted.
It produces the result of the succeeding subexpression.
The amount of input consumed depends on the subexpression.
If none of the subexpressions match, the expression fails.

Greedy repetition

(* subexpression)

Tries subexpression consecutively as many times as possible.
This expression always succeeds because zero repetitions are allowed.
A list of the succeeding matches is returned.
The amount of input consumed depends on the subexpression and the number of times it matches.

Greedy positive repetition

(+ subexpression)

Like (* subexpression), but at least one repetition is required.

Optional

(? subexpression)

Consumes and returns whatever subexpression consumes if it succeeds.
This operation is greedy which means that if the subexpression matches, it will consume the corresponding input.
If the subexpression fails, no input is consumed and NIL returned.

Followed-by predicate

(& subexpression)

Succeeds, if subexpression succeeds, but consumes no input.
The result of the subexpression is returned.

Not-followed-by predicate

(! subexpression)

Succeeds if subexpression does not succeed.
This expression consumes no input.
When successful, the next sequence item is returned.

Extended expressions

For convenience, the standard expressions are extended by the following combination methods:

Negation

(not subexpression)

Succeeds, if the subexpression does not succeed.
When that happens, the rule consumes exactly one item in the sequence and returns it.

Sequence (unordered)

(and~ subexpression ...)

The expression succeeds for a sequence if all subexpressions succeed, in any order.
Note that the first subexpression in the list that matches a sequence item will be used.
The expression produces a list of the subexpression results (in the order in which they are listed in the rule definition) and consumes whatever the subexpressions consume.

The first argument to and~~ specifies how may times each subexpression is allowed to be repeated.
(See the rep operator above for a list of abbreviations.)
In this example, the first subexpression is required exactly once, the second one exactly twice, the third zero times or once, the fourth between 2 and 3 times and the fifth at least 4 times.
During parsing, the prioritisation of the subexpressions that are still applicable is from left to right.
The result is a list of lists:
The list is ordered in the same way that subexpressions are given in the rule definition.
The n-th list within the list contains the results of the n-th subexpression in the order in which they are found in the parsed expression.

and used in parsing expressions as (html-tag "a") instead of having to define each rule separately.
The parameters can also be used in the processing options (see below).
Note, however, that since the rule parameter is unknown at compile time, a runtime dispatch will need to be performed on the parameter type.
In the current implementation, only literal terminals and nonterminals are supported as part of the parsing expression.
This will likely change in the future.

It is possible to pass multiple paramters, keywords etc.
The full syntax of lambda lists is allowed.

Processing options

There are several options that can be specified for each rule definition.
The options can have several effects which are described below.

You can specify how processing of the parsing result is done through multiple options (see below).
Additional options (such as :test) do not affect the parse result, but have other effects.
Note that the options are processed in sequence and the output of the previous option is input into the next option:

Lambda / Destructure

Destructures the parsing result according to the specified lambda list and binds the given variables.
The following forms can be used to process the result in whatever way.
The new parsing result is given by what the last form returns.
Note that :lambda and :destructure are actually synonyms.

Choose

(:choose 0 2 '(3 1))

Picks items from the parsing result using the given indices and returns them.
For the parsing result (15 #\: 24 (#\: 57)) (which might be the parsing result of a time string), the above processing option would return (15 24 57).
Non-existent items will result in NIL being returned for them.

Function

(:function #'+)

The parsing result is handed over to the function specified (here: +).
Note that the lambda list of the given function has to match the items in the parsing result.
The new parsing result is whatever the function returns.

Identity

(:identity t)

Returns the parsing result if the argument is not NIL and NIL otherwise.

Flatten

String

(:string)

Flattens the parsing result and concatenates the list items to into a string, i.e. (#\a ("bc" (#\d) 'e) (#\f #x67)) becomes "abcdEfg".
The items that can be concatenated are strings, characters, unsigned bytes and symbols.

Vector

Parse result testing

These options do not affect the parse result, but can make the rule fail depending on their input.
If a rule fails because of such an option, the processing of subsequent options is skipped.

Test

(:test (x) (and (numberp x) (> x 10)))

Like :lambda or :destructure, except that the return value of the function body is used as a predicate to determine whether the rule succeeds or not.
Therefore, if the body of the test returns NIL, the rule fails.
Otherwise, the unaltered result is returned.
Note that the input to the test (function arguments) depends on the preceding processing options.

In the above example, the rule fails if the parse result is not an number greater than 10.

The following rule matches any symbol except baz:

(defrule not-baz () symbol (:not (x) (eql x 'baz)))

This is not possible with (not 'baz) because that would allow terminals other than symbols, e.g. 5 (which is not a symbol).

Reverse test

(:not (x) (and (numberp x) (> x 10)))

Same as :test, but logically inverted.
In this example, the rule fails if the parse result is an number greater than 10.

Variables

Rules can bind variables that can be accessed/modified by subexpressions.

Variable binding

(:let a b (c 10))

Binds the specified variables (dynamically).
Subexpressions (and subexpressions of subexpressions, etc) of the rule have access to these variables and can even modify them.
In order to access the variables, the subexpressions have to declare them using :external (see below).
If a subexpression binds the same variable with another :let, the previous binding is shadowed until the subexpression returns.
For an example of variable usage, see the section 'Using context' below.

External bindings

(:external a b c)

Declares the specified variables.
If the rule is called by a superior rule that binds these variables (using :let, see above), this rule can use and modify the variables.
It is an error if a rule using external variables is called when the variables are unbound (i.e. the rule must be called as a subexpression to a rule defining the variables).

Packrat parsing

Packrat parsing can be enabled for a rule by using the (:packrat t) option.
See the wiki page for more information.

Using context

Parsing in parseq can be made context aware using two methods.

Context through tests

A single rule can verify one part of its result against another.
For instance, if a tag is a name followed by a type and a value, then the rule

Within its body, rules from outside are still defined and can be redefined without affecting the outside.
The rules from outside are saved before entering the body and restored when the body returns.

Upcoming features

These features may be implemented in the future:

Support for rule parameters in parsing expressions other than literal terminals and nonterminals.

Short forms for combined nonterminals, e.g.

(? (and ...))

(? (or ...))
or multiple arguments to (? ...) signifying either a sequence or a choice.

Support for streams (how?)

Speed and efficiency

Custom terminals

Custom non-terminal expressions

Custom sequences, i.e. parse anything

Warnings

Please heed the following warnings:

The interface and behaviour of parseq are not yet frozen.
New versions may break programs using the library.

The library should work with SBCL, CMUCL, ECL and CLISP.
Other lisp implementations are untested.

Parseq comes with no warranty whatsoever.

Licence

Parseq is distributed with the GNU General Public License, version 2:

Copyright (C) 2017 Marco Rossini

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.