Parse-EZ : Clojure Parser Library

Parse-EZ is a parser library for Clojure programmers. It allows easy
mixing of declarative and imperative styles and does not
require any special constructs, macros, monads, etc. to write custom parsers.
All the parsing is implemented using regular Clojure functions.

The library provides a number of
parse functions and combinators and comes with a built-in customizable infix
expression parser and evaluator. It allows the programmer to concisely specify
the structure of input text using Clojure functions and easily build parse trees
without having to step out of Clojure. Whether you are writing a parser
for well-structured data, scraping data, or prototyping a new language,
you can use this library to quickly create a parser.

Features

Parse functions and Combinators

Automatic handling of whitespace and comments

Marking positions and backtracking

Seek, read, skip string/regex patterns

Built-in customizable expression parser and evaluator

Exception-based error handling

Custom error messages

Usage

Installation

Just add Parse-EZ as a dependency to your lein project:

[protoflex/parse-ez "0.4.1"]

and run

lein deps

A Taste of Parse-EZ

Here are a couple of sample parsers to give you a taste of the parser library.

CSV Parser

A CSV file contains multiple records, one record per line, with field values separated by a delimiter
such as a comma or a tab. The field values may optionally be quoted using either single or double
quotes. Quoted field values may contain the field-delimiter characters, and in such
cases those characters are not treated as field separators.

First, let us define a parse function for parsing one line of a CSV file:

(defn csv-1 [sep] (sep-by #(any-string sep) #(chr sep)))

In the above function definition, we make use of the parse combinator sep-by
which takes two arguments: the first one to read a field-value and the second
one to read the separator. Here, we have used Clojure's anonymous function shortcuts to
specify the desired behavior succinctly. The any-string function matches a single-quoted
string or a double-quoted string or a plain-string that is followed by the specified separator
sep. This is exactly the function that we need to read the field-value. The second argument
provided to sep-by above uses the primitive parse function chr which succeeds only when
the next character in the input matches its argument (sep parameter in this case). The csv-1 function returns the field values as a vector.

The sep-by function actually takes an optional third argument: a function that
matches the record separator. It defaults to a function that matches a newline, so we
didn't pass a third argument above. Had the default been different, we would have
written the above function as:

(defn csv-1 [sep] (sep-by #(any-string sep) #(chr sep) #(regex #"\r?\n")))

Now that we have a parse function for a single line of a CSV
file, let us write another parse function that parses the entire CSV file
content and returns the result as a vector of vectors of field values
(one vector per record/line). All we need to do is repeatedly apply the
csv-1 function defined above, and the multi* parse combinator does
just that.

Just one small but important detail: by default, Parse-EZ
automatically trims whitespace after successfully applying a parse function.
This means that the newline at the end of line would be consumed after reading
the last field value and the sep-by would be unable to match the end-of-line
which is the record-separator in this case. So, we will disable the newline
trimming functionality using the no-trim combinator.

(defn csv [sep] (multi* (fn [] (no-trim #(csv-1 sep)))))

Alternatively, you can express the above function a bit more easily using the macro versions of combinators introduced in Version 0.3.0 as follows:

(defn csv [sep] (multi* (no-trim_ (csv-1 sep))))

Now, let us try out our csv parser. First let us define a couple of test
strings containing a couple of records (lines) each. Note that the second
string contains a comma inside the first cell (a quoted string).
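
As a minimal sketch of such a try-out, the snippet below repeats the two definitions from above so that it is self-contained. The contents of s1 and s2 are assumed for illustration only, and the results are bound to names instead of being printed, since the exact REPL output is not reproduced here.

```clojure
(use 'protoflex.parse)

;; The two parse functions defined earlier in this section.
(defn csv-1 [sep] (sep-by #(any-string sep) #(chr sep)))
(defn csv [sep] (multi* (fn [] (no-trim #(csv-1 sep)))))

;; Hypothetical test data: two records of three fields each.
(def s1 "1abc,def,ghi\n2jkl,mno,pqr\n")

;; Same shape, but the first cell is quoted and contains a comma.
(def s2 "'1a,bc',def,ghi\n2jkl,mno,pqr\n")

;; parse takes a parse function and the input string.
(def r1 (parse #(csv \,) s1))
(def r2 (parse #(csv \,) s2))
```

If the parser behaves as described above, r1 and r2 each hold one vector of field values per line, and the comma inside the quoted cell of s2 stays part of the first field.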

All we had to do was write two lines of Clojure code to implement the CSV parser.
Let's add a bit more functionality: CSV files may use a comma or a tab character to
separate the field values. Let's say we don't know ahead of time which character
a file uses as a separator and we want to detect the separator automatically. Note
that both characters may occur in a data file, but only one acts as the field separator, and
even then only when it's not inside a quoted string.

Here is our strategy to detect the separator:

if the first field value is quoted (single or double), read the quoted string

otherwise, read up to the first comma or tab

the next character is the separator

Note how the detect-sep function in the listing below uses the mark-pos and
back-to-mark Parse-EZ functions to 'unconsume' the consumed input.

The complete code for the sample CSV parser with the separator-detection functionality is
listed below (you can find this in the csv_parse.clj file under the examples directory).

(ns protoflex.examples.csv_parse
  (:use [protoflex.parse]))

(declare detect-sep csv-1)

(defn csv
  "Reads and returns one or more records as a vector of vector of field-values"
  ([] (csv (no-trim #(detect-sep))))
  ([sep] (multi* (fn [] (no-trim-nl #(csv-1 sep))))))

(defn csv-1
  "Reads and returns the fields of one record (line)"
  [sep]
  (sep-by #(any-string sep) #(chr sep)))

(defn detect-sep
  "Detects the separator used in a csv file (a comma or a tab)"
  []
  (let [m (mark-pos)
        s (attempt #(any dq-str sq-str))
        s (if s s (no-trim #(read-to-re #",|\t")))
        sep (read-ch)]
    (back-to-mark m)
    sep))

Let's try out the new auto-detect functionality. Let us define two new test
strings s3 and s4 that use tab character as field-separator.

As you can see, this time we didn't specify what field-separator to use: the parser
itself detected the field-separator character and used it, returning us the desired
results.
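
A sketch of that try-out follows. It repeats the definitions from the listing above (without the ns form) so that it stands alone. The contents of s3 and s4 are assumed for illustration, and no output is shown because the original REPL session is not reproduced here.

```clojure
(use 'protoflex.parse)

(declare detect-sep csv-1)

;; Definitions copied from the csv_parse.clj listing above.
(defn csv
  ([] (csv (no-trim #(detect-sep))))
  ([sep] (multi* (fn [] (no-trim-nl #(csv-1 sep))))))

(defn csv-1 [sep] (sep-by #(any-string sep) #(chr sep)))

(defn detect-sep []
  (let [m (mark-pos)
        s (attempt #(any dq-str sq-str))
        s (if s s (no-trim #(read-to-re #",|\t")))
        sep (read-ch)]
    (back-to-mark m)
    sep))

;; Hypothetical tab-separated test data.
(def s3 "1abc\tdef\tghi\n2jkl\tmno\tpqr\n")

;; The first cell is quoted and contains a tab.
(def s4 "'1a\tbc'\tdef\tghi\n2jkl\tmno\tpqr\n")

;; No separator is passed: the zero-argument arity calls detect-sep first.
(def r3 (parse csv s3))
(def r4 (parse csv s4))
```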

XML Parser

Here is the listing of a sample XML parser implemented using Parse-EZ. You can find the
source file in the examples directory. The parser returns a map containing keys and values
for :tag, :attributes and :children for the root element. The value for :attributes key
is itself another map containing attribute names and their values. The value for :children
key is a vector (potentially empty) containing string content and/or maps for child elements.

The function parse-xml is the entry point that kicks off parsing of input xml string xml-str. It passes the between combinator to Parse-EZ's parse function. Here, the call to between returns the value returned by the element parse function, ignoring the content surrounding it (matched by prolog and pi functions). The block-comment delimiters are set to match XML's and the line-comment delimiter is cleared (by default these match Java comments).

The parse function pi is used to skip consecutive processing instructions by using the delimiters <? and ?>.

The parse function prolog is used to skip DTD declaration (if any) and also any surrounding processing instructions. Note that the regex used to match DTD declaration is only meant for illustration purposes. It isn't complete but will work in most cases.

The element parse function matches an xml element and returns the tag, attribute list and children in a hash map. Note the usage of the look-ahead* combinator to handle both the cases -- with children and without children. If it sees a ">" after reading the attributes, the look-ahead* function calls the children-and-close parse function to read children and the element close tag. On the other hand, if it sees "/>" after the attributes, it calls the (almost) empty parse function that simply returns an empty list.

Each child item is read using the elem-or-text parse function while ignoring any surrounding processing instructions using the between combinator; the combinator multi* is used to read all the child items.

The look-ahead parse combinator is used to call different parse functions
based on different lookahead strings. Note that the look-ahead function
doesn't consume the lookahead string unlike the look-ahead* function used
earlier (in the definition of element parse function).

Comments and Whitespaces

By default, Parse-EZ automatically handles comments and whitespace. This
behavior can be turned on or off temporarily using the macros with-trim-on
and with-trim-off respectively. The parser option :auto-trim can be used to
enable or disable the auto handling of whitespace and comments. Use the parser
option :blk-cmt-delim to specify the begin and end delimiters for block
comments. The parser option :line-cmt-start can be used to specify the line
comment marker. By default, these options are set to Java/C++ block and line
comment markers respectively. You can alter the whitespace recognizer by setting
the :ws-regex parser option. By default it is set to #"\s+".
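
The options above can be sketched as follows. The helper name words and the input strings are made up for illustration. Passing options as keyword arguments to parse follows the :eof usage shown later in this section; the two-element vector used for :blk-cmt-delim and the nil used to clear :line-cmt-start are assumptions about the option formats.

```clojure
(use 'protoflex.parse)

;; Hypothetical parse function: reads zero or more lowercase words.
(defn words [] (multi* #(regex #"[a-z]+")))

;; Default options: whitespace and Java/C++-style comments are
;; trimmed after each successful match.
(def r1 (parse words "abc /* note */ def // rest of line\nghi"))

;; XML-style block comments, with line comments turned off.
;; Assumed formats: [begin end] strings for block comments, nil for no line comments.
(def r2 (parse words "abc <!-- note --> def"
               :blk-cmt-delim ["<!--" "-->"]
               :line-cmt-start nil))
```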

Alternatively, you can turn off auto-handling of whitespace and comments and use
the lexeme function which trims the whitespace/comments after application of the
parse-function passed as its argument.

Note the parse error for the last parse call. By default, the parse function parses to the
end of the input text. Even though the first 3 characters of the input text are recognized
as valid input, a parse error is generated because the input cursor would not be at the
end of input-text after recognizing "abc".

The parser option :eof can be set to false to allow recognition of partial input:

user> (parse #(string-in ["abc" "def"]) "abcx" :eof false)
"abc"

You can start parsing by looking for some marker patterns using the read-to,
read-to-re, skip-over, skip-over-re functions.

You can create your own parse functions on top of primitive parse-functions and/or
parse combinators provided by Parse-EZ.

Committing to a particular parse branch

Version 0.4.0 added support for committing to a particular parse branch via
the new parse combinators commit and commit-on. These functions make the
parser commit to the current parse branch, making the parser report subsequent
parse-failures in the current branch as parse-errors and preventing it
from trying other alternatives at higher levels.

Nesting Parse Combinators Using Macros

Version 0.3.0 of Parse-EZ adds macro versions of parse combinator functions
to make it easy to nest calls to parse combinators without having to write
nested anonymous functions using the "(fn [] ...)" syntax. Note that Clojure
does not allow nesting of anonymous functions of "#(...)" forms. Whereas
the existing parse combinators take parse functions as arguments and actually
perform parsing and return the parse results, the newly added macros take
parse expressions as arguments and return parse functions (to be passed
to other parse combinators). These macros are named the same as the
corresponding parse combinators but with an underscore ("_") suffix. For example
the macro version of "any" is named "any_".

Error Handling

Parse Errors are handled in Parse-EZ using Exceptions. The default error messages generated
by Parse-EZ include line and column number information and in some cases what is expected
at that location. However, you can provide your own custom error messages by using the
expect parse combinator.
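
Because parse errors surface as exceptions, a caller can wrap parse in an ordinary try/catch. The sketch below assumes the thrown object is a java.lang.Exception (or a subclass of it); try-parse is a hypothetical helper, not part of Parse-EZ.

```clojure
(use 'protoflex.parse)

(defn try-parse
  "Hypothetical helper: returns {:ok value} on success,
  or {:error message} if parsing throws."
  [parse-fn text]
  (try
    {:ok (parse parse-fn text)}
    (catch Exception e
      {:error (.getMessage e)})))

;; The whole input matches, so parsing succeeds.
(def good (try-parse #(string-in ["abc" "def"]) "abc"))

;; The trailing "x" leaves the cursor short of the end of input,
;; which is a parse error under the default :eof setting.
(def bad (try-parse #(string-in ["abc" "def"]) "abcx"))
```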

Expressions

Parse-EZ includes a customizable expression parser expr for parsing expressions in infix
notation and an expression evaluator function eval-expr to evaluate infix expressions.
You can customize the operators, their precedence and associativity using the
:operators option to the parse function. For evaluating expressions, you can optionally
specify the functions to invoke for each operator using the :op-fn-map option.
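
A minimal sketch of both functions with the default operators follows. It assumes that expr and eval-expr are zero-argument parse functions that can be passed directly to parse; the input string is made up, and the shape of the returned parse tree is not shown because it is not documented in this section.

```clojure
(use 'protoflex.parse)

;; Assumption: expr parses an infix expression and returns its parse tree.
(def tree (parse expr "3 + 4 * 5"))

;; Assumption: eval-expr parses the same kind of input and returns the evaluated result.
(def value (parse eval-expr "3 + 4 * 5"))
```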

Parser State

The parser state consists of the input cursor and various parser options (specified or derived)
such as those affecting whitespace and comment parsing, word recognizers, expression parsing,
etc. The parser options can be changed any time in your own parse functions using set-opt.

Note that most of the parse functions affect the parser state (e.g., the input cursor) and hence they are
not pure functions. The side effects could be avoided by making the parser state an explicit
parameter to all the parse functions and returning the changed parser state along with the parse
value from each of the parse functions. However, the result would be a significantly less
programmer-friendly API. We made a design decision to keep the parse functions simple and easy to use
rather than to fanatically keep the functions "pure".

Relation to Parsec

Parsec is a popular parser combinator library written in Haskell. While Parse-EZ
makes use of some of the ideas in there, it is not a port of Parsec to Clojure.