Writing a Simple Parser in Python

From time to time one might need to write a simple language parser to implement a domain-specific language for an application. As always, the Python ecosystem offers various solutions (an overview of Python parser generators is available online). In this article I'd like to describe my experience with the parsimonious package. For a recent project of mine (imap_detach, a tool to automatically download attachments from an IMAP mailbox) I needed simple expressions to specify which emails and which exact parts should be downloaded.

Requirements

I needed a parser for simple logical expressions that use a set of predefined variables (properties of an email), like these:

    mime = "image/jpg" & attached & ! seen

Meaning: the email part is a JPEG image, is added as an attachment, and has not been seen yet.

    name ~=".pdf" & ( from ~= "jack" | from ~= "jim" )

Meaning: all email parts whose filename contains .pdf and which are from jack or jim.

Grammar

Parsimonious implements PEG grammars, which allow very compact grammar definitions. Unlike some other parser generators, where the grammar is expressed in Python, parsimonious has its own syntax, which keeps grammar definitions short and easy to survey:

    GRAMMAR = r""" # Test grammar
    expr = space or space
    or = and more_or
    more_or = ( space "|" space and )*
    and = term more_and
    more_and = ( space "&" space term )*
    term = not / value
    not = "!" space value
    value = contains / equals / bracketed / name
    bracketed = "(" space expr space ")"
    contains = name space "~=" space literal
    equals = name space "=" space literal
    name = ~"[a-z]+"
    literal = "\"" chars "\""
    space = " "*
    chars = ~"[^\"]*"
    """

There are a couple of things to remember when creating a grammar:

A PEG grammar should avoid left recursion, so rules like

    and = expr space "&" space expr

do not work and will result in a recursion error (infinite recursion). They have to be rewritten without left recursion, for example with a repetition as in the and / more_and rules above, which can sometimes be challenging.
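To see why a left-recursive rule cannot work in a recursive-descent style parser, and what the rewritten rule looks like, here is a minimal hand-written sketch in plain Python. It does not use parsimonious; the function names (parse_term, parse_and_left, parse_and_loop) are made up for illustration, and terms are reduced to single lowercase words.

```python
import re

TERM = re.compile(r"\s*([a-z]+)\s*")


def parse_term(text, pos):
    """Match one lowercase word; return (word, new_pos) or None."""
    m = TERM.match(text, pos)
    return (m.group(1), m.end()) if m else None


def parse_and_left(text, pos=0):
    """Left-recursive form: and = and "&" term.

    The first step calls the same rule at the same position, before any
    input has been consumed, so the parser never makes progress.
    """
    return parse_and_left(text, pos)


def parse_and_loop(text, pos=0):
    """Rewritten form: and = term ( "&" term )*."""
    first = parse_term(text, pos)
    if first is None:
        return None
    terms, pos = [first[0]], first[1]
    while pos < len(text) and text[pos] == "&":
        nxt = parse_term(text, pos + 1)
        if nxt is None:
            break  # the repetition simply stops; "&" is left unconsumed
        terms.append(nxt[0])
        pos = nxt[1]
    return terms, pos


try:
    parse_and_left("a & b")
except RecursionError:
    print("left-recursive rule never terminates")

print(parse_and_loop("a & b & c"))
```

The loop in parse_and_loop plays the same role as the more_and rule in the grammar above: one mandatory term followed by zero or more "&" term pairs.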

PEG grammars are more deterministic than context-free grammars: a rule matches in exactly one way, so there is no ambiguity as in a CFG. This is ensured by the ordered choice operator /, which always selects the first alternative that matches, and by the greedy repetition operators *, + and ?, which always consume as much input as possible.
In practice this means that rules have to be designed more carefully, and in a particular way, to get the required operator precedence.
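The effect of ordered choice can be illustrated with a few hand-written parser combinators in plain Python. This is only a sketch of the idea, not how parsimonious is implemented; the helpers lit, regex, seq and choice are hypothetical names, and each parser simply returns the end position of its match or None.

```python
import re


def lit(s):
    """Parser that matches the exact string s."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def regex(pattern):
    """Parser that matches a regular expression at the current position."""
    compiled = re.compile(pattern)

    def parse(text, pos):
        m = compiled.match(text, pos)
        return m.end() if m else None
    return parse


def seq(*parsers):
    """Sequence: every parser must match, one after another."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    """Ordered choice: the first alternative that matches wins."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


name = regex(r"[a-z]+")
literal = regex(r'"[^"]*"')
equals = seq(name, lit("="), literal)

good_value = choice(equals, name)  # longer alternative is tried first
bad_value = choice(name, equals)   # name always wins, equals is unreachable

text = 'mime="image/jpg"'
print(good_value(text, 0) == len(text))  # whole input consumed
print(bad_value(text, 0))                # stops right after "mime"
```

This is why the value rule in the grammar above lists contains and equals before name: with name first, the comparison rules would never be tried.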

Only named rules can get special treatment when walking the AST (see below). I think this is specific to parsimonious: the AST also contains anonymous nodes for parts of a rule, such as sub-expressions in parentheses. This means that if I have a rule like this:

    and = term ( space "&" space term )*

I have no control over how the right-hand part is evaluated, so I split it into two rules instead.

Evaluation

Once we have the grammar, we can evaluate expressions by walking the parsed AST. Parsimonious provides nice support for this with the visitor pattern. We can create a subclass of parsimonious.NodeVisitor with visit_rulename methods for all (relevant) rules from the grammar. Each visit method receives the current node of the AST and a list of the values from its already visited (evaluated) children.
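The dispatch idea behind such a visitor can be sketched in a few lines of plain Python. This is a simplified imitation, not parsimonious's actual implementation: Node, TinyVisitor and Evaluator are hypothetical stand-ins, and the tree is built by hand instead of being produced by a parser.

```python
class Node:
    """Tiny stand-in for a parse-tree node: rule name, matched text, children."""

    def __init__(self, name, text, children=()):
        self.name = name
        self.text = text
        self.children = list(children)


class TinyVisitor:
    def visit(self, node):
        # Children are evaluated first, so each method sees finished values.
        values = [self.visit(child) for child in node.children]
        # Look up visit_<rulename>; fall back to generic_visit if missing.
        method = getattr(self, "visit_" + node.name, self.generic_visit)
        return method(node, values)

    def generic_visit(self, node, children):
        # Rules without a dedicated method just pass the last value up.
        return children[-1] if children else None


class Evaluator(TinyVisitor):
    def __init__(self, ctx):
        self._ctx = ctx

    def visit_name(self, node, children):
        return self._ctx[node.text]

    def visit_not(self, node, children):
        return not children[-1]


# Hand-built tree for "! seen": not -> value -> name
tree = Node("not", "! seen", [Node("value", "seen", [Node("name", "seen")])])
print(Evaluator({"seen": False}).visit(tree))
```

The value node has no visit_value method, so it goes through generic_visit and hands the result of name up to not unchanged.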

We will need some supporting code (an error class and a function to decode binary strings to unicode):

    import parsimonious
    import six


    class EvalError(Exception):
        def __init__(self, text, pos=0):
            super(EvalError, self).__init__(text + ' at position %d' % pos)


    def decode(s):
        # Binary strings are decoded as UTF-8; text strings pass through.
        if isinstance(s, six.binary_type):
            return s.decode('UTF-8')
        return s

So here is an example of a NodeVisitor class that evaluates our simple expressions:

    class SimpleEvaluator(parsimonious.NodeVisitor):
        def __init__(self, ctx, strict=True):
            self.grammar = parsimonious.Grammar(GRAMMAR)
            self._ctx = ctx
            self._strict = strict

        def visit_name(self, node, children):
            if node.text in self._ctx:
                val = self._ctx[node.text]
                if isinstance(val, six.string_types + (six.binary_type,)):
                    val = decode(val).lower()
                return val
            elif self._strict:
                raise EvalError('Unknown variable %s' % node.text, node.start)
            else:
                return ''

        def visit_literal(self, node, children):
            # children are: opening quote, chars, closing quote
            return decode(children[1]).lower()

        def visit_chars(self, node, children):
            return node.text

        def binary(fn):  # decorator used inside the class body, takes no self
            def _inner(self, node, children):
                if isinstance(children[0], bool):
                    raise EvalError('Variable is boolean, should not be used here %s' % node.text, node.start)
                return fn(self, node, children)
            return _inner

        @binary
        def visit_contains(self, node, children):
            return children[0].find(children[-1]) > -1

        @binary
        def visit_equals(self, node, children):
            return children[0] == children[-1]

        def visit_expr(self, node, children):
            return children[1]

        def visit_or(self, node, children):
            return children[0] or children[1]

        def visit_more_or(self, node, children):
            return any(children)

        def visit_and(self, node, children):
            return children[0] and (True if children[1] is None else children[1])

        def visit_more_and(self, node, children):
            return all(children)

        def visit_not(self, node, children):
            return not children[-1]

        def visit_bracketed(self, node, children):
            return children[2]

        def generic_visit(self, node, children):
            # Unnamed or unhandled nodes pass the last child's value up.
            if children:
                return children[-1]

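This class evaluates an expression within the context of the defined variables' values (a dictionary). Because the class sets self.grammar, the parse method inherited from NodeVisitor can parse and evaluate an expression in a single call, for example SimpleEvaluator(ctx).parse(text).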

Conclusion

Parsimonious is a nice, compact package for creating small parsers that are easy to use. I struggled a bit with the PEG grammar myself, but that is probably due to my unfamiliarity with this type of grammar. Once accustomed to it, one can create more complex grammars.