Pyparseltongue: Parsing Text with Pyparsing

Text Parsing Tools

Some people, when confronted with a problem, think “I know, I’ll use regular expressions.” Now they have two problems.

–Jamie Zawinski, 1997

I don’t actually agree with Mr. Zawinski – I’ve been using regular expressions successfully for over two decades, and I have done a lot of useful work with them. However, I do admit that they are cryptic and tricky. Here is a regular expression to parse a string like “Jan. 15, 2014” or “Aug. 27, 1990”:

\b([A-Z][a-z]{2})\.?\s+(\d+),\s+(\d{4})

If you want to retrieve the month, day, and year by name rather than numeric index, it would look like this:

\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})
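As a quick illustration, here is how the named-group version might be used from Python's re module. The sample sentence and the DATE_RE variable name are my own assumptions, not part of the original pattern.

```python
import re

# The named-group pattern from above; DATE_RE is just an illustrative name.
DATE_RE = re.compile(
    r"\b(?P<MONTH>[A-Z][a-z]{2})\.?\s+(?P<DAY>\d+),\s+(?P<YEAR>\d{4})"
)

# search() scans the whole string for the first match.
match = DATE_RE.search("Released on Aug. 27, 1990 in Atlanta")
if match:
    # Groups are retrieved by name instead of by numeric index.
    print(match.group("MONTH"), match.group("DAY"), match.group("YEAR"))
```

Calling match.groupdict() returns all three named groups at once as a dictionary.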

As it turns out, there are other ways to parse text. You could create a tailor-made parser that iterates over the text character by character, with logic for finding your target. This is tedious, error-prone, time-consuming, and no one does it.

You could use string methods such as split(), startswith(), endswith(), etc., to grab pieces and then analyze them. This is likewise tedious, error-prone, and time-consuming, but some people do take this route because they are scared of regular expressions.

A better option is the pyparsing module. This article describes how to apply pyparsing to everyday text parsing tasks.

About Pyparsing

Pyparsing (https://pyparsing.wikispaces.com/) is a Python module for creating text parsers. It was developed by Paul McGuire. Install it with pip (pip install pyparsing), which works for most versions of Python. (Note: The Anaconda Python Bundle, highly recommended for serious Python developers, includes pyparsing.)

First, you create a grammar to specify what should be matched. Then, you call a parsing function from the grammar and it returns text tokens while automatically skipping over white space. Pyparsing provides many functions for specifying what should be matched, how items should repeat, and more.
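To make that workflow concrete, here is a minimal sketch of a pyparsing grammar for the same kind of date string the regular expression above handles. The variable names are my own, and the camelCase method names (parseString, asList) follow the style used in this article; newer pyparsing releases also offer snake_case equivalents such as parse_string.

```python
from pyparsing import Optional, Suppress, Word, alphas, nums

# Grammar: a month word, an optional period, a day, a comma, a 4-digit year.
month = Word(alphas)
day = Word(nums)
year = Word(nums, exact=4)
date = month + Optional(Suppress(".")) + day + Suppress(",") + year

# Whitespace between tokens is skipped automatically, so uneven
# spacing in the input does not matter.
tokens = date.parseString("Aug.   27,1990")
print(tokens.asList())  # ['Aug', '27', '1990']
```

Note that the grammar never mentions white space; pyparsing skips it between tokens by default.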

Note: the examples are written with Python 3, but should work identically in Python 2 if you convert the print() function back into the print statement.

Defining a Grammar

The first step in using Pyparsing is to define a grammar. A grammar defines exactly what the target text can contain. The best way to do this is in a “top-down” manner, specifying the entire target, then refining what each component means until you get down to literal characters.

The usual notation for grammars is called Backus-Naur Form, or BNF for short. You don’t have to worry about following the rules exactly with BNF, but it is convenient for describing how things fit together.

The basic form is

symbol ::= expression

This means that symbol is composed of the parts specified in the expression. For instance, a person’s name can be specified as

This means that a name consists of a first name, an optional middle name (brackets indicate optional components), and a last name. A first name consists of one or more alphabetic characters, as does a last name. The plus sign means “one or more”. The pipe symbol means “or”. Pyparsing has predefined symbols for letters, digits, and other common sets of characters.
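The description above might translate into pyparsing roughly as follows. This is only a sketch: the BNF in the comments is my reconstruction from the prose, and the middle-initial alternative is an assumption added to show the pipe symbol at work. Because pyparsing does not backtrack out of an optional match, the sketch uses FollowedBy so that a middle name is only accepted when a last name still follows it.

```python
from pyparsing import Combine, FollowedBy, Optional, Word, alphas

# Hypothetical BNF reconstructed from the prose:
#   name        ::= first_name [middle_name] last_name
#   first_name  ::= alpha+
#   middle_name ::= alpha "." | alpha+
#   last_name   ::= alpha+
first_name = Word(alphas)
last_name = Word(alphas)

# A middle name is either an initial such as "Q." or a full word.
initial = Combine(Word(alphas, exact=1) + ".")
middle_name = initial | Word(alphas)

# Accept a middle name only if a last name comes after it.
name = first_name + Optional(middle_name + FollowedBy(last_name)) + last_name

for text in ["John Q. Public", "Guido van Rossum", "Paul McGuire"]:
    print(name.parseString(text).asList())
```

Word(alphas) is the pyparsing spelling of “one or more alphabetic characters”, Optional() plays the role of the brackets, and the | operator is the pipe.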

A Few More Niceties

There are four functions that you can call from a parser to do the actual parsing.

parseString – parses input text from the beginning; ignores extra trailing text.
scanString – looks through input text and generates matches; similar to re.finditer().
searchString – like scanString, but returns a list of tokens.
transformString – like scanString, but specifies replacements for tokens.
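The four functions can be compared side by side on one small input. This is a minimal sketch; the sample text and the doubling action are made up for illustration.

```python
from pyparsing import Word, nums

number = Word(nums)
text = "10 apples, 20 pears, 30 plums"

# parseString: matches at the beginning and ignores the trailing text.
print(number.parseString(text).asList())  # ['10']

# scanString: a generator yielding (tokens, start, end) for each match.
for tokens, start, end in number.scanString(text):
    print(tokens[0], start, end)

# searchString: collects every match into a list of token lists.
print(number.searchString(text).asList())  # [['10'], ['20'], ['30']]

# transformString: rewrites each match using the parser's parse action.
doubled = Word(nums).setParseAction(lambda t: str(int(t[0]) * 2))
print(doubled.transformString(text))  # 20 apples, 40 pears, 60 plums
```

By default parseString does not complain about leftover input; passing parseAll=True makes it require that the whole string match.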

Let’s say you had a configuration file that looked like this:

sample.cfg

city=Atlanta
state=Georgia
population=5522942

To parse a string in the format “KEY=VALUE”, there are 3 components: the key, the equals sign, and the value. When parsing such an expression, you don’t really need the equals sign in the results. The Suppress() function will parse a token without putting it in the results.
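A minimal sketch of that three-part grammar, with the equals sign suppressed. The variable names are assumptions of mine.

```python
from pyparsing import Suppress, Word, alphanums, alphas

key = Word(alphas)
equals = Suppress("=")  # matched, but left out of the results
value = Word(alphanums)
pair = key + equals + value

print(pair.parseString("city=Atlanta").asList())  # ['city', 'Atlanta']
```

Without Suppress(), the result would contain three tokens, with '=' in the middle.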

To make it a little easier to access individual tokens, you can provide names for the tokens, either with the setResultsName() function, or by just calling the parser with the name as its argument, which can be done when the parser is defined. Assigning names to tokens is the preferred approach.
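Here is a sketch of the named-token approach applied to the sample configuration. To keep it self-contained, the contents of sample.cfg are embedded as a string rather than read from disk, and the variable names are assumptions.

```python
from pyparsing import Suppress, Word, alphanums, alphas

# Calling a parser with a string is shorthand for setResultsName().
key = Word(alphas)("key")
value = Word(alphanums)("value")
pair = key + Suppress("=") + value

# Stand-in for the contents of sample.cfg.
config_text = """\
city=Atlanta
state=Georgia
population=5522942
"""

settings = {}
for line in config_text.splitlines():
    result = pair.parseString(line)
    # Named tokens are available as attributes of the result.
    print("{0} is {1}".format(result.key, result.value))
    settings[result.key] = result.value
```

The named form reads better than result[0] and result[1], and it keeps working if the grammar later gains extra tokens.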

How to Parse a URL

URLs are, of course, frequently used in everyday life. Without them, the Internet would be a vast mishmash. Oh, wait. It already is. Well, without URLs, the vast mishmash would have no directional signage. In this section, I’ll show you how to parse a complete URL.
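What follows is a simplified sketch of a URL grammar, not the original listing. The component names, the short list of schemes, and the character sets allowed in each part are all assumptions chosen to keep the example small; a production parser would need to follow the URI specification more closely.

```python
from pyparsing import Optional, Suppress, Word, alphanums, nums, oneOf

# Only a few common schemes are listed here.
scheme = oneOf("http https ftp file")("scheme")
host = Word(alphanums + "-.")("host")
port = Suppress(":") + Word(nums)("port")
# A path starts with "/" and continues with path characters.
path = Word("/", alphanums + "/-._~%")("path")
query = Suppress("?") + Word(alphanums + "=&-._~%+")("query")
fragment = Suppress("#") + Word(alphanums + "-._~%")("fragment")

url = (
    scheme
    + Suppress("://")
    + host
    + Optional(port)
    + Optional(path)
    + Optional(query)
    + Optional(fragment)
)

parts = url.parseString(
    "https://www.python.org:8080/downloads/release?version=3&os=linux#files",
    parseAll=True,
)
for field in ("scheme", "host", "port", "path", "query", "fragment"):
    print(field, "=", parts[field])
```

Because pyparsing skips white space between tokens by default, this sketch would also accept stray spaces between components; a stricter version could wrap the expression in Combine() to require that the pieces be contiguous.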

Taking Action

Any parser (including the individual parsers that make up the “main” parser) can have an action, that is, a callback function, associated with it. When the parser matches, it calls the function with a list of the scanned tokens. If the function returns a list of tokens, it replaces the original tokens. If it returns ‘None’, the tokens are not modified. This can be used to convert numeric strings into actual numbers, to clean up and normalize names, or to completely replace or delete tokens. Here is a parser that scans a movie title starting with “A Fistful of “, and uppercases the word that comes next:
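A minimal sketch of such a parser follows. It is not the original listing: the function name shout and the overall structure are assumptions, and it relies on pyparsing's default white-space skipping rather than on Combine() and White().

```python
from pyparsing import Literal, Word, alphas, nums

def shout(tokens):
    """Parse action: replace the matched word with its uppercase form."""
    return [tokens[0].upper()]

prefix = Literal("A Fistful of")
word = Word(alphas).setParseAction(shout)
title = prefix + word

print(title.parseString("A Fistful of dollars").asList())
# ['A Fistful of', 'DOLLARS']

# The same mechanism converts numeric strings into actual numbers.
integer = Word(nums).setParseAction(lambda t: int(t[0]))
print(integer.parseString("42")[0] + 1)  # 43
```

The action is attached to the individual Word parser, so only that token is changed; the prefix passes through untouched.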

Conclusion

Pyparsing is a mature, powerful alternative to regular expressions for parsing text into tokens and retrieving or replacing those tokens.

Pyparsing can parse things that regular expressions cannot, such as nested fields. It is really more similar to traditional parsing tools such as lex and yacc. In other words, while you can look for tags and pull data out of HTML with regular expressions, you couldn’t validate an HTML file with them. However, you could do it with pyparsing.
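As a small illustration of nested fields, pyparsing ships a helper called nestedExpr for delimited, arbitrarily nested content. The sample input below is made up.

```python
from pyparsing import nestedExpr

# Matches parenthesized groups nested to any depth.
nested = nestedExpr("(", ")")
tree = nested.parseString("(a (b c) (d (e)))")

# Each level of parentheses becomes a nested list.
print(tree.asList())  # [['a', ['b', 'c'], ['d', ['e']]]]
```

A single regular expression cannot keep track of arbitrary nesting depth in this way.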

ii. Why didn’t I say URI, which is more technically correct? Because “URL” is the most common way to refer to an address like “http://www.python.org”. URIs are a little more generic than URLs, and I wanted to keep things simple.

iii. For brevity, I only included a few common values for the scheme, leaving out nearly all of the over 200 scheme values registered with the IANA.

2 Responses to "Pyparseltongue: Parsing Text with Pyparsing"

Here is a nice bonus for defining results names. In your little config sample, you used this statement to print out the key/value pairs:

print("{0} is {1}".format(result.key, result.value))

Since pyparsing’s ParseResults class qualifies as a mapping, you can also write this:

print("{key} is {value}".format(**result))

(This will give you a KeyError if the named item is not present, though, whereas your use of result.key and result.value will always succeed. Why? __getattr__ for undefined names in a ParseResults will return '' (an empty string), but __getitem__ raises a KeyError.)

I also recommend that people use the ParseResults.dump() method to show the list of items, followed by a nested bullet list of the various named elements that are available in the parsed structure. Extending your example:

print(result.dump())

prints:

['population', '5522942']
- key: population
- value: 5522942

This would also simplify your testing code for url_parse.

On your movie title example, you have a minor typo in prefix, it should read:

prefix = 'A Fistful of' + White()

I’m never fond of having to include White() expressions in my parsers, although sometimes there is no avoiding it – I think here you are doing it to satisfy Combine’s default requirement that all given expressions be contiguous, and that they will be returned as a single string. Your solution is perfectly valid, but let me present an alternative:

By Letsika January 18, 2017 - 8:52 pm

I am using the pyparsing module for my project. First, I wrote a BNF for the C++ programming language; second, I implemented that BNF in Python, specifically with pyparsing. Now I have a problem parsing C++ source code: when the input does not match the grammar, the program terminates and does not continue checking for other mismatches. If the parsed source code is correct, it runs to completion. How can I list all the places where there is a mismatch?