If you’ve ever done any serious programming you’ll have run into something called regular expressions:

... (abbreviated regex or regexp and sometimes called a rational expression) is a sequence of characters that define a search pattern, mainly for use in pattern matching with strings, or string matching, i.e. "find and replace"-like operations. … Regular expressions are so useful in computing that the various systems to specify regular expressions have evolved to provide both a basic and extended standard for the grammar and syntax; modern regular expressions heavily augment the standard.

In the above example, $string1 is the string we’re searching and “m/...../” is the regex we’re searching for. Each “.” matches any single character except a newline, and because there are five of them, the regex matches any five consecutive characters in the input string. The input string contains more than five characters, so the regex (which needs only five) finds a match, the result is true, and the print statement is executed.
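As a rough illustration of the five-dot match described above, here is a minimal Python sketch. The `$string1` and `m/.../` syntax suggests the original was Perl; the input string used here is a made-up stand-in, since the original string isn't shown.

```python
import re

# Hypothetical input string; the original example's string is not shown.
string1 = "Hello World"

# Five dots: any five consecutive characters, none of which is a newline.
five_chars = re.compile(r".....")

match = five_chars.search(string1)
if match:
    # The search succeeds as soon as five matchable characters are found.
    print(f"{string1!r} contains at least five characters: {match.group()!r}")

# A string with fewer than five characters cannot match.
too_short = five_chars.search("abcd")
```

Note that `search` stops at the first five characters it can match; the rest of the string is simply ignored.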

In that example you get a sense of how powerful regexes are, but there’s a downside: the specification of a regex for anything beyond simple searches gets extremely complicated. For example, the regex:

\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b

… will match any IP address just fine, but will also match 999.999.999.999 as if it were a valid IP address. To restrict all 4 numbers in the IP address to 0..255, you can use the following regex. …
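To make the over-matching concrete, the sketch below runs the simple pattern shown earlier against a made-up sample and compares it with a range-restricted variant. The restricted pattern is one common way to limit each number to 0..255; it is an assumption for illustration and not necessarily the regex the quoted source goes on to give.

```python
import re

# The simple pattern from the text: four groups of one to three digits.
SIMPLE_IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

# One common way to restrict a number to 0..255 (an illustrative assumption).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
STRICT_IP = re.compile(rf"\b(?:{OCTET}\.){{3}}{OCTET}\b")

# Made-up sample text.
sample = "Servers: 10.0.0.1, 999.999.999.999 and 192.168.1.254."

simple_hits = SIMPLE_IP.findall(sample)
strict_hits = STRICT_IP.findall(sample)

print("simple:", simple_hits)
print("strict:", strict_hits)
```

The strict pattern is already several times longer than the simple one, which is the trade-off the text describes.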

But even that regex has problems: it rejects IP addresses with following port numbers (such as ":21") because it expects the IP address to be surrounded by white space. If you wanted to allow for that as well as following punctuation such as "?" and "." but not "-" or "&", then you're going to have to work a little harder. Oh, and what about preceding punctuation such as ":"?
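One way to "work a little harder" is to replace the surrounding-whitespace requirement with lookarounds that spell out what may come before and after the address. The sketch below is only one possible reading of the requirements above (a trailing port, "?" or "." allowed; "-" and "&" rejected; preceding punctuation such as ":" allowed). The names `OCTET`, `IP_IN_CONTEXT` and `find_ips` are invented for this example.

```python
import re

# Range-restricted number, 0..255 (illustrative assumption).
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"

IP_IN_CONTEXT = re.compile(
    r"(?<![\w.])"                  # not glued to a preceding word character or dot
    rf"(?:{OCTET}\.){{3}}{OCTET}"  # the four dotted numbers
    r"(?=[?.]?(?:\s|$)|:\d)"       # optional "?" or "." then space/end, or a ":port"
)

def find_ips(text):
    """Return every address in `text` that appears in an allowed context."""
    return IP_IN_CONTEXT.findall(text)

print(find_ips("ftp to 10.0.0.1:21 now"))   # trailing port
print(find_ips("host:10.0.0.1."))           # preceding ":" and trailing "."
print(find_ips("a=10.0.0.1&b"))             # trailing "&" is rejected
```

Even this version leaves open questions (commas, brackets, IPv6), which illustrates how quickly the hand-written pattern grows.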

... a working computer program from a high-level problem statement of a problem. Genetic programming starts from a high-level statement of “what needs to be done” and automatically creates a computer program to solve the problem.

Not surprisingly, genetic programming has been used to "evolve" complex regular expressions from data and the Web site Regex Generator++ demonstrates the technique. This service, created by Prof. Alberto Bartoli, Giorgio Davanzo, Andrea De Lorenzo, Prof. Eric Medvet, and Enrico Sorio, researchers at the University of Trieste, “evolves” regexes that have the best fitness in terms of effectively matching strings based on examples. They explain the methodology used in their paper Automatic Synthesis of Regular Expressions from Examples.
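To give a feel for what "fitness" can mean here, the sketch below scores a few candidate regexes against labeled examples and keeps the best one. It is a deliberately simplified stand-in: it uses an F1 score over extracted strings, which is not the fitness function from the paper, and it only ranks fixed candidates rather than evolving them. The example texts and candidate patterns are made up.

```python
import re

def f1_fitness(pattern, examples):
    """Score a regex by how well its matches agree with the labeled targets.

    `examples` is a list of (text, expected_matches) pairs.
    """
    regex = re.compile(pattern)
    true_pos = false_pos = false_neg = 0
    for text, expected in examples:
        remaining = list(expected)
        for found in (m.group() for m in regex.finditer(text)):
            if found in remaining:
                remaining.remove(found)   # a wanted string was extracted
                true_pos += 1
            else:
                false_pos += 1            # extracted something unwanted
        false_neg += len(remaining)       # wanted strings that were missed
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Made-up training examples: text plus the substrings we want extracted.
examples = [
    ("ping 10.0.0.1 then 8.8.8.8", ["10.0.0.1", "8.8.8.8"]),
    ("version 1.2.3 on 192.168.1.1", ["192.168.1.1"]),
]

# Made-up candidate population; a real system would generate and mutate these.
candidates = [r"\d+", r"\d+\.\d+\.\d+", r"\d+\.\d+\.\d+\.\d+"]

best = max(candidates, key=lambda p: f1_fitness(p, examples))
print("best candidate:", best)
```

A genetic-programming system would repeat this scoring over many generations, keeping and recombining the higher-scoring candidates.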

To demonstrate their technique the team has created a service you can experiment with. To begin the process, you provide training examples that contain the target strings, which you identify by highlighting (make sure you use plain text; otherwise it will barf).

You need at least two training examples with at least 25 matches identified in each (my test training sets for IP addresses are below). You then click “Evolve!” and about 15 to 20 minutes later, a JavaScript-compatible regex is displayed (regexes come in different formats for different languages and platforms).

Those of you with good regex-fu will immediately notice that it isn’t optimized, it’s really hard to understand, and, when tested at Regular Expressions 101, it doesn’t find all the IP addresses in the first training example.

The evolved regex found only 16 of the 25 matches, so more evolving is needed or, quite possibly, better examples (I kludged up my examples; for serious applications you’d be advised to consider your examples very carefully, as flawed data will cause the resulting regexes to be less fit in real-world contexts). I tried the example data with the first IP-matching regex above; it scored 46 matches but 16 weren’t valid (such as 8.8.888.9 and 899.0.0.1). Using the second regex, 28 matches were made, so it still didn't capture all of the matches.
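A tally like the one above (total matches versus invalid ones) can be reproduced mechanically. The sketch below runs the first, loose pattern over some text and checks each hit with Python's standard `ipaddress` module. The sample text is made up and far smaller than the training sets discussed here, and `split_matches` is a name invented for this example.

```python
import ipaddress
import re

# The first, loose pattern from earlier in the article.
LOOSE_IP = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

def split_matches(text):
    """Return (valid, invalid) lists of the loose pattern's matches."""
    valid, invalid = [], []
    for candidate in LOOSE_IP.findall(text):
        try:
            ipaddress.IPv4Address(candidate)  # raises ValueError if out of range
        except ValueError:
            invalid.append(candidate)
        else:
            valid.append(candidate)
    return valid, invalid

# Made-up sample containing the two invalid examples mentioned in the text.
sample = "good: 8.8.8.8 10.1.2.3  bad: 8.8.888.9 899.0.0.1"
valid, invalid = split_matches(sample)
print(f"{len(valid) + len(invalid)} matches, {len(invalid)} invalid")
```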

What this technique demonstrates (although rather better through the examples on the Regex Generator++ site than my example) is that complex algorithms that humans would find very hard to create can be quickly and easily generated and then evolved to perform better.

De Lorenzo, Medvet, Bartoli, Automatic String Replace by Examples, ACM Genetic and Evolutionary Computation Conference (GECCO), 2013, Amsterdam (Netherlands)—the string replace functionality described in this paper is based on an extension of the work showcased on this web app; it is currently not exposed on the web.

Note: The Regex Generator++ service can be very slow to respond at times; it’s running on hardware that appears to often be seriously overloaded. Also note that all input MUST be plain text; the service doesn’t check for illegal content and will throw unexplained errors.