Introduction

If you want to extract data from strings (like HTML, TXT, CSV, etc.) consider Parse.
If you want to just check some user data against a specific format consider using Parse.
If you want to validate some message written in your new dialect use Parse.
Parse is useful.
Parse is quick.
This document is a very rough, show-by-example description of Parse with a few warnings thrown in. I recommend you read it in conjunction with the section on Parse (section 14) of the REBOL Core User Guide.

Comments to brett at codeconscious.com please.

The two modes of parse

Parse probably has about five modes of operation, but two general categories stand out:

Parsing character string input in a string breakapart mode.
Parsing character strings or blocks using rules (parse dialect).

Breaking strings apart is a useful function of Parse; it is a handy utility.

Parsing input according to rules is a more sophisticated use of Parse. In this use of Parse you are more likely to be interpreting the input in some way, overlaying it with new meaning. That is you have a string or block and you are perhaps identifying fields of records, or tokens of a language, or even identifying sections of a message protocol.

Parsing character string input (or binary data)

First up you have to decide whether you want Parse to handle whitespace characters for you or whether you will handle them yourself. Parse will handle the whitespace for you by default. If you specify the /all refinement, REBOL's whitespace handling will be turned off. You must use /all to get correct results if your character data is actually binary type data.

What characters constitute whitespace? Here's my list. I generated it by using a function that plugged each into parse and checked for an effect. Could be wrong but should be pretty close:

{^A^B^C^D^E^F^G^H^-^/^K^L^M^N^O^P^Q^R^S^T^U^V^W^X^Y^Z^[^\^]^!^_ ^~ }

Most of these are control characters, however you can also see the usual suspects of tab #"^-", newline #"^/" and space #" ".

Operations on strings

Parse can operate on your data in one of three ways. Here are the modes using my own terminology to describe them:

String Breakapart mode - default delimiters

With this use of parse, all you really do is supply a string and parse will break it up according to predefined delimiters. These delimiters (I believe) are:

delimiting-characters: {",;}

In this mode double quoted text is recognised: if parse encounters a double quote it will delimit the text at the next double quote instead of, say, breaking at a comma.

If you omit the /all refinement then whitespace will also be considered as a delimiter.
Doing this can be useful for working with some types of delimited files such as CSV (though to deal properly with CSV files as exported by MS Excel requires more work).
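
A minimal sketch of the breakapart behaviour, with sample strings chosen purely for illustration. The first call passes NONE as the rule so the default delimiters (and whitespace) apply. The second uses /all and supplies a comma as the only delimiter, so the spaces survive:

```rebol
>> parse "here there,everywhere; ok" none
== ["here" "there" "everywhere" "ok"]

>> parse/all "Harry Haiku, 264 River Rd., Ukiah, 95482" ","
== ["Harry Haiku" " 264 River Rd." " Ukiah" " 95482"]
```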

Parse Dialect

In this mode you give parse a rule block containing instructions to follow. These instructions allow you to utilise parse to interpret custom external formats or protocols. These instructions can be as simple or as complex as you need. A simple example would be to check some input against a postal code format. A sophisticated example is REBOL's XML parser. It uses this mode of parse to load in simple XML documents. I've used this mode of parse to interpret MIME format email messages.

The instructions are written according to the parse dialect. The instructions tell parse how to read through your input. In actual fact, the instructions describe the patterns that the input should take. Parse attempts to match the input against your patterns. Parse will return a TRUE result if your instructions accurately describe the input. If your instructions fail to describe the input (or looking at it the other way, the input fails to follow your rules) parse will return FALSE. You also have the ability to carry out normal REBOL operations as parse traverses the input and your rules.

It is very important to realise that the words of the Parse dialect are interpreted by Parse in a specific way and should be considered as being different in meaning to REBOL words when used at the console.

Through this description/tutorial thingy I'll use examples assuming a string input.

Let's start at the end

>> input-string: {}
>> parse input-string [end]
== true

Ah success! Here I am parsing an empty string. My rule says to parse "check that we are at the end". The result is of course TRUE because the string was empty to begin with.

This is similar in normal REBOL script to:

>> tail? input-string
== true

Baby steps
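
A minimal sketch of the kind of match discussed next, assuming the input string is "fox": the rule asks for the literal "fox" followed by END.

```rebol
>> input-string: "fox"
>> parse input-string ["fox" end]
== true
```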

We successfully tested that the input started with "fox" and then finished. Ok, no big deal. But reflect a moment. This is a sequence: first "fox", then END. As parse traverses the input and your rule block, it keeps track of a current position for both. So at the start, the current position in the input is at the head of the string. After the rule "fox" was matched, the current position in the input string will be directly after the "x" of "fox". In this example, this happens to be the tail of the string, so the very next match rule END will succeed.

We do not always have to supply an END in the rule block. You can omit it in the last example because Parse effectively slaps one on at the end anyway.

>> parse input-string ["fox"]
== true

While you can do this for simple examples, remember you'll likely need to add it in explicitly for more complex rules.

Ok back to the example again. In an ordinary REBOL session the above example is similar to the following:
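
One way to sketch that comparison, assuming the input is "fox". FIND/MATCH returns the series just past the match (or NONE), and the word `position` is only an illustrative name:

```rebol
>> input-string: "fox"
>> position: find/match input-string "fox"
>> all [position tail? position]
== true
```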

Note that the ordinary REBOL code examples throughout this article are provided to help you learn PARSE. There are enough important differences between the Parse examples and the ordinary code examples that you cannot always treat them as exactly equivalent.

Failures / challenges
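
A minimal sketch of a rule offering alternatives, assuming the input is "fox" and using "|" to separate three candidate matches:

```rebol
>> input-string: "fox"
>> parse input-string ["dog" | "cat" | "fox"]
== true
```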

The meaning of this is pretty obvious. Hang on though, what actually happens when Parse encounters a failure with one of the rules? Well it backtracks the input to the point it was at when the rule started. So in REBOL code what happens is actually more like this:
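
A rough sketch of that backtracking in ordinary code, assuming three alternatives "dog", "cat" and "fox". Every attempt starts again from the same place in the input, which is what the backtracking amounts to. The word `position` is an illustrative name:

```rebol
input-string: "fox"

; ANY (the ordinary function) returns the first attempt that is not NONE.
position: any [
    find/match input-string "dog"    ; each attempt restarts from input-string
    find/match input-string "cat"
    find/match input-string "fox"
]

all [position tail? position]        ; == true
```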

Now, REBOL can be pretty concise and the ANY function definitely helps in writing concise code, but you can see already that the parse dialect is looking to be better suited to matching than ordinary scripting.

Reflecting on this a bit. We have here a more interesting rule. In fact we have a compound rule. Our compound rule is composed of three sub rules. Each of the three sub rules here is very basic but they are allowed to be compound rules themselves. The basic rules perform the lowest level matching of the input, the compound rules check the overall pattern (structure/grammar) of your data.

Back to options. What about something that may or may not exist at all? Using OPT we can indicate that the dog could be black or just leave it out:
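
A minimal sketch, assuming the inputs "black dog" and "dog" and leaving whitespace handling on:

```rebol
>> parse "black dog" [opt "black" "dog"]
== true
>> parse "dog" [opt "black" "dog"]
== true
```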

Repetition - known range of occurrences

Time for some more compound rules.

Here's how to check for exactly two dogs.

>> parse "dog dog" [2 "dog"]
== true

Pretty cool eh? You can check for exactly 30 dogs in the same way. Hang on, you may object, there's a space in between the two dogs! True, but whitespace handling is in effect. If you use the /all refinement whitespace handling is not used and the space becomes a valid character to check for:

>> parse/all "dog dog" [2 "dog"]
== false

But now we don't just have two dogs, we have a dog, a space and a dog:

>> parse/all "dog dog" ["dog" #" " "dog"]
== true
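
A count can also be given as a range by writing a minimum and a maximum. A short sketch with whitespace handling back on (no /all), asking for between one and three dogs:

```rebol
>> parse "dog dog" [1 3 "dog"]
== true
>> parse "dog dog dog dog" [1 3 "dog"]
== false
```

The second call fails because three dogs are matched and a fourth is still left in the input.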

For the rest of these introductory examples I'll leave whitespace handling on.
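
When the number of occurrences is not known in advance, SOME can stand in for the count. A minimal sketch, assuming an input of three prawns:

```rebol
>> parse "prawn prawn prawn" [some "prawn"]
== true
```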

Excellent, we have some prawns but we don't know how many.
The SOME keyword means "match one or more of the following". Again it is a compound rule because I could have as easily done this if it was "raining cats and dogs":
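
A minimal sketch of that idea, with an illustrative input: SOME is applied to a block of alternatives rather than to a single string.

```rebol
>> parse "cat dog cat cat dog" [some ["cat" | "dog"]]
== true
```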

Here then also is an example of one of those REBOL words with a new meaning in the context of Parse. In ordinary REBOL ANY is a function that returns the first non-false or non-none value in the block it is given. In Parse, by contrast, ANY is a keyword that introduces a compound rule that means, "match zero, one or many of the following".
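
A short sketch of ANY as a Parse keyword, using illustrative inputs. It succeeds on an empty string (zero matches) as well as on repeated matches:

```rebol
>> parse "" [any "dog"]
== true
>> parse "dog dog" [any "dog"]
== true
```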

Repeated Repetition

Now that I've introduced repetition and compound rules, what happens if I create a compound rule made up of nested repetition rules? Hmm, tricky.

This next example puts Parse into a spin - an infinite loop. The escape key will not work - only try it if you know how to kill a process using your operating system (e.g. in NT4 use Task Manager). A version you can quit with the escape key will be given later:

input-string: {}
parse input-string [ any [ any "dog" ] ]

To understand why this infinite loop happens you need to know when the ANY rule returns success and when it completes.

Here's the major answer: ANY ALWAYS returns success.
ANY will keep calling its subrule while that subrule returns success. ANY gives up on receipt of bad news (failure) but it itself always returns success. Now if ANY always receives a success because its subrule in fact is another ANY... Well I think that explains it.

Remember OPT. It always returns success just like ANY. So putting an OPT inside an ANY is bound to lead to trouble as well.

The point then is that your repetition compound rules must be carefully written because of the possibility of creating these infinite loops. It is not a bug in REBOL, it is a consequence of having a flexible parse dialect.

Sometimes these infinite loops start only after traversing lots of other complex rules and therefore can become hard to catch. I create these loops less often now since I started considering how I want Parse's "point of attention" to move. When writing your rules consider how the input is consumed by the rules.

That's part of the reason why I've been demonstrating the REBOL code similar to the various Parse examples.
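
A variation on the looping rule, sketched here by swapping the inner ANY for SOME. This is the kind of rule the next paragraph describes as ok:

```rebol
input-string: {}
parse input-string [ any [ some "dog" ] ]    ; == true, and it terminates
```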

This last example is ok because the SOME does not always return success. If SOME does not have at least one success it returns a failure result. So you can see that at some point, given that we can assume that the input is finite, the overall rule must terminate.

Quoting Ladislav, "The dangerous rules are rules, that don't consume any input, yet they return success."

REBOL versions based on Core 2.5.3 and later have another way to handle this infinite loop scenario - the BREAK keyword. BREAK terminates the rule when it is encountered. See the REBOL change documentation for examples.

Nothing here much

Check this:

>> parse {} [none]
== true

The NONE keyword does nothing but is always successful. Other than that, you may as well forget it until you really need it. Oh and wrap a NONE within an ANY or a SOME and you get....lots and lots of wasted CPU cycles.

All these characters

Charset stands for character set. It is a bitset, which I believe makes it fast for pattern matching operations.

Let's say you only want to check that your input contains the digits 0 to 9.

digit: charset [#"0" - #"9"]

Now parse can use this directly as a pattern matching instruction. It will match one character only, of those in the set 0 - 9.

>> parse {1} [digit]
== true

Naturally enough you can use these in compound rules too:

An Australian postcode consists of 4 numeric digits so:

>> parse {2069} [4 digit]
== true

Maybe you want everything but digits:

non-digit: complement digit
>> parse {1} [non-digit]
== false

Charsets (bitsets) are sets and you can apply the set operations union, intersection, exclude, etc. on them:
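
A minimal sketch of UNION on two charsets. The words `letter` and `alphanumeric` are illustrative names, not taken from the original text:

```rebol
digit: charset [#"0" - #"9"]
letter: charset [#"a" - #"z" #"A" - #"Z"]

; Combine the two sets into one that accepts either kind of character.
alphanumeric: union digit letter

parse "a1b2" [some alphanumeric]    ; == true
parse "a1-b2" [some alphanumeric]   ; == false, "-" is in neither set
```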

This says "match 123, move to the tail, test tail". Pretty obvious we would get a true result if you think of it in these terms.
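
A rule of that shape, reconstructed as a sketch and assuming the input "123456": TO END moves the current position to the tail, and the final END tests for it.

```rebol
>> parse "123456" ["123" to end end]
== true
```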

While we're here, how about another repetition warning. The rule [to end] moves to the tail and reports success every time. Put an ANY or SOME around it and you can guess what will happen (hint: read the Repeated Repetition section repeatedly).

But I want some information from it!

Up to this point I've concentrated on the various matching functionality of Parse. Of course though you want to extract information from your data. The keyword of note for this purpose is COPY. Also of use is the ability to execute REBOL code within the parse rules and thereby set and maintain REBOL variables (e.g. counters) using that code.

Ok COPY.

Copy is really really simple really. It is a compound rule that takes two arguments: a variable and a subrule. Whatever input the subrule matches gets copied into the variable. If the subrule doesn't match anything (fails), COPY returns the failure but leaves the variable unchanged.
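
A minimal sketch of COPY, assuming the input "123abc" and a `digit` charset like the one defined earlier. The word `number` is an illustrative name:

```rebol
>> digit: charset [#"0" - #"9"]
>> parse "123abc" [copy number some digit to end]
== true
>> number
== "123"
```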

And here the subrule is to match nothing (NONE), which is always successful, so COPY copies that which was matched... Well perhaps it should have been an empty string, but this is what happens (at least in REBOL/View 1.2.1):

>> parse "123" [copy some-text none]
== false
>> some-text
== none

Bring on the code

Ordinary REBOL code can be used inside the parse dialect via the use of "(" and ")" i.e. a Paren! series:
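
A minimal sketch, assuming an input of three dogs. The paren is evaluated each time the "dog" before it matches. The word `count` is an illustrative name:

```rebol
count: 0
parse "dog dog dog" [some ["dog" (count: count + 1)]]   ; == true
count                                                   ; == 3
```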

So the upshot is you can maintain counters and take actions based on your parse rules.

Another interesting use for the Paren! is to enable the Escape key to work in the infinite loop situation described earlier, by adding one within the looping part.

Taking the earlier example and adding a Paren! to it gives:

input-string: {}
parse input-string [ any [ () any "dog" ] ]

This will spin around until you hit the Escape key (Esc).

So during development it might be useful to put print statements in these, allowing you to see what is happening and use the Esc key if necessary. Note though that it is possible this behaviour could change in later versions of REBOL.

The current index and manipulating it

Parse maintains a reference to the input. The reference is a series and so has a current index.

Some special parse dialect syntax allows you to get and set this reference. In fact you use a set-word and get-word syntax respectively.

In this example I set the word "mark" to the input series at the current index that parse has. Don't worry about the false; it is just saying we didn't get all the way through the input:
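
A sketch of both directions, with inputs chosen for illustration. The first call only records the position with a set-word. The second records it, moves it forward two characters in ordinary REBOL code, and hands it back to parse with a get-word:

```rebol
>> parse "123456" ["123" mark:]
== false
>> mark
== "456"

>> parse "1234567" ["123" mark: (mark: skip mark 2) :mark "67"]
== true
```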

To explain: first "123" is matched, then the word mark is set to the reference. Then the REBOL code between the parentheses is evaluated. This code advances the reference we hold by two characters. I return this modified reference to parse using the get-word syntax. Parse, seeing the get-word syntax, knows that it must update its reference to the one given. Finally I match the "67".

Parsing loaded values

This mode is used when the value to be parsed is actually a block, not a string. You use this mode when you have already loaded data into REBOL values. You write parse instructions in a rule block using the parse dialect in a similar way to that described for parsing strings, except that for parsing blocks the semantics are different and you have a couple more keywords to use.

This is the mode of parse that deserves the attention of anyone using REBOL. The reason is that you are free to store your data in a form that is understandable by yourself and others, yet is still computer readable.

An example that shows what can be achieved is Carl Sassenrath's stock transaction example which you can see below. Now what if "sell 300 shares at $89.08" came in via email?
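
What follows is a sketch in the spirit of that stock transaction example, not a verbatim copy of Carl's code. The word names and the starting total are assumptions. The last line shows one way text arriving from elsewhere (say, the body of an email) could be turned into a block with LOAD before parsing:

```rebol
total: $0.00    ; running balance (assumed starting value)

rule: [
    set action ['sell | 'buy]      ; which kind of transaction
    set number integer!            ; how many shares
    'shares 'at
    set price money!               ; price per share
    (
        either action = 'sell [
            total: total + (price * number)
        ][
            total: total - (price * number)
        ]
    )
]

; A transaction written directly as a block:
parse [sell 300 shares at $89.08] rule

; The same kind of transaction arriving as text:
parse load "buy 100 shares at $10.00" rule

print ["total:" total]
```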

If you study this example you will see that Carl, in a very small space, has created
a small interpreter that parses, validates and performs computations. This is
very powerful technology that is easily underestimated because it is so small and simple.

Another powerful example of this is the VID dialect of REBOL/View. VID describes in an
effective but simple way what should appear on screen. VID is actually a block using normal
REBOL values such as words and strings. The LAYOUT function of REBOL/View takes a VID
block as an argument to construct the visual objects. Layout uses parse to process the
VID specification.

Special situations

When you do NOT want to match a pattern

One situation where you might do this is when you have a sub rule that might
"consume" something needed by an enclosing rule.

I have come across this sort of problem a few times and I thank Ladislav for
showing me a solution.

For my example, I'll parse a block rather than text but the concept still
applies. I want to parse the following block, and print
out every word, but if I encounter a "|" I'll print out the text
"**********":

my-block: [ the quick brown fox | jumped | over the lazy]

This next bit of code will not work. If you try it you will see that there
are no "*"s printed, instead you will see the "|":
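
A sketch of such a first attempt. PHRASE and SINGLE-WORD are the rule names used in the discussion, while PHRASES and ITEM are assumed names:

```rebol
my-block: [ the quick brown fox | jumped | over the lazy]

single-word: [set item word! (print mold item)]
phrase: [some single-word]
phrases: [phrase any ['| (print "**********") phrase]]

parse my-block phrases
```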

The thing to note is that "|" is a word too. Therefore the "|" is "consumed"
by the rule called SINGLE-WORD. So one way
to solve this is to give SINGLE-WORD some indigestion (make it fail) when it
encounters a "|". To do this I will use a dynamic rule - a rule that is
modified as parse is executing.

To force a rule to fail, make sure it cannot match anything any more. A way
to ensure this is to try a skip after the end of the input. This can never
work: if we are not at the end, the END will fail; if we are at the end, then
the skip will fail. So this rule is guaranteed to fail every time:

always-fails: [end skip]
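
A quick check of that claim on an empty and a non-empty input (both inputs are just illustrations):

```rebol
>> always-fails: [end skip]
>> parse "" always-fails
== false
>> parse "abc" always-fails
== false
```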

Using this I now wrap SINGLE-WORD with a rule I call WORD-EXCEPT-BAR. The
purpose of this new rule is to fail if it finds the "|" word; otherwise it
goes ahead and runs SINGLE-WORD. I also need to modify PHRASE to call
WORD-EXCEPT-BAR. The dynamic rule I mentioned earlier is called WEB. Here
are the rules, with the complex one split over multiple lines to improve
readability:
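
A sketch of how these rules could be written, not necessarily the exact original. It assumes an empty block as the "succeed" value for WEB, and PHRASES and ITEM are assumed names. When WORD-EXCEPT-BAR fails, the "|" it looked at is handed back to the enclosing rule:

```rebol
my-block: [ the quick brown fox | jumped | over the lazy]
always-fails: [end skip]
web: []                              ; dynamic rule, reset on every attempt

single-word: [set item word! (print mold item)]

word-except-bar: [
    [
        '| (web: always-fails)       ; saw a bar: arrange for failure
        | (web: [])                  ; anything else: let the match proceed
    ]
    web
    single-word
]

phrase: [some word-except-bar]
phrases: [phrase any ['| (print "**********") phrase]]

parse my-block phrases
```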

The BREAK keyword

From RT's changes document:

When the BREAK word is encountered within a rule block, the block is
immediately terminated regardless of the current input pointer.
Expressions that follow the BREAK within the same rule block will not
be evaluated.

In this example the SOME rule is exited early:

>> parse "X" [some [ (print "*Break*") break] "X"]
*Break*
== true

Here again the SOME rule is exited early just like the previous
example. In this case the rule that SOME is processing is referred to by a word:
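
A sketch of that case, using RULE as an illustrative word and showing the result the text above describes:

```rebol
>> rule: [(print "*Break*") break]
>> parse "X" [some rule "X"]
*Break*
== true
```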

This case produces an infinite loop, because the BREAK is within a sub-rule
of the rule that SOME is processing. The BREAK does not affect success/failure
status or the input pointer - it just exits a rule early:
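
A sketch of that situation, again using RULE as an illustrative word. The extra block around RULE makes it a sub-rule of what SOME is processing. Only try it if you are prepared to interrupt it:

```rebol
rule: [(print "*Break*") break]

; WARNING: as described above, this is expected to loop forever.
parse "X" [some [rule] "X"]
```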

Related toolset

I have written "Parse Analysis Toolset" to help learn and analyse the way Parse works. The Explain-parse function of the toolset should help with learning Parse. The script has related documentation. You can find the script and a link to the documentation at:

Comments

Parse is a key component of REBOL. REBOL is promoted as a messaging
language. Messages can come in many formats (syntaxes). Parse allows
you to define the syntax of a message so that you can interpret the message and transform
it to something else or act on it directly. That may sound complex, but it isn't really.

What are messages? Lots of things can be considered as messages. Basically if you can
put it into a file and the format of the file has some rule to it, then I think you have
a message. You don't have to put it in a file though to use parse. REBOL's networking
functions use parse to interpret many of the internet protocols that REBOL provides
access to.

With REBOL you can define a mini-language (a language designed for a particular purpose).
Parse helps you to validate and process such mini-languages. You might want to design
a mini-language for creating web pages on your internet site. Or perhaps for controlling
a special device you have attached to your computer.

Even if you don't go this far, parse's delimit mode will be useful for you just as a
string-breakapart utility.