I've looked into Parse::RecDescent and it seems to be ideal, but it's a complicated module and none of the tutorials I've looked at have an example dealing with a nested grammar. If I could get it to work, though, it looks like it would be easier to extend than the other solutions I had in mind, which involve a loop and either a regular expression for finding childless blocks or a floating reference that moves up and down the data structure as we process each line. (I've done those sorts of things before, and the code was always unreadable afterwards.)

However you decide to translate the format, you will want a good description of it, which is what a grammar provides. Once you have a grammar, converting it into a form suitable for P::RD should be fairly straightforward.

From your example format, a coarse version of the grammar would be something like

I enjoy using YAML for configuration information or other external data. Its advantages are that it is easy to read and write with any text editor, the syntax is fairly intuitive (it looks a lot like outlines done in email), and there are decent parsers and emitters for many common languages, including Perl (YAML).

One of the very nice things about YAML is that it is a direct representation of scalar, array and hash data structures, nested arbitrarily. This means that the language is rich enough to represent any data structure that you can build in Perl (or Python, Ruby, JavaScript, PHP, etc.). It also means that you don't have to perform interesting data contortions (or use objects, tree structures, etc.), as you might with XML. Rather, you use the structure directly. It's an AoHoAoH, or whatever you like.

The two tokens 'page foo' generate an event new_type("page", "foo"), which creates a new element as a child of the element at the top of a stack.

The { token pushes the last child of the element at the top of the stack onto the stack.

The } token pops an element from the stack.

Anything that is not recognized as a "new thing" structure ((\w+)\s+(?:(\w+)\s+)?\{) is gobbled up and passed to the 'character_data' event; in your case, that is probably one event per line.

The event handler has a 'root' element predefined at the top of the stack.
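To make those rules concrete, here is a minimal sketch of the decoupled version. Everything not mentioned above is my own assumption: the TreeHandler class, the open_block and close_block method names, the hash layout of an element, and the sample input (I'm guessing at your format from the 'page foo {' tokens). The tokenizer only emits events; the handler alone knows about the stack.

```perl
use strict;
use warnings;

# Hypothetical handler class: builds a tree of elements from parser events.
package TreeHandler;

sub new {
    my ($class) = @_;
    my $root = { type => 'root', name => undef, children => [], text => [] };
    # The stack starts with the predefined root element on top.
    return bless { root => $root, stack => [$root] }, $class;
}

# 'page foo' creates a new child of the element on top of the stack.
sub new_type {
    my ($self, $type, $name) = @_;
    my $elem = { type => $type, name => $name, children => [], text => [] };
    push @{ $self->{stack}[-1]{children} }, $elem;
}

# '{' makes the last child of the top element the new top.
sub open_block {
    my ($self) = @_;
    push @{ $self->{stack} }, $self->{stack}[-1]{children}[-1];
}

# '}' pops back to the enclosing element.
sub close_block {
    my ($self) = @_;
    die "unbalanced '}'\n" if @{ $self->{stack} } < 2;
    pop @{ $self->{stack} };
}

# Anything else is character data for the current element.
sub character_data {
    my ($self, $text) = @_;
    push @{ $self->{stack}[-1]{text} }, $text;
}

package main;

# Tokenizer: knows nothing about the tree, it only emits events.
sub parse_events {
    my ($input, $handler) = @_;
    for my $line (split /\n/, $input) {
        $line =~ s/^\s+|\s+$//g;          # trim indentation
        next if $line eq '';
        if ($line =~ /^(\w+)\s+(?:(\w+)\s+)?\{$/) {
            $handler->new_type($1, $2);   # name ($2) may be undef
            $handler->open_block;
        }
        elsif ($line eq '}') {
            $handler->close_block;
        }
        else {
            $handler->character_data($line);
        }
    }
    return $handler->{root};
}

# Assumed sample input, not the original poster's data.
my $sample = <<'END';
page foo {
    What is your name?
    section {
        What is your quest?
    }
}
END

my $tree = parse_events($sample, TreeHandler->new);
print $tree->{children}[0]{name}, "\n";
```

Because the handler is a separate object, you could swap in one that prints XML or YAML instead of building a tree without touching the tokenizer. Note that this sketch keeps text and child elements in separate arrays, so their relative order is lost; merge them into one list if that matters for you.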

Use something like the event parser to convert the language with no state into XML or YAML or whatever, and use an existing parser for that.

Use (??{ }) in regexes in a similar manner to the event parser handler; the construct lets you nest expressions. See perlre for some devious tricks you can do with it. /msg me if you would like me to post an example.
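For the curious, a small sketch of the (??{ }) idea, modelled on the balanced-parentheses example in perlre but using braces. The variable name $block and the sample string are mine, and this only shows how a pattern can match nested blocks; it does not build a data structure.

```perl
use strict;
use warnings;

# A package variable, so the pattern can refer to itself at match time.
our $block;
$block = qr/
    \{                      # opening brace
    (?:
        (?> [^{}]+ )        # run of non-brace text, no backtracking
      |
        (??{ $block })      # a nested block, matched by this same pattern
    )*
    \}                      # matching closing brace
/x;

# Assumed sample input, flattened onto one line for brevity.
my $text = 'page foo { q1 { a b } q2 { c } } trailing';

if ($text =~ /(\w+)\s+(\w+)\s+($block)/) {
    print "type=$1 name=$2\n";
    print "body=$3\n";
}
```

The third capture holds the whole balanced block, inner braces included, so you would still have to recurse into it yourself to get at the children.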

Update: it's done. It was fun, but don't use it. Someone below implemented the event parser I was talking about, just not in a decoupled OO kind of way.

My initial reaction would be that pagination is a secondary concern, and that the list of questions is the root of the matter, so I'd start with an array of hashes -- or, if the index-naming of the questions is "non-linear" or "semantic" in some way (i.e. not just an ordered list, but a set of distinctly named entities), then make it a hash of hashes.

In any case, the "outermost, primary" unit of organization is the "question", and features of each question are simply:

its position in the sequence (or its ID/name, which is presumably sortable in some way)
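A rough sketch of the two shapes I have in mind. The field names (text, page), the question IDs and the sample questions are invented for illustration; substitute whatever features your questions really have.

```perl
use strict;
use warnings;

# Array of hashes: a question's position in the array is its sequence number.
my @questions = (
    { text => 'What is your name?',             page => 'intro'   },
    { text => 'What is your quest?',            page => 'intro'   },
    { text => 'What is your favourite colour?', page => 'details' },
);

# Hash of hashes: use this instead when questions have distinct, sortable IDs.
my %question_for = (
    q010 => { text => 'What is your name?',  page => 'intro' },
    q020 => { text => 'What is your quest?', page => 'intro' },
);

# Pagination is derived from the questions, not the other way round.
my %pages;
push @{ $pages{ $_->{page} } }, $_->{text} for @questions;

for my $page (sort keys %pages) {
    printf "%s: %d question(s)\n", $page, scalar @{ $pages{$page} };
}
for my $id (sort keys %question_for) {
    printf "%s: %s\n", $id, $question_for{$id}{text};
}
```

Either way, the page is just one more attribute of a question, and a per-page view can be rebuilt whenever you need it.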

As for parsing the input text to fill that structure, there are numerous ways, and Parse::RecDescent would certainly do it (but it might be overkill -- other ways would suffice and be easier if you're really green with P::RD).

Yes -- as long as your format doesn't have any more complexity that you haven't told us about yet.

There's a general approach to tree-building that's applicable here: Keep a stack of your "active" containers, and every time you find a line with a "{" on the end of it, push a new container onto the stack. On lines with a "}", pop the active item off the stack. Every other piece of data you find gets pushed into whichever container is currently active.
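A minimal sketch of that loop, assuming a format where "type [name] {" opens a container, "}" on its own line closes it, and every other non-blank line is a plain data item. The function names build_tree and outline, the hash keys and the sample text are my own inventions, not anything from your data.

```perl
use strict;
use warnings;

sub build_tree {
    my ($text) = @_;
    my $root  = { type => 'root', items => [] };
    my @stack = ($root);                  # $stack[-1] is the active container

    for my $line (split /\n/, $text) {
        $line =~ s/^\s+|\s+$//g;          # trim indentation
        next if $line eq '';

        if ($line =~ /^(.*?)\s*\{$/) {
            my ($type, $name) = split ' ', $1, 2;
            my $box = { type => $type, name => $name, items => [] };
            push @{ $stack[-1]{items} }, $box;   # attach to the parent
            push @stack, $box;                   # and make it active
        }
        elsif ($line eq '}') {
            die "unexpected '}'\n" if @stack == 1;
            pop @stack;
        }
        else {
            push @{ $stack[-1]{items} }, $line;  # plain data item
        }
    }
    die "missing '}'\n" if @stack > 1;
    return $root;
}

# Flatten the tree back into an indented outline, mostly for checking.
sub outline {
    my ($node, $depth) = @_;
    my $pad = '  ' x $depth;
    my @lines;
    for my $item (@{ $node->{items} }) {
        if (ref $item) {
            my $label = join ' ', grep { defined $_ } ($item->{type}, $item->{name});
            push @lines, "$pad$label", outline($item, $depth + 1);
        }
        else {
            push @lines, "$pad$item";
        }
    }
    return @lines;
}

my $tree = build_tree(<<'END');
page intro {
    What is your name?
    group favourites {
        What is your favourite colour?
    }
}
END

print "$_\n" for outline($tree, 0);
```

Containers and plain items share one items list here, so the original order of the file is preserved.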

Of course, if your format includes escape characters, multi-line elements, or other complications, you may need to use one of the industrial-strength parser-generators... But for a simple format, we can roll our own.

Take a look at the output of the below, with and without $fewer_indents set true, and modify as desired.

Assuming all of your data is as clean looking as this example (i.e. everything indented nicely, all questions exactly one line, etc.), then writing a sequence of regexes to convert this to YAML -- or even directly to Perl code -- should be pretty straightforward.