Phase 3 is, to a certain extent, catered for by SWI OS_SubstituteArgs, but there appear to be no native RISC OS resources for phase 1. The PCRE (Perl Compatible Regular Expression) library is rather large and not all that efficient compared to the leaner and more modern PEGs (Parsing Expression Grammars), which have only become more popular since 2004. If one were to have a relocatable module providing a PEG virtual machine, what would its interface look like? Perhaps an SWI with entry:

R0 pointer to string to be analyzed
R1 pointer to PEG pattern
R2 pointer to buffer for captured indices
R3 length of buffer

and exit:

R0 number of characters matched.

The PEG pattern syntax would include symbols indicating that the current position in the string is to be captured.
Would such a resource be of use to anybody? Is this a sensible project?

The notion of a PEG has been around, under different names, since the 1970s. Birman (1970/73) called it a TMG Recognition Schema, after TMG, an earlier compiler-compiler whose name is short for "transmogrifier". Aho and Ullman (1972) called it a TDPL, or Top-Down Parsing Language. The name PEG was coined by Bryan Ford (2004), who revived the concept and pointed out its advantages.
PEGs do have considerable advantages over Regular Expression Grammars for most computing problems. They run faster, they recognize a wider class of strings, and they require far fewer lines of code to implement. The source code for the PCRE (Perl Compatible Regular Expression) library is nearly a megabyte in size. That for LPeg is 50K. The trouble with the traditional pattern-matching syntax is that it conflates patterns with strings, in which certain characters, the magic characters, carry a special meaning. Because those characters are also needed in their unmagical sense, an escape-character convention is needed, and that is a recipe for unreadability. LPeg has a much cleaner syntax because it recognizes that patterns and strings are different datatypes.

So PEGs, though around since 1970, did not gain much traction till 2004, long after the early days of RISC OS. I noticed recently that they seem to be becoming more popular, and as they look easier to code than RegExes, maybe this would be an interesting project for an ARM assembler enthusiast? There are one or two papers detailing the instructions for a simple virtual machine for parsing, with state given by an index into the text to be parsed, an index into the VM instructions (program counter), a stack for saving indices when backtracking is needed, and a boolean for success/failure.
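To make the virtual machine state described above a bit more concrete, here is a minimal sketch in Python rather than ARM assembler. The instruction names (char, any, choice, commit, end), the relative-offset convention and the -1 failure result are my own assumptions for illustration, only loosely modelled on the published parsing machines; it is not meant as anybody's definitive design.

```python
def run(program, text):
    """Run a list of (opcode, argument) pairs against text.

    Returns the number of characters matched, or -1 on failure
    (the -1 convention is an assumption made for this sketch).
    """
    pc = 0       # index into the VM instructions (program counter)
    pos = 0      # index into the text being parsed
    stack = []   # saved (alternative pc, text index) pairs for backtracking
    while True:
        op, arg = program[pc]
        ok = True                      # success/failure flag for this step
        if op == "char":               # match one given character
            if pos < len(text) and text[pos] == arg:
                pos += 1
                pc += 1
            else:
                ok = False
        elif op == "any":              # match any single character
            if pos < len(text):
                pos += 1
                pc += 1
            else:
                ok = False
        elif op == "choice":           # remember where to resume on failure
            stack.append((pc + arg, pos))
            pc += 1
        elif op == "commit":           # first alternative won: drop the entry
            stack.pop()
            pc += arg
        elif op == "end":              # whole pattern matched
            return pos
        else:
            raise ValueError("unknown opcode: %r" % (op,))
        if not ok:
            if not stack:
                return -1              # nothing left to try
            pc, pos = stack.pop()      # backtrack to the saved alternative


# Hand-assembled program for the pattern "ab" / "ac"
PROGRAM = [
    ("choice", 4),   # on failure resume at instruction 4
    ("char", "a"),
    ("char", "b"),
    ("commit", 3),   # skip the second alternative
    ("char", "a"),
    ("char", "c"),
    ("end", None),
]
```

With this program, run(PROGRAM, "ac") fails on the "b" of the first alternative, pops the saved entry, and succeeds on the second. The main loop is short enough that a translation into a handful of ARM instructions per opcode looks plausible.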

I am writing up some notes on pattern matching which I will put on my website soon. The idea at the back of my head is a library of BASIC assembler routines. The patterns are Parsing Expression Grammars – slightly different from regexps because alternatives are prioritized. Thus

A|B
match with A and, if that fails, try B.

I am trying to be as neutral as possible about concrete syntax. The only way to do this sort of thing is to have a simple virtual machine, and to optimize the code at the VM instruction level before translating to ARM assembler. Only very few ARM instructions would figure, so I guess any old language could be used for the compiling. However, when it comes to capturing information from a pattern match, a lot depends on the ambient language, as different languages support different datatypes (even if they share the same name :). For a really minimal type of capture I am going for a single pattern which always succeeds and, as a side-effect, pushes the current text-pointer onto a stack. On exit from the match the stack would be exported; basically one captures where things match, rather than what, which one can do afterwards anyway by extracting a substring.
Actually, implementing this project requires making more decisions, notably the choice of VM.
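Prioritized choice and the minimal position capture can be sketched with a small recursive interpreter. This is Python, not BASIC assembler, and the tuple-based abstract syntax ("lit", "seq", "choice", "cap") is purely hypothetical, chosen to avoid committing to any concrete syntax. The comparison with Python's re module at the end is only there to show where PEG choice and backtracking regexp alternation part company.

```python
import re


def match(pat, text, pos, caps):
    """Return the new text index on success, or None on failure."""
    kind = pat[0]
    if kind == "lit":                       # literal string
        s = pat[1]
        return pos + len(s) if text.startswith(s, pos) else None
    if kind == "seq":                       # each part in turn
        for p in pat[1:]:
            pos = match(p, text, pos, caps)
            if pos is None:
                return None
        return pos
    if kind == "choice":                    # prioritized: first success wins
        for p in pat[1:]:
            mark = len(caps)
            end = match(p, text, pos, caps)
            if end is not None:
                return end                  # later alternatives are never tried
            del caps[mark:]                 # drop captures of the failed branch
        return None
    if kind == "cap":                       # always succeeds, pushes position
        caps.append(pos)
        return pos
    raise ValueError("unknown pattern kind: %r" % (kind,))


def run(pat, text):
    """Return (characters matched, captured indices), or (None, []) on failure."""
    caps = []
    end = match(pat, text, 0, caps)
    return (end, caps) if end is not None else (None, [])


A_OR_AB = ("choice", ("lit", "a"), ("lit", "ab"))

# Capture where the choice starts and ends: "a" wins, so "ab" is never tried.
end, caps = run(("seq", ("cap",), A_OR_AB, ("cap",)), "ab")
captured = "ab"[caps[0]:caps[1]]            # the "what" recovered from the "where"

# Once "a" has succeeded the choice is not revisited, so this fails as a PEG,
# whereas a backtracking regexp goes back and tries "ab".
peg_result = run(("seq", A_OR_AB, ("lit", "c")), "abc")
regex_result = re.match(r"(a|ab)c", "abc")
```

Here the exported stack is just a list of text indices, and the substring is extracted afterwards by slicing, which is the "where rather than what" idea.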

I think I see what you are getting at. But a pattern is like a program source; it must be compiled or interpreted to act on a piece of text. I am not too bothered about the concrete syntax – that is window dressing. But the abstract syntax is another matter. That has to be fixed before one sets out. Simpler to test out a pattern matcher as a standalone application first. Too complicated to jump in with a relocatable module until an application is sorted.
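As a rough illustration of the "pattern as program source" point, once an abstract syntax is fixed the compile step can be quite small. The sketch below is Python, with a hypothetical tuple-based abstract syntax and hypothetical VM opcodes (char, choice, commit, end, with relative offsets); none of these names come from an existing library.

```python
def compile_pattern(pat):
    """Translate an abstract-syntax tuple into a list of (opcode, argument)."""
    kind = pat[0]
    if kind == "lit":                       # one char instruction per character
        return [("char", c) for c in pat[1]]
    if kind == "seq":                       # concatenate the code of each part
        code = []
        for p in pat[1:]:
            code += compile_pattern(p)
        return code
    if kind == "choice":                    # binary prioritized choice
        first = compile_pattern(pat[1])
        second = compile_pattern(pat[2])
        # Layout:  choice L1 ; first ; commit L2 ; L1: second ; L2:
        return ([("choice", len(first) + 2)] + first +
                [("commit", len(second) + 1)] + second)
    raise ValueError("unknown pattern kind: %r" % (kind,))


def compile_program(pat):
    """Compile a whole pattern and terminate it with an end instruction."""
    return compile_pattern(pat) + [("end", None)]


# Abstract syntax for the pattern "ab" / "ac"
CODE = compile_program(("choice", ("lit", "ab"), ("lit", "ac")))
```

Any concrete syntax would only need a parser that produces these tuples, which is why it can be treated as window dressing and decided last.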

I read a bit about PEG and it seemed to me to be a case of swings and roundabouts in comparison to regular expressions. Nevertheless, an implementation of PEG is going to be interesting.

The problem of the various regexp systems already in RO programs, and their incompatible syntax, is real for the user. Because they are integral, I am not sure that there is scope to unify them in the way proposed. It is a case of: “I would not start from here”.