Elgon has asked for the wisdom of the Perl Monks concerning the following question:

Hi Folks,

I've got a dandy little regexp-related problem for you all: I am writing a little module which takes a molecular formula and converts it into a hash where the keys are a unique list of elemental constituents and the values are the number of atoms present in the molecule. Sounds easy - believe me, if you're as naff as I am at regexps it ain't!

So we have our formula in $formula...
First we want to get rid of bracket pairs without coefficients next to them, so I thought something like this...

1 while $formula =~ s/(\()(\[A-Za-z0-9()]+)(\)\D)/$2/e;

but this can go wrong, as in some cases the greedy matching will chop out brackets which don't pair up with each other... Help!

Then we want to expand brackets which are followed by a coefficient of two or more (if they're followed by 1 as a coefficient - and they shouldn't really be - then they effectively don't have a coefficient and should just have the brackets removed...). In this case we should repeat what's inside the brackets as many times as the coefficient says (which the following may or (more likely) may not do!)

1 while $formula =~ s/(\()(\[A-Za-z0-9()]+)(\)\)([0-9]+)/$4x$2/e;

Once these two tasks are done I have a way of doing the rest, but I cannot work out the correct regexps to do the above tasks - I just don't have the knowledge, the experience or a copy of "Mastering Regular Expressions"!

Just to clarify, if we have the following formula...
Mo(PH3)4(CO)(NH2C2H5) for example, it should become... Mo(PH3)4CONH2C2H5 after the first regexp and then MoPH3PH3PH3PH3CONH2C2H5 at the end, which I can parse nicely myself. Note that if you have a series of brackets... (...(...)...(...)...) they need to be processed in the correct order, which really has me scratching my head I can tell you.

I will bow in deep respect to anyone who can give me a hand on this one as it has got me a bit stumped. (For the record it is not for an assessed piece of work - I am a chemist after all - but a mixture of general interest and boredom.) Virtual beer to you!

"Without evil there can be no good, so it must be good to be evil sometimes.
--Satan, South Park: Bigger, Longer, Uncut.

I think you've got the right idea using 1 while s///, because you're matching from inside out rather than left to right. Here's one way to do the whole substitution all at once:
1 while s/\(([^\(\)]+)\)((?:\d+)?)/ $1 x ($2 || 1) /ge;
This matches a parenthesized substring that does not itself contain any parentheses, and optionally a subsequent number, and replaces it with the substring, minus the parentheses, repeated the appropriate number of times.
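As a quick check, here is a minimal sketch that wraps the substitution above in a sub and runs it on the formula from the question. The names expand_formula and count_atoms are made up for illustration, and the counting step assumes element symbols are one capital letter optionally followed by one lowercase letter.

```perl
use strict;
use warnings;

# Expand every parenthesised group, innermost first, using the
# substitution from the reply above (applied to $formula, not $_).
sub expand_formula {
    my ($formula) = @_;
    1 while $formula =~ s/\(([^\(\)]+)\)((?:\d+)?)/ $1 x ($2 || 1) /ge;
    return $formula;
}

# Count atoms in a bracket-free formula such as "PH3PH3".
# Assumption: a symbol is [A-Z][a-z]?, followed by an optional count.
sub count_atoms {
    my ($flat) = @_;
    my %count;
    while ($flat =~ /([A-Z][a-z]?)(\d*)/g) {
        $count{$1} += $2 eq '' ? 1 : $2;
    }
    return \%count;
}

my $flat = expand_formula('Mo(PH3)4(CO)(NH2C2H5)');
print "$flat\n";    # MoPH3PH3PH3PH3CONH2C2H5

my $atoms = count_atoms($flat);
print "$_ => $atoms->{$_}\n" for sort keys %$atoms;
```

Because each pass only touches groups with no parentheses inside them, nested brackets get expanded from the inside out, which is the "correct order" the question asks about.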

I'll go away and play with these suggestions, which seem quite groovy (insofar as I can tell which ain't that far!) The reason for all of this is sort of related to my final-year project but not actually included in it (the project is in PHP): My tutor wrote a routine to do this kind of thing, which took him ages in some other language and I'm trying to introduce him to the power of Perl (and by extension, Perlmonks.)

In the first line we are replacing any parens and their contents (which may not include parens), if followed by a non-digit, with just the contents ($1) and the non-digit ($2). The |$ alternative lets the pattern also match a closing paren that is the last character of a line.
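A pair of substitutions fitting that description might look like the sketch below. The exact patterns, the second (coefficient-expanding) line and the name expand_two_step are assumptions pieced together from the description, not necessarily the code this reply refers to.

```perl
use strict;
use warnings;

# Hypothetical two-step version: keep looping while either step changes
# the formula, so nested groups are handled from the inside out.
sub expand_two_step {
    my ($formula) = @_;
    1 while
        # Line 1: innermost parens followed by a non-digit (or the end
        # of the string) are simply dropped; $2 puts the non-digit back.
        $formula =~ s/\(([^()]+)\)(\D|$)/$1$2/
        # Line 2: innermost parens followed by a coefficient are
        # replaced by their contents repeated that many times.
     || $formula =~ s/\(([^()]+)\)(\d+)/$1 x $2/e;
    return $formula;
}

print expand_two_step('Mo(PH3)4(CO)(NH2C2H5)'), "\n";
# MoPH3PH3PH3PH3CONH2C2H5
```

On the example formula the first line fires twice, giving Mo(PH3)4CONH2C2H5, and only then does the second line expand (PH3)4.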

Do a while loop on $formula =~ s/\(([A-Za-z0-9]*)\)/Q$i/. This will only match innermost compositions (those with no additional parens inside). You'll replace these with Q1, Q2, etc. (or if you're worried about more, you can use Qa, Qb, or QA, QB, etc. since I suspect you're considering chemical symbols with no more than 2 letters). $1 will capture the inner composition, which you should work out and associate in a hash with the Q variable. Note that you remove those inner parens when you do this.

The next time around, if there are still more parens, you'll capture those. Now you can consider the Q series and do any necessary multiplication from those as well.

Once you exit this while loop, you'll have no more parens, so you can calculate the final composition with no problems.
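Roughly, the placeholder idea could be sketched like this. It departs from the description in two ways that are my own assumptions: the placeholder is written {N} rather than QN so that a following coefficient cannot run into the placeholder number, and the per-group compositions live in an array indexed by N rather than a hash keyed by the Q name. The sub names are made up for illustration.

```perl
use strict;
use warnings;

# Count atoms in a paren-free fragment. A {N} placeholder stands for
# the already-counted group number N in @$groups.
sub count_fragment {
    my ($fragment, $groups) = @_;
    my %count;
    while ($fragment =~ /(?:\{(\d+)\}|([A-Z][a-z]?))(\d*)/g) {
        my ($id, $element, $n) = ($1, $2, $3 eq '' ? 1 : $3);
        if (defined $id) {
            # Multiply out the stored composition of the inner group.
            my $inner = $groups->[$id];
            $count{$_} += $inner->{$_} * $n for keys %$inner;
        }
        else {
            $count{$element} += $n;
        }
    }
    return \%count;
}

sub formula_to_counts {
    my ($formula) = @_;
    my @groups;
    # Each pass finds one innermost (...) and swaps it for a placeholder.
    while ($formula =~ /\(([^()]*)\)/) {
        my ($start, $length, $inner) = ($-[0], $+[0] - $-[0], $1);
        push @groups, count_fragment($inner, \@groups);
        substr($formula, $start, $length, '{' . $#groups . '}');
    }
    # No parens left: work out the final composition.
    return count_fragment($formula, \@groups);
}

my $atoms = formula_to_counts('Mo(PH3)4(CO)(NH2C2H5)');
print "$_ => $atoms->{$_}\n" for sort keys %$atoms;
```

Unlike expanding the string, this never builds a long repeated formula; the coefficient is applied as a multiplication on the stored counts.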

-----------------------------------------------------
Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
"I can see my house from here!"
It's not what you know, but knowing how to find it if you don't know that's important