I'm parsing a file that contains 25 columns per line and each column might have it's own way of validation. Either it could be whitespace or a full char-set.

That validation is required since I have to drop away the line that doesn't match that kind of part.

Might not be necessary

It's necessary

I don't believe that the expression it's badly constructed, the lookahead it's used, maybe in the part that I repeated the code (I just don't remembered the capturing group index \1...\n, if this is what you mean!) I also believe that catastrophic backtracking is happening here.

If you see the file, maybe you'll understand why I'm doing this! Let's put as an example the first column. I have to match a "Part Number" and I don't have any rule of how to do this, examples:

123456789

1 555 989

0123456789123456789

Neither a simple \S+, or (\S+\s){1, } could solve this problem, Cause I won't be guaranting data integrity.

The regexp doesn't change in the loop, so pull the lexer = assignment up and out of the loop. ("Loop invariant code motion.")

I'm not sure if this will be faster or slower but the repeated subexpressions could just be a single outer repeating group.

So this is obvious, I'm sure, but it might make sense to write a real parser. Ruby has an advanced PEG parser generator called Treetop. The traditional approaches are either Lex + Yacc (such as flex + bison) or just a C program implementing a scanner and a recursive-descent parser. Old-school, yes, but easily written and kind of a fun change from yet another CRUD program.

I agree with the other responders that the expression is unnecessarily redundant, complicated and should probably be broken up into at least two separate and smaller ones. Two expressions could be interesting because they could both be oppositely anchored.

* It was a mistake to put that code inside the loop, and moving the regular expression outside the loop made no diference. * It was necessary and even more readable for editing * That's a good idea, I'm just going to try it out!
–
EderMay 27 '11 at 20:17

As far as I can tell, there are only a few significant elements that you're looking for in the line.

You're looking for a alpha-numeric that starts with Z. (Z(?:RO[A-DHJ]|EQ[A-C]|HIB|PRO|PRP|RMA)|H(?:IB[2E]|ALB)|F(?:ER[2T]|LUP2|ST4Q))\s*

You're looking for a 6 digit number(\d{6})\s*

You're looking for a date(?=((?:\d{2}\/){2}\d{4}))\s*
((?:\d{2}\/){2}\d{4})\s*

(Note that the last date expression is an example of how badly constructed this is. It has a lookahead for a date, immediately followed by a non-capturing match of that exact same date. The look-ahead is therefore, meaningless. See the other answer on catastrophic backtracking, which is undoubtedly happening here.)

Pretty much everything else in that expression could match anything/nothing that I find it hard to believe they're expressing rules of any meaningful form. Real data just doesn't vary that much, or it varies with some precise rule that could be expressed in one (or more regexes).

So, junk the expression completely and stop trying to do it all at once. Do individual pieces (splitting the line on \s as necessary) and break this up into a several small regex'es that actually match what you really need to verify.