Basically, I need to extract all data between "UB" and "=" into $1 so I can operate on it later. It can be either two or three lines long.

My code is the following:

    while (<>) {
        /(UB.*=)/sm;
        print OUT "$1\n"; # This prints to file so I can check the output. Once the code works, I'll be removing it.
    }

The regex as coded keeps outputting empty $1. /(UB.*)/sm; returns the first line, /(.*=)/sm; returns the last line. I just can't put it all together. If it's better/easier, I do have an option to remove all \n from the file. (But then, if it's essentially all "one line", would the regex fire multiple times across that line?)

Your problem has nothing to do with the regular expression. It is fine as written. The problem is that the diamond operator (<>) reads one line at a time, and no single line contains both "UB" and "=", so the match fails on every line. When you change the regex to look for only one of the two, only the first or the last line matches.

You want to 'slurp' the entire file into a single string before you do the match. The minimum change to correct your problem is to undefine the INPUT_RECORD_SEPARATOR ($/). Of course, your while loop would then only run once.
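Here is a minimal sketch of the slurp approach, assuming entries look roughly like the invented sample below (the real layout may differ). One thing to watch: once the whole file is a single string, a greedy .* under /s would run from the first "UB" to the very last "=", so this sketch swaps in the non-greedy .*? and uses /g to collect every match.

```perl
use strict;
use warnings;

# Invented sample data; the real file layout is an assumption.
my $sample = "UBUS01 KWBC 100000\nUA /OV ABC /TM 0001\n/FL350 /TP B737=\n"
           . "noise line\n"
           . "UBUS02 KWBC 100010\nUUA /OV XYZ=\n";

# An in-memory filehandle stands in for the real input file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

# With $/ undefined (localized to this block), one read returns the whole file.
my $text = do { local $/; <$fh> };
close $fh;

# /s lets "." match newlines; ".*?" stops at the first "=" after each "UB";
# /g in list context returns every captured block.
my @blocks = $text =~ /(UB.*?=)/sg;

print "[$_]\n" for @blocks;
```

For a 50 MB file, holding everything in memory like this should be fine on most machines.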

I have some very large (50mb+) text files from which I need to clean up the data.

By the standards of the files I work with, these are very SMALL files. The files I work on are typically between 10 and 20 GB, and sometimes reach 700 GB or even more.

;-)

You don't give enough information on your input file, but I would think that reading the file with the input record separator ($/) set to "=" or "=\n" would probably help you very much, since each read would then return one whole entry instead of one line.
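A rough sketch of that record-separator idea, assuming every complete entry ends with "=" followed by a newline (the sample data is invented). Only one record is held in memory at a time, so this should also scale to much larger files.

```perl
use strict;
use warnings;

# Invented sample data; the real file layout is an assumption.
my $sample = "junk\nUBUS01 KWBC 100000\nUA /OV ABC\n/FL350=\n"
           . "UBUS02 KWBC 100010\nincomplete entry\n";

# An in-memory filehandle stands in for the real input file.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @entries;
{
    local $/ = "=\n";    # each read now ends at "=\n" instead of "\n"
    while (my $record = <$fh>) {
        # Keep the part from the first "UB" up to the terminating "=".
        # A final chunk that never reached "=" fails the match and is skipped.
        push @entries, $1 if $record =~ /(UB.*=)\n\z/s;
    }
}
close $fh;

print "[$_]\n" for @entries;
```

Note that if an incomplete entry sits directly in front of a complete one, both land in the same record, so the captured text would start at the incomplete entry's "UB". That case would need an extra check.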

If I can't open the text file in Notepad and parse it by hand, to me it's very large. ;) (I don't do this very often.)

My complete input file can be found here: http://vortex.plymouth.edu/~stjones00/Apr10.txt

The problem I have is there are incomplete entries mixed in with complete entries (plus other extraneous entries I don't want), so I need to parse out the wanted data. (It will begin with UA or UUA and end with =, but I need the leading line, hence my beginning the pattern with UB.) Then I need to take these individual, complete entries and perform some operations on them. (Test for a specific value, remove \n, etc.)

My thought was to match the pattern into $1 and send that to a subroutine to perform the operations.
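Something along these lines might work for handing each match to a subroutine. The sub name process_entry, the "UUA" test, and the choice to join lines with a space are all illustrative assumptions, as is the sample text.

```perl
use strict;
use warnings;

# Hypothetical helper: the name, the "UUA" test and the space-joining
# are illustrative assumptions, not requirements from the thread.
sub process_entry {
    my ($entry) = @_;                    # copy $1 before running another regex
    return unless $entry =~ /\bUUA\b/;   # example test for a specific value
    (my $flat = $entry) =~ s/\n/ /g;     # join the lines of the entry
    return $flat;
}

# Invented sample text standing in for the slurped file.
my $text = "UBUS01 KWBC\nUA /OV ABC=\nUBUS02 KWBC\nUUA /OV XYZ\n/FL350=\n";

my @results;
# In scalar context, /g resumes after the previous match on each iteration.
while ($text =~ /(UB.*?=)/sg) {
    my $flat = process_entry($1);
    push @results, $flat if defined $flat;
}

print "$_\n" for @results;
```

Copying the argument into a lexical at the top of the sub matters here, because $1 is overwritten by the next successful match.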