I have a program where I have to search through a huge array with 3,000,000 items and return every matching item in @prkeys. Each time the program runs I'm running this search routine 3000+ times. It can take me 90+ minutes to run the whole program. Here's the code that is slowing me down.

Code

# Regex will be:  MODEL.+OLDPRICE
$k = $basemodel . '.+' . $oldprice;

# Find all models that start with key and end with price.
@k = grep(/^$k/, @prkeys);

The intention of this code is to do a fuzzy match: find strings that begin with a model number and end with a price. This is my last resort to find a match on a model, and it must be a fuzzy match of some kind. I could split the search into two parts, but both parts would still be regexes, and splitting it in two would likely slow me down even more.

@prkeys is an array that contains 3,000,000+ items.

@prkeys is all in memory.

Each string in @prkeys is 5-30 characters long.

I must return every item that matches in @prkeys.

Each time I search for $k, $k starts with a model number and ends with a price, so the regex looks like /MODEL.+PRICE/.

Because of the bad data the customer gives us I do have to use this search method of searching 3,000,000 strings.

This program runs on a virtual machine that shares a physical server with other VMs, and I suspect the other VMs are also slowing me down. I cannot move the VM to another physical server, so I must address speed in the code itself.

I have a test program to test read speed but I have no other ideas how to speed this up. Speed normally is not an issue for me.

Questions

How can I speed this up? Each time this one line runs it takes about 2 seconds. That's 6000 seconds just for this one line only, not counting any other overhead and processing for the rest of the program.

Will I have to use another data structure to search for all this data?

Thank you for your help. I normally don't have to do such searching on a huge dataset.

I will post a link to the huge file and a test program for you to use shortly.

When you read your file, split each record into the key and the price. Push each price onto an array ref stored in a hash entry under that key. Then, when looking up, find the hash entry for the key and search only the small array ref stored there.

Example. Suppose you have this input data:

Code

key1 price1
key1 price2
key2 price1
key2 price3
key2 price4

For this small dataset, you would construct a hash with two entries, for key1 and key2, and the hash values would be the price lists:

Code

(
    key1 => [ price1, price2 ],
    key2 => [ price1, price3, price4 ],
)

Then, when you are doing the lookup, if you have key1, you grep an arrayref containing only two values; if you have key2, you grep among three values. You're no longer scanning the entire dataset each time, because hash access takes roughly constant time.

Note that you could also possibly use a hash of hashes, that would be even quicker, but whether you can do it depends on the details of your data.

Some quick untested pseudo-code for constructing the hash of arrays (making some simple assumptions about your data, you'll have to adjust to their format):
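Something along these lines (untested; it assumes each record is a single whitespace-separated "key price" string, so adjust the split to your real delimiter):

```perl
use strict;
use warnings;

my @prkeys = (            # stand-in for your 3,000,000 records
    'key1 price1',
    'key1 price2',
    'key2 price1',
    'key2 price3',
    'key2 price4',
);

# Build the hash of arrays once, up front.
my %prices_for;
for my $record (@prkeys) {
    my ($key, $price) = split ' ', $record, 2;
    push @{ $prices_for{$key} }, $price;
}

# Lookup: one hash access, then grep only the short per-key list.
my @k = grep { /^price/ } @{ $prices_for{'key2'} || [] };
print "@k\n";    # price1 price3 price4
```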

Because guessprice() is called 3000+ times, using $_[n] instead of named parameters cuts the time by about half, which is amazing. I'm thinking I don't even need this line at all: "my($base,$oldpr)=@_;" One website said Perl does a lot of excessive string copying.
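A rough, untested way to check that claim with the core Benchmark module (the numbers will vary by Perl version and argument size; the long string here is a made-up stand-in). The difference exists because my(...)=@_ copies each argument, while $_[0] and $_[1] are aliases to the caller's values, so no copy is made:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $long = 'x' x 10_000;    # hypothetical long argument

sub with_copy { my ($base, $oldpr) = @_; length($base) + length($oldpr) }
sub no_copy   { length($_[0]) + length($_[1]) }

# -1 means "run each sub for about one CPU second" and compare rates.
cmpthese( -1, {
    copy    => sub { with_copy($long, $long) },
    no_copy => sub { no_copy($long, $long) },
} );
```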

This OS is a Virtual Machine and it shares space with other VMs on a single physical server, and I think other VMs are also slowing this program down, but I cannot fix that part. -----

You don't need to, nor should you, upload an 800 MB file. Reduce your example to a small but representative size that still demonstrates the problem.

As I mentioned in your other thread, profiling should be your first step. Then write short test scripts on the sections that need optimization which can be used to benchmark different approaches in doing those portions.

How did you determine that is the statement that needs to be optimized?

There's not much you can do to optimize that 1 statement. The bigger issue you should be looking at is the grepping of that array 3,000+ times. You should look into better ways to do that loop rather than focusing on that 1 grep statement.

Since I'm looking for a partial key, a hash will not work here. So that's why I went with a regex.

I don't think that's true. As I already said (though perhaps you did not see my post, since we posted at almost the same time), given your data set, a hash will do if you reduce the hash key to the partial data key. Your regex (or at least its first part) does nothing more than keep part of your data key; you can do the same with your hash key.

Don't say nay if you don't have a really good reason to do so. From everything I have seen about your data, it should just work. Try it, your program will probably run so much incredibly faster that you'll find it hard to believe (you might even end up thinking the program died, whereas it will have completed its task).

How did you determine that is the statement that needs to be optimized?

By stepping through the code. I can watch that it actually takes 3 seconds to execute that one line once. I don't actually need to profile code in a complex way if all other areas take a fraction of a second to execute and this one line takes 3 seconds. You don't need a Corvette just to go get groceries.

I actually ended up using a hash of arrays. The key to the hash is the first 4 characters of each model, so the array stored under each key is very small, and my grep scans a much smaller chunk of data.
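In outline, the bucketing looks like this (untested sketch; the record format is made up, and the quotemeta calls are my own addition to guard against regex metacharacters in the model or price):

```perl
use strict;
use warnings;

my @prkeys = (            # hypothetical data in "MODEL PRICE" form
    'AB12-001 9.99',
    'AB12-002 4.50',
    'ZZ99-001 1.00',
);

# Bucket every record by the first 4 characters of its model number.
my %bucket;
for my $rec (@prkeys) {
    push @{ $bucket{ substr($rec, 0, 4) } }, $rec;
}

# The old grep over 3,000,000 items becomes a grep over one small bucket.
my ($basemodel, $oldprice) = ('AB12-001', '9.99');
my $k = quotemeta($basemodel) . '.+' . quotemeta($oldprice);
my @hit = grep { /^$k/ } @{ $bucket{ substr($basemodel, 0, 4) } || [] };
print "@hit\n";    # AB12-001 9.99
```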

It works great, and with all the other improvements my program now takes only 10% as long as it used to. -----

Good question. I cannot find an answer. I have always assumed that qr// works like the other regex operators: it recompiles if the pattern contains a variable, but compilation is much faster when the variable already holds a pre-compiled pattern. Under this assumption, we control where the compilation is done by where we place the qr//, and we gain speed when we move the compilation out of a loop. Good Luck, Bill
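Under that assumption, moving the compilation out of the loop would look something like this (hypothetical data; the \Q...\E is my addition to protect any metacharacters in the interpolated variables):

```perl
use strict;
use warnings;

my @prkeys = ('MODEL123 a 19.99', 'MODEL999 b 5.00');   # hypothetical data

my ($basemodel, $oldprice) = ('MODEL123', '19.99');
my $re = qr/^\Q$basemodel\E.+\Q$oldprice\E/;   # compiled once, before the loop

for my $pass (1 .. 3) {                        # stands in for the repeated searches
    my @k = grep { $_ =~ $re } @prkeys;        # the compiled pattern is reused
    print scalar(@k), " match(es) on pass $pass\n";
}
```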