I am trying to measure clustering in a network by counting how many groups of 4 nodes have at least 3 connections among them (and also counting how many have 4 connections among them). The input file is essentially "node A touches node B". I am using a series of nested loops and it is taking upwards of 10 hours to run on my really large data sets. Any ideas for how to speed this up?

The array @gene_list ends up being about 1000 nodes in total and the file input of connections is about 9000 entries long.

Code

print "Taking some time to read file...\n";
open FILE2, '<', 'my file.txt' or die $!;

Another thing you might consider is using the foreach syntax, which is more Perlish and is supposed to be slightly faster than the C-style for. It usually does not make much difference, but since these instructions are repeated so many times, it makes sense to try something like:
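A minimal sketch of the two loop styles side by side (the array contents and counter are made up for illustration):

```perl
use strict;
use warnings;

my @nodes = ('a' .. 'j');
my $count = 0;

# C-style loop: explicit index bookkeeping and a condition
# check evaluated on every pass
for ( my $i = 0; $i < scalar @nodes; $i++ ) {
    $count++;
}

# foreach over a range: Perl handles the iteration internally,
# which is usually a little faster and easier to read
foreach my $i ( 0 .. $#nodes ) {
    $count++;
}
```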

I have tried your solution and mine with only 3 loops ($i, $j and $k, as just above): my code ran in 15 seconds, yours in 45 seconds. (This does not mean you will get the same ratio of improvement: in both cases, the innermost loop did almost nothing (it only incremented a variable), whereas your real code is doing many other things, so optimizing the looping mechanism will not pay off as much relative to total execution time.)

Finally, look very carefully at everything happening in the two innermost loops; they are executed so many times that any optimization you find there will pay off a lot. In particular, you again have this "scalar @nodes" (or whatever) that should be changed to a precalculated variable, so that the size of the array is not recalculated on every iteration.
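The suggestion above can be sketched like this (the variable name $n_nodes and the array contents are hypothetical):

```perl
use strict;
use warnings;

my @nodes = ('a' .. 'z');

# Instead of writing the bound inline in a hot loop, e.g.
#   for my $i ( 0 .. scalar(@nodes) - 1 ) { ... }
# hoist the size out once, before the loops start:
my $n_nodes = scalar @nodes;

my $visits = 0;
for my $i ( 0 .. $n_nodes - 1 ) {
    $visits++;    # innermost-loop work goes here
}
```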

Having said that, I definitely agree with FishMonger that you should profile your code to get a better idea of where the program is spending a lot of time.

I refactored a bit by first making a hash of all of the interactions, which resulted in a much shorter grep down the line. While it is not super fast, it is probably about 10-fold faster thanks to the change in computing the size of the array and the use of the hash.
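One way to build such a hash of interactions, assuming the two-column (nodeA, nodeB) edge format described in the thread (the edge list and key format here are made up):

```perl
use strict;
use warnings;

# Hypothetical edge list, one array ref per (nodeA, nodeB) row
my @edges = ( ['g1', 'g2'], ['g2', 'g3'], ['g1', 'g3'], ['g3', 'g4'] );

# Key the hash on both orderings so a pair test is a single
# O(1) hash lookup instead of a grep over the whole edge list
my %connected;
for my $e (@edges) {
    my ( $a, $b ) = @$e;
    $connected{"$a\t$b"} = 1;
    $connected{"$b\t$a"} = 1;
}

# Checking whether two nodes touch is now just:
my $linked = exists $connected{"g1\tg3"} ? 1 : 0;
```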

I did try that, and it ended up being just about the same speed, but it is good to start using that style anyway.

I think the reason (using the module posted earlier) is that it is the second part of the code (where it takes the output of the quadruple loop and counts matches) that is taking the vast majority of time.

Well I do not understand well enough what you are trying to do exactly.

I tested today at work a complete Perl rewrite that I did of a PL/SQL-type program and managed to speed it up by a factor of more than 60 (on a given data set, the original program took 1 h 18 min to run; my new version ran in 1 min 14 sec, and I added one further improvement today that I haven't had time to test yet). But this required fully understanding what the original program was doing, so as to come up with a completely different approach that takes advantage of Perl's powerful data structures (hashes of hashes of hashes), which are not available in PL/SQL.

I do not understand well enough what you are doing to be able to guide you much more.

(To tell the truth, I have done quite a lot of performance improvement work over the last few years, but this is the first time I have obtained such a huge improvement ratio; my previous best was a factor of 20, with which I was already very happy and quite proud.)

What I am trying to do is take the first file which is organized in a two column format (nodeA, nodeB) with several thousand entries. Each row represents a network connection (so a connection between node A and node B in the network).

What I am trying to do is measure clustering by counting up the number of times that I find certain arrangements. In this case, I am looking for groups of 4 nodes that share 4 or more edges together (evidence of clustering).

Before the first loop, I take all of the node connections from the first list and remove any repeats with the hash, so we end up with a list of every node in the network.

The first loop iterates through this entire list of nodes to generate every possible combination. What the loop should be producing is: first: 1,2,3,4; second: 1,2,3,5; third: 1,2,3,6; and so on until all of the combinations are covered.

Each of these combinations is then run through grep to add up how many times one member of the group of 4 nodes is connected to another of the 4. If the count is more than 4, we add 1 to our tally and continue on to the next set of four genes.
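Using the adjacency hash discussed earlier in the thread, the per-combination count can be a handful of hash lookups instead of a grep over the edge list. A sketch under those assumptions (the hash keys, node names, and helper name edges_in_quad are all hypothetical):

```perl
use strict;
use warnings;

# Hypothetical adjacency hash, keyed "$a\t$b" in both orders
my %connected = map { $_ => 1 } (
    "g1\tg2", "g2\tg1",
    "g1\tg3", "g3\tg1",
    "g2\tg3", "g3\tg2",
    "g3\tg4", "g4\tg3",
);

# Count how many of the 6 possible pairs in a group of 4 nodes
# are actual edges in the network
sub edges_in_quad {
    my @quad  = @_;
    my $edges = 0;
    for my $i ( 0 .. 2 ) {
        for my $j ( $i + 1 .. 3 ) {
            $edges++ if exists $connected{"$quad[$i]\t$quad[$j]"};
        }
    }
    return $edges;
}

my $n = edges_in_quad( 'g1', 'g2', 'g3', 'g4' );
# this group has 4 of the 6 possible edges, so it would count
# toward the ">=4" tally
```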

Let me know if that makes sense!

Edit: The code i posted is adding up >2, >3, >5 hopefully this doesn't confuse you I was running that to check and see if it was working correctly on a dummy file earlier. What I am really looking for is >4, >5 but those can be changed to be whatever depending on how many connections we are looking for!

Consider using the function combinations from the module Algorithm::Combinatorics. Be sure to read the 'SUBROUTINES' section of the documentation for information on how to use the function as an iterator. I have not studied your code, but your example suggests that you are examining every permutation rather than every combination, which would carry a huge time penalty. Even if you are doing it correctly, I would expect the C code in the module to be faster than an all-Perl solution. Good Luck, Bill
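A minimal sketch of the iterator form (the node list is made up; C(5,4) = 5, so this example yields 5 groups of four):

```perl
use strict;
use warnings;
use Algorithm::Combinatorics qw(combinations);

my @nodes = ( 'g1' .. 'g5' );    # hypothetical node list

# In scalar context, combinations() returns an iterator, so the
# C(n,4) subsets are generated one at a time instead of all being
# built in memory at once
my $iter  = combinations( \@nodes, 4 );
my $quads = 0;
while ( my $quad = $iter->next ) {
    $quads++;    # $quad is an array ref of 4 distinct nodes
}
```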

I will look into the module and hopefully it will speed up that part of the code. I thought that with the $i, $i+1, $j+1, etc. sort of structure it would cover only the combinations (I don't think there would be any repeats in this loop as the list is unique).

However, it also really depends on the data in your file. If your file contains (nodeA, nodeB) and, somewhere else in the same file, (nodeB, nodeA), then you may really be processing permutations without knowing it.

If you can have that kind of unseen duplicate, then you should preprocess your file to remove them. With groups of 4 nodes, you could end up with as much as 24 times less processing to do, if I understand correctly what you are doing.
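One way to do that preprocessing is to sort each pair into a canonical order before deduplicating, so (nodeB, nodeA) collapses onto (nodeA, nodeB). A sketch with made-up data:

```perl
use strict;
use warnings;

# Hypothetical raw edge list containing the same edge in both orders
my @raw = ( ['a', 'b'], ['b', 'a'], ['b', 'c'] );

# Sorting each pair gives one canonical key per undirected edge
my %seen;
my @unique;
for my $e (@raw) {
    my $key = join "\t", sort @$e;
    push @unique, $e unless $seen{$key}++;
}

my $n_unique = scalar @unique;    # 2 distinct edges remain
```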

I agree with Laurent_R. You are taking combinations, but not combinations of the right things. You should be examining all combinations of four nodes. Your data repeats a node for every connection to it. Therefore, you are taking combinations of connections rather than nodes. Process your list of all nodes with List::MoreUtils qw( uniq ) to remove all duplicates before starting to compute combinations. Good Luck, Bill
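For illustration, a sketch of that deduplication step (on recent Perls the same uniq is also available from the core List::Util module, which is what this sketch uses; the node column is made up):

```perl
use strict;
use warnings;
use List::Util qw(uniq);    # List::MoreUtils::uniq behaves the same

# Hypothetical node column: each node appears once per connection
my @node_column = qw( g1 g2 g1 g3 g2 g3 g3 g4 );

# One entry per node, first-occurrence order preserved
my @nodes   = uniq @node_column;
my $n_nodes = scalar @nodes;    # 4 distinct nodes
```

Computing combinations over @nodes rather than @node_column is what ensures each group of four nodes is examined exactly once.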