15 November 2009

Little tricks for working with LARGE lists in Scala

Right now I’m processing a large amount of binary files (300+ files, 4MB each) in Scala and storing some of the results in a list. Some typical methods may not perform well on large lists, so here are the tips I can share with you guys from my trial & error:

Beware RECURSIVE methods!

This is such an obvious evil that you may already know it. If you’re processing a list of results and get a stack overflow error, then I suspect you’re somehow using a recursive method. Avoid that if possible. One method I’ve discovered to lead to stack overflow is dropRight, since I use this method to reduce list size. Others have also shown that foldRight can cause this error as well. If you want to know which methods may harm you, try using the Scala interpreter and constructing a List of integers larger than 4K elements. However, if you REALLY want to know, then I suggest you go read the source code.
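To illustrate the idea, here’s a toy sketch (these summing functions are made up for the example, they’re not from my real code). The naive version uses one stack frame per element, while the tail-recursive one gets compiled into a loop:

```scala
import scala.annotation.tailrec

// Naive recursion: one stack frame per element, so a long
// enough list blows the stack.
def sumNaive(xs: List[Int]): Int = xs match {
  case Nil    => 0
  case h :: t => h + sumNaive(t) // call is NOT in tail position
}

// Tail-recursive version: the compiler rewrites it into a loop.
// @tailrec makes compilation fail if the call ever stops being
// in tail position, so the guarantee can't silently break.
@tailrec
def sumTail(xs: List[Int], acc: Long = 0L): Long = xs match {
  case Nil    => acc
  case h :: t => sumTail(t, acc + h)
}

val big = List.range(0, 100000)
println(sumTail(big))   // 4999950000, works for any length
// sumNaive(big)        // would likely throw StackOverflowError
```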

Immutable Lists should be used cautiously.

Why? Because immutable is not mutable! That means the list may not change over time, and the only way to add / remove an element with an immutable list is to return a new list. If you’re constantly appending the processing result to a list at the end of each iteration, then you are reassigning the list every iteration. This can drain your memory real fast. My alternative: you may prefer scala.collection.mutable.ListBuffer instead.
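For example, a small sketch of the buffer approach (toy numbers, not my real processing): append into the mutable buffer inside the loop, then convert to an immutable List once at the end.

```scala
import scala.collection.mutable.ListBuffer

// ListBuffer appends in constant time; no new list is built
// on every iteration.
val buf = new ListBuffer[Int]
for (i <- 0 until 5)
  buf += i * i          // cheap in-place append

// Convert to an immutable List once, after the loop.
val result: List[Int] = buf.toList
println(result)         // List(0, 1, 4, 9, 16)
```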

Watch out for linear-time methods while processing lists

Processing multiple files may yield a list of results for each iteration, and each element has to be stored in another list (at the same index). Writing code like this inside the main file-reading loop:

for (i <- 0 until loopResult.length) {
  // Some other processing also going on
  processResult(i) += loopResult(i)
}

cost me an extra 60 seconds compared with an alternative loop, when processing files byte by byte with each loop result holding 20 elements. (Be advised that my actual logic is more complex than this; it’s just to show the idea.)

It could even be optimized further using zip, like below. However, I haven’t tried it myself, so I cannot guarantee this is better.

for ((value, buf) <- loopResult zip processResult) {
  buf += value
}

Process the “result list” after the file-reading loop, if possible.

It’s reasonable to process a file to get a result for each loop iteration and append it to a result list. However, if you require extra processing, e.g. sorting the list or filtering out the N least elements, then it’s better to do it after you’ve completely read the file, because doing this inside the loop will usually lead you to O(n^2), where n = file size. You can even use an immutable list and its sort method in this case. In my case, I had to filter out the least elements to reduce the size of the list from 300K to 40K. It took less than a few seconds when I did this after completely processing the whole file, compared to forever (it took so long that I killed the process) when I tried to achieve this within the file-reading loop.
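As a sketch (the numbers and the cutoff are made up), doing the sort once at the end and keeping only the N least elements looks like this:

```scala
// Stand-in for the fully collected result list.
val results = List(42, 7, 99, 3, 58, 21)
val keep    = 3   // keep the N least elements

// One O(n log n) sort after the loop, instead of repeated
// sorting/filtering inside it.
val smallest = results.sorted.take(keep)
println(smallest)   // List(3, 7, 21)
```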

If you’re interested in processing large files, then don’t forget to read Processing large files with Scala as well. That contribution really saved my ass, since I totally had no idea where to start at all.

However, I do believe these might not fit everyone, since you usually don’t have to work with large lists. Some tips might also slow down your coding speed, so please adjust these tips according to your needs.

That’s it for what I’ve discovered. You guys can share or even comment on this, since I’m just a Scala (and functional language) newbie.