Need a databag that does not register with SpillableMemoryManager and spills data proactively

Details

Description

POPackage uses DefaultDataBag during the reduce phase to hold data. That bag is registered with SpillableMemoryManager and is prone to OutOfMemoryError. It would be better to manage memory usage proactively: the bag fills memory up to a specified amount and dumps the rest to disk. The amount of memory used to hold tuples is configurable. This can avoid out-of-memory errors.
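
The behaviour described above can be illustrated with a minimal sketch. This is not the patch's code: the class name BoundedMemoryBag, its methods, and the choice to model tuples as single-line strings are all assumptions made for illustration. It only shows the idea of keeping a fixed number of items in memory and writing everything beyond that to a temporary file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Pig's actual bag: tuples are modeled as single-line strings.
final class BoundedMemoryBag implements AutoCloseable {
    private final int maxInMemory;                       // how many tuples may stay in memory
    private final List<String> inMemory = new ArrayList<>();
    private Path spillFile;                              // created lazily on first overflow
    private BufferedWriter spillWriter;
    private long spilledCount;

    BoundedMemoryBag(int maxInMemory) {
        if (maxInMemory < 0) {
            throw new IllegalArgumentException("maxInMemory must be >= 0");
        }
        this.maxInMemory = maxInMemory;
    }

    // Keep the tuple in memory while there is room; otherwise append it to the spill file.
    void add(String tuple) throws IOException {
        if (inMemory.size() < maxInMemory) {
            inMemory.add(tuple);
            return;
        }
        if (spillWriter == null) {
            spillFile = Files.createTempFile("bag-spill", ".txt");
            spillWriter = Files.newBufferedWriter(spillFile, StandardCharsets.UTF_8);
        }
        spillWriter.write(tuple);
        spillWriter.newLine();
        spilledCount++;
    }

    int inMemoryCount() { return inMemory.size(); }

    long spilledCount() { return spilledCount; }

    // Returns all tuples in insertion order. A real bag would stream from disk
    // instead of materializing everything; this is only for demonstration.
    List<String> readAll() throws IOException {
        List<String> all = new ArrayList<>(inMemory);
        if (spillWriter != null) {
            spillWriter.flush();
            all.addAll(Files.readAllLines(spillFile, StandardCharsets.UTF_8));
        }
        return all;
    }

    @Override
    public void close() throws IOException {
        if (spillWriter != null) {
            spillWriter.close();
            Files.deleteIfExists(spillFile);
        }
    }
}
```

Because the bag decides for itself when to go to disk, nothing outside it needs to trigger a spill, which is the point of not registering with a memory manager.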

Olga Natkovich
added a comment - 24/Sep/09 21:59 A couple of questions/comments on the patch:
1. Why do we need to synchronize in add? Who else is accessing the bag, since it is no longer managed by the spillable manager?
2. Memory fraction should be a Java property so that users can control it if they so choose.
3. Why do we have a limit of only 100 tuples in memory, since we already have a memory limit? Also, if we do need it, shouldn't it be configurable?

Ying He
added a comment - 24/Sep/09 22:46 Answer to Olga's questions:
1. The synchronization can be removed.
2. Memory fraction is configurable. The property name is pig.cachedbag.memusage, and the default value is 0.5.
3. The first 100 tuples are used to calculate tuple size in memory, to determine how many tuples can fit into the configured memusage. It is not the number of tuples kept in memory.
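
The arithmetic in point 3 might look roughly like the sketch below. Only the property name pig.cachedbag.memusage, its default of 0.5, and the sample size of 100 come from this thread. The class and method names are hypothetical, and treating the fraction as a share of the maximum heap is an assumption, not something confirmed here.

```java
import java.util.Properties;

// Hypothetical sketch of the sizing arithmetic; not the patch's actual code.
final class MemoryLimitSketch {
    // Number of leading tuples sampled to estimate the average tuple size (from the thread).
    static final int SAMPLE_SIZE = 100;

    // Reads the memory fraction, falling back to the default of 0.5 mentioned in the thread.
    static double memUsage(Properties props) {
        return Double.parseDouble(props.getProperty("pig.cachedbag.memusage", "0.5"));
    }

    // Estimates how many tuples fit in memory:
    //   budget = maxHeapBytes * memUsage   (assumption: the fraction applies to max heap)
    //   limit  = budget / average sampled tuple size
    static long tupleLimit(long maxHeapBytes, double memUsage,
                           long sampledBytes, int sampledTuples) {
        if (sampledTuples <= 0 || sampledBytes <= 0) {
            throw new IllegalArgumentException("need a non-empty sample");
        }
        if (memUsage <= 0.0 || memUsage > 1.0) {
            throw new IllegalArgumentException("memUsage must be in (0, 1]");
        }
        long budgetBytes = (long) (maxHeapBytes * memUsage);
        long avgTupleBytes = Math.max(1L, sampledBytes / sampledTuples);
        return budgetBytes / avgTupleBytes;
    }
}
```

For example, with a 1,000,000-byte heap, a fraction of 0.5, and 100 sampled tuples totalling 20,000 bytes, the average tuple is 200 bytes and the limit works out to 2,500 tuples. In a running JVM the heap figure could come from Runtime.getRuntime().maxMemory().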

Pradeep Kamath
added a comment - 25/Sep/09 17:40 I think it might be a good idea to have a config parameter (maybe a java -D property) that allows users to choose between spillableBagForReduce and NonSpillableBagForReduce, with the non-spillable one being the default. This way, if for some reason users find the spillable bag better for their query, they can use it.
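
A switch of this kind could be sketched as follows. The property key example.reduce.useSpillableBag and the names BagSwitchSketch, BagKind and chooseBag are hypothetical and chosen only for illustration; the thread does not name the actual property.

```java
import java.util.Properties;

// Hypothetical sketch of a -D style switch; the key below is NOT a real Pig property.
final class BagSwitchSketch {
    static final String USE_SPILLABLE_KEY = "example.reduce.useSpillableBag";

    enum BagKind { SPILLABLE, NON_SPILLABLE }

    // The non-spillable bag is the default; the spillable one is used only if requested.
    static BagKind chooseBag(Properties props) {
        boolean useSpillable =
                Boolean.parseBoolean(props.getProperty(USE_SPILLABLE_KEY, "false"));
        return useSpillable ? BagKind.SPILLABLE : BagKind.NON_SPILLABLE;
    }
}
```

Calling chooseBag(System.getProperties()) would pick up a value passed on the command line as -Dexample.reduce.useSpillableBag=true.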

Olga Natkovich
added a comment - 25/Sep/09 18:43 Ying, what Pradeep is asking for is more like a safety switch - to give users a way to go back to the old implementation if they run into problems with the new one. Once we verify that the new code is as stable as the old, we would remove the switch. We would also not expose it to users unless they do run into trouble.