NodeManagers crashing often - oom

In my dev cluster there are 2 NodeManagers. They have been crashing often over the past few weeks because of memory issues. As a temporary workaround I increased the heap size of the NodeManager process (from 512 MB to 6 GB as of now). Two days ago, a NodeManager could not even start with 4 GB after a crash; it only came up after I raised the heap to 6 GB. The graph below, taken from CM, shows heap usage across the nodes (jvm_heap_used_mb_across_nodemanagers metric).

Re: NodeManagers crashing often - oom

I think I'm also running into this problem. I found my NodeManagers were occasionally being sent SIGKILL by Cloudera's killparent.sh script, which is run when the NM hits an OutOfMemoryError. In Cloudera Manager I don't see JVM memory usage trending up, so it's a bit of a mystery why the NM suddenly runs out of memory when a second before it was well below the limit.

Re: NodeManagers crashing often - oom

In my case, yarn.nodemanager.delete.debug-delay-sec had been configured with a very high number (100+ days). Because of that, a lot of tasks were scheduled for deletion, but the actual deletion would only happen after that many days. Until then, the info for all of those tasks stays in the NodeManager and is not cleared, so it consumes a lot of memory.
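
If anyone wants to check their own config for the same thing, here is a rough sketch in Python that reads a yarn-site.xml and flags a large delay. The one-day threshold, the function names and the sample XML are my own assumptions for illustration, not anything from CM or YARN; the property name and its default of 0 are the standard YARN ones.

```python
import xml.etree.ElementTree as ET

PROPERTY = "yarn.nodemanager.delete.debug-delay-sec"
# Assumed threshold for "too high": one day. Pick whatever suits your cluster.
MAX_REASONABLE_SEC = 24 * 60 * 60


def debug_delay_seconds(yarn_site_xml: str) -> int:
    """Return the configured delay in seconds, or 0 (the YARN default) if unset."""
    root = ET.fromstring(yarn_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == PROPERTY:
            value = (prop.findtext("value") or "0").strip()
            return int(value)
    return 0


def is_suspicious(yarn_site_xml: str, limit: int = MAX_REASONABLE_SEC) -> bool:
    """True when the deletion delay is longer than the chosen limit."""
    return debug_delay_seconds(yarn_site_xml) > limit


# Hypothetical sample: 8640000 seconds is 100 days, similar to my case.
SAMPLE = """<configuration>
  <property>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>8640000</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(debug_delay_seconds(SAMPLE), is_suspicious(SAMPLE))
```

On a CM-managed cluster the effective yarn-site.xml is generated per process, so you would need to point this at the file the NodeManager actually runs with. Setting the value back to something small (or 0) should let the pending deletions go through.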