Have end to end tests based on MiniMRCluster to verify correct behaviour of slot reclamation by queues.

Details

Type: Bug

Status: Resolved

Priority: Major

Resolution:
Incomplete

Affects Version/s:
None

Fix Version/s:
None

Component/s:
None

Labels:

None

Description

We should have a test that submits long-running jobs to different queues one after the other and verifies that, within the specified amount of time, each queue either gets its required capacity or gets back the capacity taken from it once tasks are killed.
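A test of this kind needs some way to assert that a condition becomes true "within the specified amount of time". The following is only a minimal, standalone sketch of such a polling helper. The class name `WaitForSketch`, the method `waitFor` and its parameters are hypothetical and are not part of Hadoop or MiniMRCluster. A real test would pass in a condition that inspects the queue's running-task count.

```java
import java.util.function.BooleanSupplier;

// Hypothetical helper, not Hadoop code: polls a condition until it holds
// or a deadline passes.
class WaitForSketch {

    // Returns true if the condition became true before the timeout expired,
    // false on timeout or if the waiting thread was interrupted.
    static boolean waitFor(BooleanSupplier condition, long timeoutMs, long pollMs) {
        final long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            // Subtraction keeps the comparison safe if nanoTime wraps around.
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args) {
        final int[] polls = {0};
        // Stand-in condition: becomes true on the third poll.
        boolean reached = waitFor(() -> ++polls[0] >= 3, 2000, 10);
        System.out.println("condition reached in time: " + reached);
    }
}
```

In an end-to-end test, the timeout would presumably be derived from the configured reclaim time limit of the queue under test, with some slack for heartbeat delays.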

Issue Links

is related to

HADOOP-4830: Have end to end tests based on MiniMRCluster to verify that queue capacities are honoured.

Activity

Vinod Kumar Vavilapalli
added a comment - 21/Jan/09 13:32
While testing, I came across the following problem with the ReclaimCapacity functionality.
When the reclaim-capacity interval is small enough (1 or 2 seconds; the default is 5), I see many occurrences of the exception below in the log. The exception is fatal for the current iteration of capacity reclamation. The cause is that a task's TaskStatus is populated only once a TaskTracker (TT) reports back that it has launched the task, but TaskSchedulingMgr.killTasksFromQueue has no null checks for TaskStatus, hence the error. With a larger reclaim interval the problem is not visible, because the TTs report back within that time and TaskStatus is never observed to be null.
09/01/21 12:14:35 ERROR mapred.CapacityTaskScheduler: Error in redistributing capacity:
java.lang.NullPointerException
at java.util.TreeMap.getEntry(TreeMap.java:341)
at java.util.TreeMap.get(TreeMap.java:272)
at org.apache.hadoop.mapred.TaskInProgress.killTask(TaskInProgress.java:741)
at org.apache.hadoop.mapred.CapacityTaskScheduler$MapSchedulingMgr.killTasksFromJob(CapacityTaskScheduler.java:878)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasksFromQueue(CapacityTaskScheduler.java:612)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.killTasks(CapacityTaskScheduler.java:594)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.reclaimCapacity(CapacityTaskScheduler.java:531)
at org.apache.hadoop.mapred.CapacityTaskScheduler$TaskSchedulingMgr.access$800(CapacityTaskScheduler.java:362)
at org.apache.hadoop.mapred.CapacityTaskScheduler.reclaimCapacity(CapacityTaskScheduler.java:1216)
at org.apache.hadoop.mapred.CapacityTaskScheduler$ReclaimCapacity.run(CapacityTaskScheduler.java:1001)
at java.lang.Thread.run(Thread.java:636)
Inserting null checks prevents this.
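
The kind of guard meant here can be illustrated with a small standalone sketch. It is not the actual CapacityTaskScheduler or TaskInProgress code: the class `NullCheckSketch`, the `Status` type and all method names are hypothetical stand-ins. It relies only on the documented behaviour that `TreeMap.get(null)` throws a NullPointerException when the map uses natural ordering, which matches the top frames of the stack trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical illustration, not Hadoop code.
class NullCheckSketch {

    // Stand-in for a task status; it exists only after the tracker reports back.
    static final class Status {
        final String attemptId;
        Status(String attemptId) { this.attemptId = attemptId; }
    }

    // Running attempts keyed by attempt id (natural String ordering).
    private final TreeMap<String, Boolean> running = new TreeMap<>();

    void register(String attemptId) {
        running.put(attemptId, Boolean.TRUE);
    }

    // Unguarded lookup: throws NullPointerException if attemptId is null.
    boolean kill(String attemptId) {
        if (running.get(attemptId) == null) {
            return false; // unknown attempt, nothing to kill
        }
        running.remove(attemptId);
        return true;
    }

    // Guarded loop: tasks whose status has not been reported yet are skipped
    // instead of aborting the whole pass.
    int killReported(List<Status> statuses) {
        int killed = 0;
        for (Status s : statuses) {
            if (s == null || s.attemptId == null) {
                continue; // not launched as far as we know; try again next pass
            }
            if (kill(s.attemptId)) {
                killed++;
            }
        }
        return killed;
    }

    public static void main(String[] args) {
        NullCheckSketch sketch = new NullCheckSketch();
        sketch.register("attempt_1");
        List<Status> statuses = new ArrayList<>();
        statuses.add(null);                    // tracker has not reported yet
        statuses.add(new Status("attempt_1")); // reported and running
        System.out.println("killed: " + sketch.killReported(statuses));
    }
}
```

Skipping unreported tasks means a pass may kill fewer tasks than requested. In this sketch the shortfall is simply left for the next reclaim iteration, which is an assumption about the desired behaviour rather than something stated in this issue.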