Details

Description

While profiling tasktracker with Sort benchmark, it was observed that threads block on LocalDirAllocator.getLocalPathToRead() in order to get the index file and temporary map output file.

As LocalDirAllocator is tied up with ServetContext, only one instance would be available per tasktracker httpserver. Given the jobid & mapid, LocalDirAllocator retrieves index file path and temporary map output file path. getLocalPathToRead() is internally synchronized.

Introducing a LRUCache for this lookup reduces the contention heavily (LRUCache with key =jobid +mapid and value=PATH to the file). Size of the LRUCache can be varied based on the environment and I observed a throughput improvement in the order of 4-7% with the introduction of LRUCache.

I'd like to understand what part of LocalDirAllocator.getLocalPathToRead is expensive... it's fine to add a cache, but it's better to do it after we understand why we really need it.

This patch results in the code path skipping the sanity checks in LocalDirAllocator.confChanged which is called by LocalDirAllocator.getLocalPathToRead. That is a concern. Again, this might be the expensive part of LocalDirAllocator.getLocalPathToRead, but we need to ensure that.

Don't get me wrong, the focus of this jira is very useful - we just need to fix it the 'right' way.

Arun C Murthy
added a comment - 08/Sep/10 17:46 Couple of concerns:
I'd like to understand what part of LocalDirAllocator.getLocalPathToRead is expensive... it's fine to add a cache, but it's better to do it after we understand why we really need it.
This patch results in the code path skipping the sanity checks in LocalDirAllocator.confChanged which is called by LocalDirAllocator.getLocalPathToRead. That is a concern. Again, this might be the expensive part of LocalDirAllocator.getLocalPathToRead, but we need to ensure that.
Don't get me wrong, the focus of this jira is very useful - we just need to fix it the 'right' way.

1. For #1, I would post the profiler output of which methods are expensive in getLocalPathToRead().

2. For #2, the code path for LocalDirAllocator.confChanged() need not be called in this context of TaskTracker.

Reason: In this context, TaskTracker is trying to check for any config changes related to "mapred.local.dir" using LocalDirAllocator. Once its read, this parameter does not change over TaskTracker's lifetime. Hence, it is not mandatory to do this check for every invocation. Corner case: When tasktracker goes down and new configs are reloaded, the LRUCache would also be repopulated.

Rajesh Balamohan
added a comment - 13/Sep/10 04:19 Thanks for the review comments Arun.
1. For #1, I would post the profiler output of which methods are expensive in getLocalPathToRead().
2. For #2, the code path for LocalDirAllocator.confChanged() need not be called in this context of TaskTracker.
Reason: In this context, TaskTracker is trying to check for any config changes related to "mapred.local.dir" using LocalDirAllocator. Once its read, this parameter does not change over TaskTracker's lifetime. Hence, it is not mandatory to do this check for every invocation. Corner case: When tasktracker goes down and new configs are reloaded, the LRUCache would also be repopulated.

CPU profiler output of a TaskTracker "without" the patch. Profiler output shows that ~4% of time being spent on internal methods of LocalDirAllocator.*.getLocalPathToRead(). Rest of them time is spent on the method itself. This along with the monitor profiling shows synchonization to be the bottleneck in getLocalPathToRead().

Rajesh Balamohan
added a comment - 13/Sep/10 06:28 CPU profiler output of a TaskTracker "without" the patch. Profiler output shows that ~4% of time being spent on internal methods of LocalDirAllocator.*.getLocalPathToRead(). Rest of them time is spent on the method itself. This along with the monitor profiling shows synchonization to be the bottleneck in getLocalPathToRead().

I have written a job with one map which output 1M data, and 100 reduces. Each reduce spawn 10 threads to fetch data from map side 3k times just like shuffle phase. When run this job, most of work threads is blocked on AllocatorPerContext.

With LRUCache, most work threads are blocked on LOG.info() as following stack.

Liyin Liang
added a comment - 15/Jun/11 05:53 This is a great patch. Here is part of the stack when work thread is blocked:
"1797055149@qtp0-98" prio=10 tid=0x0000002aa1a40000 nid=0x333 waiting for monitor entry [0x0000000049dc5000]
java.lang. Thread .State: BLOCKED (on object monitor)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:377)
- waiting to lock <0x00000000a0000090> (a org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext)
at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:142)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3086)
I have written a job with one map which output 1M data, and 100 reduces. Each reduce spawn 10 threads to fetch data from map side 3k times just like shuffle phase. When run this job, most of work threads is blocked on AllocatorPerContext.
With LRUCache, most work threads are blocked on LOG.info() as following stack.
"1793911889@qtp0-101" prio=10 tid=0x0000002aa1530000 nid=0x34f2 waiting for monitor entry [0x0000000041d45000]
java.lang. Thread .State: BLOCKED (on object monitor)
at org.apache.log4j.Category.callAppenders(Category.java:204)
- waiting to lock <0x00000000a01be928> (a org.apache.log4j.spi.RootLogger)
at org.apache.log4j.Category.forcedLog(Category.java:391)
at org.apache.log4j.Category.log(Category.java:856)
at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:133)
at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3246)
With LRUCache + disable LOG.info(): This job takes 3mins, 19sec to run.
Without LRUCache + enable LOG.info(): This job takes just 37sec to run.
b.t.w LRUCache should use mapId as key instead of (jobId + mapId) . Because jobId is just part of mapId.