Troubleshooting YARN/MR issues on HDInsight clusters

HDInsight Hadoop clusters, beginning with version 3.x, ship with YARN, the next generation of the Hadoop MapReduce framework. YARN is a generic resource-arbitration framework that acts as the computational layer for your distributed applications. To read more about YARN, please refer to the Apache Hadoop documentation

Customers sometimes face issues while running either plain MR jobs or their custom YARN applications and are not sure whether the issue is in their application or in the system. In this post, I go over a few useful resources that help with preliminary root-cause analysis.

Here’s how this post is organized; you may skip sections you are already familiar with:

YARN concepts quick refresher

YARN application lifecycle summary

Artifacts useful for troubleshooting

Configuration files

Web UI

JobHistory server

Application timeline server (ATS)

ApplicationMaster & container logs

ResourceManager logs

NodeManager logs

YARN concepts quick refresher

ResourceManager (RM): The master service responsible for scheduling and allocating cluster resources to applications. This used to be one of the responsibilities of the erstwhile JobTracker.

Containers: An abstraction for physical resources on the cluster - memory and cores.

NodeManager (NM): The per-node service responsible for lifecycle management of application containers.

Application Master (AM): Responsible for application state management, requesting resources from the RM, and monitoring the application. This is container 0 of the application.

JobHistory server: A web UI for browsing finished MR applications.

YARN application lifecycle summary

The following diagram gives a simplified view of the state changes an application goes through during its lifetime and the events that trigger them. Understanding this should help narrow down the root cause when an application misbehaves

State transition diagram during application's lifetime

Client sends an application request to RM and gets a new application id in response

Client then submits the application with metadata such as the queue to which it should be submitted, the priority of the app, max app attempts, the resources required by the AM, etc. The RM validates the AM's resource requirements and moves the application to the submitted state.

The RM then validates the queue in which the AM is to be launched:

If the queue exists and the user has permission to use it, the application is accepted

Otherwise, the app is rejected and moves to the failed state.

Once accepted, a new attempt for the AM is created

A container is allocated for the attempt and launched on one of the NMs. If the AM launches successfully, it registers with the RM, moving the application to the running state

If the AM either fails to launch or crashes before registering with the RM, the app moves to the failed state

Once the AM is running, it negotiates resources from the RM as the need arises and coordinates with the NMs to launch containers, e.g. the containers required to run map/reduce tasks in the case of the MR AM

On completion of the app, the AM unregisters from the RM, moving it to the finished state. Note that finished does not necessarily mean the app ran successfully; e.g., an MR application could reach the finished state and still have failed according to MR semantics

During its lifetime, the AM is also monitored by the NM for its resource usage, like any other container, and is killed if it goes beyond what it requested. Depending on the retry policy set by the app during submission, the app is retried by spawning new attempts.
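The finished-but-failed distinction above can be checked from the command line with yarn application -status. The sketch below parses a sample of that command's output rather than hitting a live cluster (the State/Final-State field names match the Hadoop 2.x CLI; the values shown are illustrative):

```shell
# On a cluster you would run:
#   yarn application -status <appId>
# and read its output; the sample below stands in for that output.
status_output='State : FINISHED
Final-State : FAILED'

# FINISHED + FAILED means the app completed its YARN lifecycle but
# failed by its own (e.g. MapReduce) semantics.
state=$(printf '%s\n' "$status_output" | awk -F' : ' '/^State/ {print $2}')
final=$(printf '%s\n' "$status_output" | awk -F' : ' '/^Final-State/ {print $2}')
echo "state=$state final=$final"
```

When state is FINISHED, always check Final-State before concluding the job succeeded.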

Artifacts useful for troubleshooting

Configuration files

All the configuration data that YARN/MR depends on is set in the following files:

core-site.xml (default file system)

yarn-site.xml (YARN configs)

mapred-site.xml (MapReduce configs)

On HDInsight Windows clusters, they can be found under %HADOOP_HOME%/etc/conf directory

On HDInsight Linux clusters, they can be found under /etc/hadoop/conf directory
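When you just need one property value, a quick grep/sed over the config file is enough. This sketch works on a small sample file standing in for /etc/hadoop/conf/yarn-site.xml (the headnode0:8088 value is made up for illustration):

```shell
# Sample stand-in for /etc/hadoop/conf/yarn-site.xml
cat > /tmp/yarn-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>headnode0:8088</value>
  </property>
</configuration>
EOF

# Grab the <value> on the line following the matching <name>
rm_webapp=$(grep -A1 'yarn.resourcemanager.webapp.address' /tmp/yarn-site-sample.xml \
  | sed -n 's:.*<value>\(.*\)</value>.*:\1:p')
echo "$rm_webapp"
```

On a real cluster, point the grep at the actual conf directory instead of the sample file.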

Web UI

This is the web endpoint for the RM; it has a wealth of useful information about cluster state, application state, and links to the tracking URL for each application. The web UI is hosted at the address configured for the property yarn.resourcemanager.webapp.address in yarn-site.xml.

For HDInsight Windows clusters, a shortcut to this is installed on the desktop and can be accessed over RDP.

Shortcut to Yarn Web UI on Windows

The link can also be accessed from outside the cluster (for clusters with version > x.y.z.660) at: https://<clusterDnsName>/Home/YarnExtUI
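The same information shown in the RM web UI is also exposed over the RM's REST API under /ws/v1/cluster. The sketch below parses a trimmed sample response instead of calling a live endpoint (the host name and application id are placeholders):

```shell
# On a cluster, something like:
#   curl -s "http://<rm-host>:8088/ws/v1/cluster/apps?states=RUNNING"
# returns JSON describing the running apps. A trimmed sample:
response='{"apps":{"app":[{"id":"application_1466000000000_0001","state":"RUNNING"}]}}'

# Pull out the application ids from the response
ids=$(printf '%s\n' "$response" | grep -o 'application_[0-9]*_[0-9]*')
echo "$ids"
```

This is handy for scripting checks that would otherwise require clicking through the web UI.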

JobHistory server

This web UI lets you browse completed MR applications. It is hosted at the address configured for the property mapreduce.jobhistory.webapp.address in mapred-site.xml. You can also reach it by clicking on the tracking URL of a finished MR application in the RM web UI

Application timeline server (ATS)

This web UI lets you browse all completed YARN applications. It is hosted at the address configured for the property yarn.timeline-service.webapp.address in yarn-site.xml. You can read more about ATS here

AM & container logs

AM logs are useful for troubleshooting failures due to application semantics.

For an application in the running state, these logs can be accessed through the web UI. To get to the application master UI, click on the "Tracking UI" link as highlighted in the screenshot below.

Yarn Web UI highlighting link to AM tracking UI

In this example, I am running an MR application, so clicking on the AM link takes you to a UI that looks similar to the JobTracker UI, as shown in the screenshot.

AM UI for MapReduce application

Clicking on the highlighted link opens the UI for the MR AM for that job. Clicking on "logs" opens the AM logs for the first attempt. If the AM has failed, multiple AM attempts would be listed here. You can also click on the links next to Maps/Reduces to access the corresponding task logs

MR Job UI highlighting AM logs and Map task logs

For an MR application that has failed, the application state in the web UI is finished while the final status is set to failed. Since the application is no longer running, the tracking URL now points to the JobHistory UI

RM Web UI highlighting a failed (MR) application

On navigating to the JobHistory UI, we see that the application failed because its reduce task attempts failed. Clicking on the count opens a page that summarizes why the task attempts failed

MR job UI highlighting links to failed task logs

The "Note" field has a summary of why the reduce task attempt failed; in this case, the task tried to execute "wc.exe" and failed because it could not find the file.
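The same container logs can also be pulled from the command line once the application has finished, provided log aggregation is enabled on the cluster. The application id below is a placeholder; substitute the real id from the RM web UI:

```shell
# Dump logs of every container (including the AM's) for a finished
# app; requires yarn.log-aggregation-enable=true on the cluster.
yarn logs -applicationId application_1466000000000_0001
```

This is often faster than navigating the web UI when you already know the application id, and the output can be piped through grep.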

ResourceManager logs

RM logs contain information on RM health, application states and their transitions, container requests from applications, and application master access traces

This data is useful to check:

Whether the RM is up, in cases where a client is unable to connect to it, e.g. REST API requests failing with 5xx errors

The lifecycle of an application, for troubleshooting issues where the application does not reach the running state

Any queue-related warning messages, when an application is not getting resources

You can use the application id allotted by the RM as the key to search for all the pertinent information about the application. Below is a snapshot from the RM logs for an application that finished gracefully

RM logs for an application that finished successfully

This is how the logs look for an application attempt that failed to launch

RM logs for an application attempt that failed
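Searching the RM log by application id can be sketched as follows. The two log lines here are illustrative stand-ins for real RM log entries (the id, timestamps, and message text are made up); on a cluster you would grep the actual log file under the paths given below:

```shell
# Stand-in for a slice of the RM log
rm_log='2016-06-15 10:00:01 RMAppImpl: application_1466000000000_0001 State change from ACCEPTED to RUNNING
2016-06-15 10:05:42 RMAppImpl: application_1466000000000_0001 State change from RUNNING to FINISHED'

# Show the state transitions for one application and count the hits
printf '%s\n' "$rm_log" | grep 'application_1466000000000_0001'
transitions=$(printf '%s\n' "$rm_log" | grep -c 'application_1466000000000_0001')
echo "matched $transitions lines"
```

Grepping for the state-change lines in chronological order usually pinpoints where in the lifecycle the application got stuck or failed.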

On HDInsight Windows cluster, these logs can be found under %HADOOP_HOME%\logs when logged into headnode.

On HDInsight Linux clusters, they can be found under /var/log/hadoop-yarn/yarn. They can also be accessed via the Ambari UI

NodeManager logs

NM logs contain information about the containers launched on that node. This data is useful to check:

Whether the NM is up, in cases where the total available nodes/resources are lower than expected

Whether a container assigned to the node was localized successfully - useful for troubleshooting when a container does not reach the running state

A container's resource usage

You can use the numeric part of the application id allotted by the RM as the key to search for all the pertinent information about the containers used by the application on that node.

NodeManager logs for an AM container launched on it

On HDInsight Windows clusters, these logs can be found under %HADOOP_HOME%\logs when logged into the node. On HDInsight Linux clusters, they can be found under /var/log/hadoop-yarn/yarn. They can also be accessed via the Ambari UI
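Deriving that numeric search key from an application id can be sketched as follows (the id is a placeholder; container ids embed the same numeric part, as in container_<numericPart>_<attempt>_<containerSeq>):

```shell
# Strip the "application_" prefix to get the key the NM logs use
appId='application_1466000000000_0001'
key=${appId#application_}
echo "$key"

# Container ids for this app share the same numeric part
echo "container_${key}_01_000001"
```

Grepping the NM log for this key surfaces localization, launch, and resource-monitoring messages for every container the application ran on that node.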

Conclusion

Hopefully, the post above shed some light on where to look for logs when an application misbehaves. In the next series of posts, I will cover specific examples of pathological cases and provide a step-by-step guide to troubleshooting them