Kuganesan, Srijeyanthan

Abstract [en]

In the last year, Hadoop YARN has become the de facto standard resource management platform for data-intensive applications, with support for a wide range of data analytics platforms such as Apache Spark, MapReduce V2, MPI, Apache Flink, and Apache Giraph. The YARN ResourceManager fulfills three main functions: it manages the set of active applications (Applications service), it schedules resources (CPU, memory) to applications (the FIFO/Capacity/Fair Scheduler), and it monitors the state of resources in the cluster (ResourceTracker service). Though YARN is more scalable and fault-tolerant than its predecessor, the JobTracker in MapReduce, its ResourceManager is still a single point of failure and a performance bottleneck due to its centralized architecture. The single-point-of-failure problem has been addressed in Hops-YARN, which provides multiple ResourceManagers (one active and the others on standby): the ResourceManager’s state is persisted to MySQL Cluster and can quickly be recovered by a standby ResourceManager if the active ResourceManager fails.

In large YARN clusters, with up to 4000 nodes, the ResourceTracker service handles over one thousand heartbeats per second from the nodes in the cluster (NodeManagers), and as such becomes a scalability bottleneck. Large clusters handle this by reducing the frequency of heartbeats from NodeManagers, but this comes at the cost of reduced interactivity for YARN (slower application startup times), as all communication from the ResourceManager to NodeManagers is sent in response to heartbeat messages. Since Hops-YARN still uses a centralized scheduler for all applications, distributing the ResourceTracker service across multiple nodes will reduce the number of heartbeat messages that each ResourceTracker needs to process, thus enabling both larger cluster sizes and lower latency for scheduling containers to applications. In this thesis, we scale out the ResourceTracker service by distributing it over the standby ResourceManagers using MySQL NDB Cluster event streaming. As such, the distributed resource management for YARN designed and developed in this project is a first step towards making the monolithic YARN ResourceManager scalable and more interactive.
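
The trade-off between heartbeat frequency and ResourceTracker load described above can be illustrated with a back-of-the-envelope calculation. The following is a minimal sketch, not part of the thesis: it assumes heartbeats are spread evenly over time and over ResourceTrackers, and the function names, the 3-second interval, and the capacity of 1000 heartbeats per second per tracker are hypothetical values chosen only for illustration.

```python
def heartbeats_per_second(num_nodes: int, interval_s: float) -> float:
    """Aggregate heartbeat rate when every NodeManager reports once per interval."""
    return num_nodes / interval_s


def per_tracker_load(num_nodes: int, interval_s: float, num_trackers: int) -> float:
    """Heartbeat rate seen by each ResourceTracker, assuming an even split (assumption)."""
    return heartbeats_per_second(num_nodes, interval_s) / num_trackers


def min_interval_s(num_nodes: int, num_trackers: int, capacity_per_tracker: float) -> float:
    """Shortest heartbeat interval that keeps every tracker at or below its capacity.

    capacity_per_tracker is a hypothetical limit in heartbeats per second.
    """
    return num_nodes / (num_trackers * capacity_per_tracker)


if __name__ == "__main__":
    nodes = 4000  # cluster size mentioned in the abstract
    # Illustrative 3 s interval: the aggregate rate exceeds one thousand per second.
    print(heartbeats_per_second(nodes, 3.0))
    # Same cluster with the load split across 4 trackers (hypothetical count).
    print(per_tracker_load(nodes, 3.0, 4))
    # Interval each configuration can sustain at a hypothetical 1000 heartbeats/s per tracker.
    for trackers in (1, 2, 4):
        print(trackers, min_interval_s(nodes, trackers, 1000.0))
```

Under these assumptions, adding ResourceTrackers either lowers the load on each tracker at a fixed heartbeat interval or allows a shorter interval at the same per-tracker load. The shorter interval is what improves interactivity, since the ResourceManager can only reach NodeManagers in heartbeat responses.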