YARN – What’s the Big Deal?

Since the partnership between Hortonworks and SAS, we have created some awesome assets (e.g., the SAS Data Loader sandbox tutorial, educational webinars, and an array of blogs) that have given Hadoop and Big Data enthusiasts hands-on training with Apache Hadoop and SAS’ powerful analytics solutions. You can find more details about our partnership and resources here: https://hortonworks.com/partner/sas

To continue the momentum, we have Paul Kent, Vice President of Big Data at SAS, share his insights on the value of YARN and the benefits it brings to SAS and its users – this time around SAS Grid and YARN.

On my travels and in the SAS Executive Briefing Center, it has become obvious that many folks have grabbed on to the idea that Hadoop will allow them to do two things:

1. to assemble a copy of all their data in one place

2. to provide enough processing horsepower to actually make some sense (business value) of the patterns contained in a holistic view of said data

As they get closer to this goal, they realize what a valuable resource the data lake has become. They need an effective means to “share nicely” – it’s not likely that every department is going to have the resources to establish their own data lake, and even if they do, you’ll be back to arguing about which version of the truth is the correct one.

YARN is the component in the Hadoop ecosystem that helps folks share the value gained from building a shared pool of the organization’s data.

Move the work to the Data

As data volumes and velocities grow, it has become important to find a strategy that minimizes the number of hard (permanent) copies of data (and the reconciliation and governance burden they bring). YARN allows Hadoop to become “the Operating System for your data” – a tool that manages and mediates access to the shared pool of data, as well as the resources to manipulate that pool.

YARN allows the various patterns of work destined for your cluster to form orderly and rational queues, so that you can set the policy for what is urgent, what is important, what is routine, and what should be allowed to soak up resources so long as no one else requires them at the moment.
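As a concrete illustration, YARN’s Capacity Scheduler lets an administrator express exactly this kind of policy in configuration. The fragment below is a minimal sketch of a capacity-scheduler.xml; the queue names (urgent, routine, batch) are hypothetical, chosen to mirror the urgency tiers described above:

```xml
<!-- capacity-scheduler.xml (illustrative fragment; queue names are hypothetical) -->

<!-- Define three queues under the root queue -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>urgent,routine,batch</value>
</property>

<!-- Guaranteed shares of the cluster, in percent (must sum to 100) -->
<property>
  <name>yarn.scheduler.capacity.root.urgent.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.routine.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.batch.capacity</name>
  <value>20</value>
</property>

<!-- Let low-priority batch work soak up idle capacity
     when no one else requires it at the moment -->
<property>
  <name>yarn.scheduler.capacity.root.batch.maximum-capacity</name>
  <value>100</value>
</property>
```

With elasticity configured this way, batch work can expand well beyond its 20% guarantee when the cluster is quiet, and the scheduler reclaims that capacity for the urgent queue as new high-priority work arrives.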

Expand then Consolidate

Disruptive technologies like Hadoop are often deployed “at the fringes” of an organization (perhaps in an Innovation Lab). Initial ROI is often found by attacking new ground – problems the organization had not attempted to handle (or handle at scale) before. When these early projects succeed, I’ve seen many customers ask themselves, “Well, that worked OK; is there some way to consolidate the older ways of doing things into this new world?” – simplifying and modernizing their Analytics Landscape as a delightful side effect!

In reality, the blue box for “SAS” in the diagram above represents a few distinct patterns of work for the Hadoop cluster:

1. Long-running server (sometimes called daemon) processes. The SAS LASR server is purpose-built to load your important data into distributed memory and provide low-latency actions that service requests against that data rapidly.

2. Resource-intensive single-user tasks that require distributed computing. Each invocation of a SAS HPA procedure (to build a regression model, to train a neural network, or to determine a decision tree) needs memory and CPU cycles from several cluster nodes to perform its task, and those resources are returned to the pool immediately after the task completes.

3. Traditional grid computing, where jobs from many users are distributed over several servers to improve availability and, ultimately, response times. This is not distributed (massively parallel) computing in the sense of several computers attacking one problem, but it is a form of load sharing where several computers attack the tasks of several users in a divide-and-conquer style.

The first two patterns above are examples of new-world distributed computing. The third is an example of using the newer infrastructure to replace (at a lower cost) the hardware used for a previous-generation Analytics Landscape. Also, SAS Grid Manager is the only product to provide horizontal scaling of an application where some parts of the application need to operate on all of the data, such as a Monte Carlo simulation. The “cherry on top” is that you can combine these technologies such that a single SAS Grid job running on a Hadoop data node could kick off an HPA job that distributes processing to each node to operate on its local data.

I asked Cheryl Doninger, who leads the development for SAS Grid Manager, why customers should be excited about this new flavor of SAS Grid Manager and she said – “SAS Grid Manager for Hadoop is a perfect fit for our customers who have, or plan to implement in the near future, a multi-application data operating system, as described by Arun. Now they can co-locate all of their SAS Grid jobs on the Hadoop cluster and manage them with YARN along with any other analysis being done on the cluster. The SAS Grid jobs can leverage any of the SAS integration points to Hadoop to maximize the value of this shared pool of data and all through direct integration with YARN or by leveraging other components of the Hadoop ecosystem that are natively managed by YARN.”

All this effort was the result of a tightly integrated joint engineering collaboration with Hortonworks and the Apache YARN team, including committer Arun Murthy.
