The Future of Apache Hadoop

The world’s top authorities on Apache Hadoop convene at Hadoop Summit San Jose, and one of the top questions they will answer concerns the future and direction of Hadoop. Sanjay Radia, Founder and Architect at Hortonworks, led the track, which selected 13 sessions on this topic. I asked Sanjay what he hoped these sessions would cover:

“Hadoop continues to drive innovation at a rapid pace and the next generation of Hadoop is being built today. This track showcases new developments in core Hadoop and closely related technologies. Attendees will hear about key projects, such as HDFS and YARN, projects in incubation and the industry initiatives driving innovation in and around the Hadoop platform. Attendees will interact with technical leads, committers, and expert users who are actively driving the roadmaps, key features, and advanced technology research around what is coming next for the Hadoop ecosystem.”

I asked Sanjay which 3 sessions he would put in the can’t-miss category for someone pressed for time. It took some arm twisting, but these are the top 3 sessions he would recommend:

Apache Hive 2.0: SQL, Speed, Scale

Speaker: Alan Gates from Hortonworks

Apache Hive is the most commonly used SQL interface for Hadoop. One of its most frequent uses is data warehousing applications. To meet customer warehousing requirements it is important that it scale to petabytes of data, provide the SQL that users need, and perform in interactive time. The Hive community is working towards a 2.0 release of Hive that includes significant new features and performance improvements. These include:

* Adding LLAP, a daemon layer that enables sub-second response time.
* Adding HBase as an option to store Hive’s metadata, resulting in faster metadata access and reduced query planning time.
* Improving Hive’s support for ingesting data at high speed from streaming inputs such as Apache Flume and Apache Storm.
* Improving and expanding Hive’s support for managing changing data in a transactionally consistent way by adding the SQL MERGE command.
* Laying the groundwork through Apache Calcite to enable Hive to use multiple storage engines (e.g. HBase).

This talk will cover the use cases these changes enable, the architectural changes being made in Hive as part of building these features, and share performance test results on how these improvements are speeding up Hive.
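To make the MERGE item above concrete, here is a minimal Python sketch of what a MERGE-style upsert does to a table: rows from a batch of changes update the target where keys match and are inserted where they do not. This is not Hive code or Hive's implementation; `merge_rows` and the sample tables are hypothetical names made up for illustration.

```python
# Hypothetical illustration of MERGE semantics; not Hive's implementation.
def merge_rows(target, source, key):
    """Return a new table with source rows merged into target by key.

    The inputs are left untouched, so callers see either the old table
    or the fully merged one, never a half-applied batch.
    """
    merged = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in merged:
            merged[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            merged[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(merged.values())


customers = [{"id": 1, "city": "San Jose"}, {"id": 2, "city": "Munich"}]
changes = [{"id": 2, "city": "Dublin"}, {"id": 3, "city": "Tokyo"}]
result = merge_rows(customers, changes, "id")
```

In this sketch, customer 2 is updated, customer 3 is inserted, and customer 1 is carried over unchanged.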

A multi-colored YARN: Apps and first-class support for services

Speaker: Vinod Kumar Vavilapalli from Hortonworks

Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive (Hive, Tez, Spark) and real-time processing (Storm). These applications can all co-exist on YARN and share a single data center in a cost-effective manner, with the platform worrying about resource management, isolation and multi-tenancy. There are more use-cases that deserve the same set of powerful platform features. In this talk, we’ll cover a new suite of use-cases that the YARN community is working towards: services. YARN as a technology has always had the right foundations to support a wide variety of applications and services. Support for bringing existing and new services to YARN deserves a fresh look, though. With this attention on making services simplified and first-class, we will walk through how Apache Hadoop YARN is morphing to support services well out of the box through various platform-level efforts. Businesses also increasingly care less about the infrastructure and more about how to drive the end-to-end use-cases. In this context, we will also discuss APIs, the tool-set, and how the new multi-colored YARN story empowers the developer community.

Evolving HDFS to a Generalized Distributed Storage Subsystem

Speakers: Sanjay Radia and Jitendra Pandey from Hortonworks

We are evolving HDFS to a distributed storage system that will support not just a distributed file system, but other storage services. We plan to evolve the Datanodes’ fault-tolerant block storage layer to a generalized subsystem over which to build other storage services such as HDFS and an object store. We introduce the abstraction of a storage-container that is replicated for reliability. The first two container types are Block-Container and Object-Container. A Block-Container is a collection of HDFS blocks replicated as a unit. It will allow block scalability with low block-report overhead while allowing co-location of related files. An Object-Container has a very large number of typically much smaller objects and is targeted towards an object-store service (like S3). We also plan on more structured storage containers, such as LSM-trees, to support HBase in a more first-class way. Our approach has several benefits. It allows the Datanode’s physical storage to be shared across different storage services without fragmentation. A storage container also isolates implementation and client protocols, allowing each container type to evolve independently. Further, container implementations can share common features such as replication, location-service and overall management of containers and their storage, including functions like decommissioning.
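The container abstraction described in this abstract can be pictured with a small Python sketch. It is only an illustration of the idea, not the HDFS design or API: every class, field and function name below is hypothetical, and reporting one entry per container is an assumption about how block-report overhead might be kept low.

```python
from dataclasses import dataclass, field


@dataclass
class StorageContainer:
    """Unit of replication shared by all container types (hypothetical)."""
    container_id: str
    replicas: list = field(default_factory=list)  # names of Datanodes holding a copy

    def replicate_to(self, datanode: str) -> None:
        # Replication logic lives in the base class, so every container type reuses it.
        if datanode not in self.replicas:
            self.replicas.append(datanode)


@dataclass
class BlockContainer(StorageContainer):
    """A collection of HDFS blocks replicated as one unit."""
    block_ids: list = field(default_factory=list)


@dataclass
class ObjectContainer(StorageContainer):
    """Many typically small objects, aimed at an object-store service."""
    objects: dict = field(default_factory=dict)


def container_report(containers):
    """Assumed report format: one entry per container instead of one per block."""
    return {c.container_id: len(c.replicas) for c in containers}


blocks = BlockContainer("bc-1", block_ids=["blk_1", "blk_2", "blk_3"])
objs = ObjectContainer("oc-1", objects={"logo.png": b"\x89PNG"})
for container in (blocks, objs):
    for node in ("dn1", "dn2", "dn3"):
        container.replicate_to(node)
report = container_report([blocks, objs])
```

Here the three blocks in `bc-1` produce a single report entry, and both container types pick up replication from the shared base class while keeping their own contents separate.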
