Brandon Hieb, Managing Partner at Bit Refinery, is our guest blogger. In this post, he provides insight into virtualizing Hadoop infrastructures.

Here at Bit Refinery we provide infrastructure for companies large and small, which includes a variety of big data applications running on both bare-metal and VMware servers. With this technology constantly changing, it’s hard to keep up with the different resources each application requires, which can range from a traditional Hadoop node to an in-memory application such as Apache Ignite. In this blog post, we hope to provide some valuable infrastructure guidelines based on our experience with our current big data customers.

Do:

Embrace redundancy, use commodity servers

Start small and stay focused

Monitor, monitor and monitor

Create a data integration process

Use compression

Build multiple environments (Dev, Test, Prod)

Don’t:

Mix master nodes with data nodes

Virtualize data nodes

Overbuild

Panic – Google is your friend

Let the Wild Wild West take over!

Do

Embrace redundancy, use commodity servers

We talk to many companies whose foray into Hadoop is started by business users or an analytics group within the company. The infrastructure folks are often brought in at a later date, and the majority do not have any training or knowledge of how Hadoop works. This usually leads to an overdesigned cluster that triples or even quadruples the budget. Hadoop was mainly created because the founders wanted a low-cost, redundant data store that would allow deep analysis of the data. This can be achieved using low-cost servers with JBOD (just a bunch of disks) and single power supplies. Companies such as Thinkmate (a SuperMicro reseller) sell ideal Hadoop nodes in the 5k range.

Start small and stay focused

We’ve all seen the statistics on how many projects fail in companies due to their level of complexity and expense. The beauty of Hadoop is that it allows you to start small and add nodes as you go. Choose a small project to get started, one that allows both development and infrastructure staff to become familiar with the inner workings of this new technology. They will be hooked in no time!

Monitor, monitor and monitor

Although Hadoop offers redundancy at the data level and the management level, there are lots of moving parts to be monitored. Hortonworks comes with Nagios, which is the leading open-source monitoring package available. By default, it monitors all the nodes and services in a cluster. With Nagios, it’s easy to add additional checks, such as disk health in each server.
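As a rough illustration of what an extra check might look like, here is a minimal sketch of a Nagios-style plugin written in Python. It only reports how full a disk is, which is a simple stand-in for a real disk health check. The exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention; the function names, thresholds and default path are assumptions made up for this example and are not part of any Hortonworks package.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify_usage(used_pct, warn_pct=80.0, crit_pct=90.0):
    """Map a disk-usage percentage to a Nagios status code.

    The 80/90 thresholds are illustrative defaults, not recommendations.
    """
    if used_pct >= crit_pct:
        return CRITICAL
    if used_pct >= warn_pct:
        return WARNING
    return OK


def check_disk(path, warn_pct=80.0, crit_pct=90.0):
    """Return (status_code, message) for the filesystem that holds `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot read {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    used_pct = 100.0 * usage.used / usage.total
    status = classify_usage(used_pct, warn_pct, crit_pct)
    return status, f"DISK {LABELS[status]} - {path} is {used_pct:.1f}% full"


def main(argv):
    """Print the one-line status and return the exit code.

    A real plugin script would end with: sys.exit(main(sys.argv))
    """
    # Hypothetical default; pass the data directory of the node as argv[1].
    target = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(target)
    print(message)
    return code
```

Nagios reads the first line of output and the exit code, so a script like this could be run once per data disk on each node.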

Create a data integration process

One of the best things about Hadoop is that it lets you populate it with data and define data structures at a later time. Getting data in and out is pretty easy with tools such as Sqoop and Flume, but creating a data integration process up front is essential. This includes defining layers such as staging and base, as well as naming standards and locations. Creating a wiki is a great way to keep proper documentation of data sources and where they live within the cluster.
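To make the idea of layers and naming standards concrete, here is a small Python sketch of a helper that builds a predictable directory for each load. The layer names come from the paragraph above; everything else (the `/data` root, the `load_date=` folder, the naming rule and the function name) is a hypothetical convention chosen for illustration, not a Hadoop requirement.

```python
import re
from datetime import date

# Layers mentioned in the post; a real process may define more.
LAYERS = ("staging", "base")

# Hypothetical naming standard: lowercase letters, digits and underscores,
# starting with a letter.
NAME_RULE = re.compile(r"[a-z][a-z0-9_]*")


def dataset_path(layer, source, dataset, load_date, root="/data"):
    """Build an HDFS-style directory path for one day's load of a dataset."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    for name in (source, dataset):
        if not NAME_RULE.fullmatch(name):
            raise ValueError(f"{name!r} does not follow the naming standard")
    return f"{root}/{layer}/{source}/{dataset}/load_date={load_date.isoformat()}"


# Example: where the CRM customer extract for 1 March 2015 would land.
print(dataset_path("staging", "crm", "customers", date(2015, 3, 1)))
# prints: /data/staging/crm/customers/load_date=2015-03-01
```

Routing every ingest job through one helper like this keeps locations consistent, and the same rules can be copied into the wiki as the written standard.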

Build multiple environments (Dev, Test, Prod)

Just like any other infrastructure project, we always advise our customers to build multiple environments. Not only is this a general best practice, but it is also important because of the nature of Hadoop. Each project within the Apache ecosystem is constantly changing, and having a non-production environment to test upgrades and new functionality is vital.

Don’t

Mix master nodes with data nodes

Master nodes and data nodes play two vastly different roles. Master nodes should be placed on servers that have fully redundant features such as RAID and multiple power supplies. They also play well in a virtual environment due to the extra redundancy this provides. Data nodes, on the other hand, are the workhorses of the cluster and need to be dedicated solely to that important function. Mixing these roles on the same server usually leads to unwanted results and issues.

Virtualize data nodes

We see companies out there touting “Hadoop in the Cloud” and providing clusters located on virtualized servers. Although locating master nodes on virtualized servers isn’t a bad idea, having virtualized servers act as data nodes is a no-no. The concept of Hadoop is bringing the processing to the data and not the other way around. Having data nodes all share the same storage infrastructure mostly nullifies the great benefits that Hadoop provides.

Overbuild

It’s easy to get carried away building your first cluster. The costs of hardware and software are low, but it’s important to build only to your initial requirements. You may find that the specifications of the servers you chose need to be altered based on the results and performance of your initial project. It’s easy to add, not as easy to take away.
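One way to keep the first build honest is to work backwards from the amount of data the initial project actually needs. The Python sketch below is a back-of-the-envelope estimate only. It assumes HDFS's default replication factor of 3; the 25% free-space headroom, the function name and the sample figures are assumptions made up for this example.

```python
import math


def nodes_needed(data_tb, disk_tb_per_node, replication=3, headroom=0.25):
    """Estimate how many data nodes are needed to hold `data_tb` of data.

    data_tb          -- size of the data before replication, in TB
    disk_tb_per_node -- raw disk capacity of one data node, in TB
    replication      -- HDFS replication factor (3 is the HDFS default)
    headroom         -- fraction of each node's disk kept free (assumed 25%)
    """
    if data_tb <= 0 or disk_tb_per_node <= 0:
        raise ValueError("sizes must be positive")
    if replication < 1 or not 0 <= headroom < 1:
        raise ValueError("invalid replication or headroom")
    raw_needed = data_tb * replication
    usable_per_node = disk_tb_per_node * (1 - headroom)
    by_capacity = math.ceil(raw_needed / usable_per_node)
    # Each replica of a block lives on a different node, so the cluster
    # needs at least `replication` data nodes regardless of capacity.
    return max(replication, by_capacity)


# Example: 10 TB of data on nodes with 12 TB of raw disk each.
print(nodes_needed(10, 12))  # prints: 4
```

If the initial project needs four nodes, start with four and add more once real usage shows which resource runs out first.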

Panic – Google is your friend

When things go wrong, don’t panic! Unlike commercial software, Hadoop is driven by the open source community, and there is a very good chance the problem you are having is just a quick Google search away. Luckily, within the last two years we’ve encountered fewer and fewer issues, and it’s amazing how mature the Hadoop ecosystem has become.

Let the Wild Wild West take over!

This is where the hype tramples on best practice. We’ve constantly heard that Hadoop is a great “data lake” where you can just put all of your data and deal with it later. This is very true, although just like any other data repository, you need to instill best practices, documentation and rules, or by the time you turn around you will have an out-of-control tsunami of data!

Summary

Hadoop is a great data platform that continues to add new features and functionality more quickly than any commercial software vendor would be able to. We have been doing this for almost two years now, and we are in a constant state of learning new tips and tricks to get the most out of our customers’ Hadoop clusters. Creating a common knowledge base and best practices ensures both users and management are able to quickly gain confidence in this new and exciting technology.

Comments

Good article except I think the “no-no” for virtual data nodes is an extreme statement. Well designed clusters on mature public cloud platforms can perform very well and satisfy most requirements.
Keep calm and Hadoop on !!
