Tag Archives: hadoop

Post navigation

I’ve been seeing great acceptance on the concept of ops Ready State. Technologists from both ops and dev immediately understand the need to “draw a line in the sand” between system prep and installation. We also admit that getting physical infrastructure to Ready State is largely taken for granted; however, it often takes multiple attempts to get it right and even small application changes can require a full system rebuild.

Since even small changes can redefine the ready state requirements, changing Ready State can feel like being told to tear down your house so you remodel the kitchen.

A friend asked me to explain “Ready State” in non-technical terms. So far, the best analogy that I’ve found is when a house is “Roughed In.” It’s helpful if you’ve ever been part of house construction but may not be universally accessible so I’ll explain.

Getting to Rough In means that all of the basic infrastructure of the house is in place but nothing is finished. The foundation is poured, the plumbing lines are placed, the electrical mains are ready, the roof on and the walls are up. The house is being built according to architectural plans and major decisions like how many rooms there are and the function of the rooms (bathroom, kitchen, great room, etc). For Ready State, that’s like having the servers racked and setup with Disk, BIOS, and network configured.

While we’ve built a lot, rough in is a relatively early milestone in construction. Even major items like type of roof, siding and windows can still be changed. Speaking of windows, this is like installing an operating system in Ready State. We want to consider this as a distinct milestone because there’s still room to make changes. Once the roof and exteriors are added, it becomes much more disruptive and expensive to make.

Once the house is roughed in, the finishing work begins. Almost nothing from roughed in will be visible to the people living in the house. Like a Ready State setup, the users interact with what gets laid on top of the infrastructure. For homes it’s the walls, counters, fixtures and following. For operators, its applications like Hadoop, OpenStack or CloudFoundry.

Taking this analogy back to where we started, what if we could make rebuilding an entire house take just a day?! In construction, that’s simply not practical; however, we’re getting to a place in Ops where automation makes it possible to reconstruct the infrastructure configuration much faster.

While we can’t re-pour the foundation (aka swap out physical gear) instantly, we should be able to build up from there to ready state in a much more repeatable way.

Yes, you can now build the open version of Crowbar and it has the code to configure a bare metal server.

Let me be very specific about this… my team at Dell tests Crowbar on a limited set of hardware configurations. Specifically, Dell server versions R720 + R720XD (using WSMAN and iIDRAC) and C6220 + C8000 (using open tools). Even on those servers, we have a limited RAID and NIC matrix; consequently, we are not positioned to duplicate other field configurations in our lab. So, while we’re excited to work with the community, caveat emptor open source.

Another thing about RAID and BIOS is that it’s REALLY HARD to get right. I know this because our team spends a lot of time testing and tweaking these, now open, parts of Crowbar. I’ve learned that doing hard things creates value; however, it’s also means that contributors to these barclamps need to be prepared to get some silicon under their fingernails.

I’m proud that we’ve reached this critical milestone and I hope that it encourages you to play along.

PS: It’s worth noting is that community activity on Crowbar has really increased. I’m excited to see all the excitement.

Scale out platforms like Hadoop have different operating rules. I heard an interesting story today in which the performance of the overall system was improved 300% (run went from 15 mins down to 5 mins) by the removal of a node.

In a distributed system that coordinates work between multiple nodes, it only takes one bad node to dramatically impact the overall performance of the entire system.

Finding and correcting this type of failure can be difficult. While natural variability, hardware faults or bugs cause some issues, the human element is by far the most likely cause. If you can turn down noise injected by human error then you’ve got a chance to find the real system related issues.

Consequently, I’ve found that management tooling and automation are essential for success. Management tools help diagnose the cause of the issue and automation creates repeatable configurations that reduce the risk of human injected variability.

I’d also like to give a shout out to benchmarks as part of your tooling suite. Without having a reasonable benchmark it would be impossible to actually know that your changes improved performance.

Teaming Related Post Script: In considering the concept of system performance, I realized that distributed human systems (aka teams) have a very similar characteristic. A single person can have a disproportionate impact on overall team performance.

We’re just sprints from release; consequently, it’s time for the Crowbar/OpenStack community to come and play! You can learn Grizzly and help tune the open source Ops scripts.

While the Crowbar team has been generating a lot of noise around our Crowbar 2.0 work, we have not neglected progress on OpenStack Grizzly. We’ve been building Grizzly deploys on the 1.x code base using pull-from-source to ensure that we’d be ready for the release. For continuity, these same cookbooks will be the foundation of our CB2 deployment.

Features of Crowbar’s OpenStack Grizzly Deployments

We’ve had Nova Compute, Glance Image, Keystone Identity, Horizon Dashboard, Swift Object and Tempest for a long time. Those, of course, have been updated to Grizzly.

Swift Object Barclamps made a lot of progress in Folsom that translates to Grizzly

Apache Web Service

Rack awareness

HA configuration

Distribution Report

“Under the covers” improvements for Crowbar 1.x

Substantial improvements in how we configure host networking

Numerous bug fixes and tweaks

Pull from Source via the Git barclamp

Grizzly branch was switched to use Ubuntu & SUSE packages

We’ve made substantial progress, but there are still gaps. We do not have upgrade paths from Essex or Folsom. While we’ve been adding fault-tolerance features, full automatic HA deployments are not included.

Today my boss at Dell, John Igoe, is part of announcing of the report from the TechAmerica Federal Big Data Commission (direct pdf), I was fully expecting the report to be a real snoozer brimming with corporate synergies and win-win externalities. Instead, I found myself reading a practical guide to applying Big Data to government. Flipping past the short obligatory “what is…” section, the report drives right into a survey of practical applications for big data spanning nearly every governmental service. Over half of the report is dedicated to case studies with specific recommendations and buying criteria.

Ultimately, the report calls for agencies to treat data as an asset. An asset that can improve how government operates.

There are a few items that stand out in this report:

Need for standards on privacy and governance. The report calls out a review and standardization of cross agency privacy policy (pg 35) and a Chief Data Officer position each agency (pg 37).

Clear tables of case studies on page 16 and characteristics on page 11 that help pin point a path through the options.

Definitive advice to focus on a single data vector (velocity, volume or variety) for initial success on page 28 (and elsewhere)

I strongly agree with one repeated point in the report: although there is more data available, our ability to comprehend this data is reduced. The sheer volume of examples the report cites is proof enough that agencies are, and will be continue to be, inundated with data.

One short coming of this report is that it does not flag the extreme storage of data scientists. Many of the cases discussed assume a ready army of engineers to implement these solutions; however, I’m uncertain how the government will fill positions in a very tight labor market. Ultimately, I think we will have to simply open the data for citizen & non-governmental analysis because, as the report clearly states, data is growing faster than capability to use it.

I commend the TechAmerica commission for their Big Data clarity: success comes from starting with a narrow scope. So the answer, ironically, is in knowing which questions we want to ask.

“Double wide” is not a term I’ve commonly applied to servers, but that’s one of the cool things about this new class of servers that Dell, my employer, started shipping today.

My team has been itching for the chance to start cloud and big data reference architectures using this super dense and flexible chassis. You’ll see it included in our next Apache Hadoop release and we’ve already got customers who are making it the foundation of their deployments (Texas Adv Computing Center case study).

If you’re tracking the latest big data & cloud hardware then the Dell PowerEdge C8000 is worth some investigation.

Basically, the Dell C8000 is a chassis that holds a flexible configuration of compute or storage sleds. It’s not a blade frame because the sleds minimize shared infrastructure. In our experience, cloud customers like the dedicated i/o and independence of sleds (as per the Bootstrapping clouds white paper). Those attributes are especially well suited for Hadoop and OpenStack because they support a “flat edges” and scale out design. While i/o independence is valued, we also want shared power infrastructure and density for efficiency reasons. Using a chassis design seems to capture the best of both worlds.

The novelty for the Dell PowerEdge C8000 is that the chassis are scary flexible. You are not locked into a pre-loaded server mix.

There are a plethora of sled choices so that you can mix choices for power, compute density and spindle counts. That includes double-wide sleds positively brimming with drives and expanded GPU processers. Drive density is important for big data configurations that are disk i/o hungry; however, our experience is the customer deployments are highly varied based on the planned workload. There are also significant big data trends towards compute, network, and balanced hardware configurations. Using the C8000 as a foundation is powerful because it can cater to all of these use-case mixes.

Not only are we simultaneously releasing both of these solutions, they reflect a significant acceleration in pace of delivery. Both solutions had beta support for their core technologies (Cloudera 4 & OpenStack Essex) when the components were released and we have dramatically reduced the lag from component RC to solution release compared to past (3.7 & Diablo) milestones.

As before, the core deployment logic of these open source based solutions was developed in the open on Crowbar’s github. You are invited to download and try these solutions yourself. For Dell solutions, we include validated reference architectures, hardware configuration extensions for Crowbar, services and support.

The latest versions of Hadoop and OpenStack represent great strides for both solutions. It’s great to be able have made them more deployable and faster to evaluate and manage.