Chapter 5. Cloudy with a Chance of Meatballs: When Clouds Meet Big Data

The Big Tease

As scientific and commercial supercomputing collide with public and private clouds, the ability to design and operate data centers full of computers is poorly understood by enterprises not used to handling 300 million anythings. The promise of a fully elastic and cost-effective computing plant is quite seductive, but Yahoo!, Google, and Facebook solved these problems on their own terms. More conventional enterprises that are now facing either internet-scale computing or a desire to improve the efficiency of their enterprise-scale physical plant will need to identify their own requirements for cloud computing and big data.

Conventional clouds are a form of platform engineering designed to meet very specific and mostly operational requirements. Many clouds are designed by residents of silos that only value the requirements of their own silo. Clouds, like any platform, can be designed to meet a variety of requirements beyond the purely operational. Everyone wants an elastic platform (or cloud), but as discussed in Chapter 2, designing platforms at internet scale always comes with trade-offs, and elasticity does not come free or easy. Big data clouds must meet stringent performance and scalability expectations, which require a very different form of cloud.

The idea of clouds “meeting” big data or big data “living in” clouds isn’t just marketing hype. Because big data followed so closely on the trend of cloud computing, both customers and vendors still struggle to understand the differences from their enterprise-centric perspectives. On the surface there are physical similarities in the two technologies—racks of cloud servers and racks of Hadoop servers are constructed from the same physical components. But Hadoop transforms those servers into a single 1000-node supercomputer, whereas conventional clouds host thousands of private mailboxes.

Conventional clouds consist of applications such as mailboxes and Windows desktops and web servers because those applications no longer saturate commodity servers. Cloud technology made it possible to stack enough mailboxes onto a commodity server so that it could achieve operational efficiency. However, Hadoop easily saturates every piece of hardware it can get its hands on, so Hadoop is a bad fit for a conventional cloud that is used to containing many idle applications. Although everyone who already has existing conventional clouds seems to think big data should just work in them effortlessly, it’s never that easy. Hadoop clouds must be designed to support supercomputers, not idle mailboxes.

A closer look reveals important differences between conventional clouds and big data; most significantly, they achieve scalability in very different ways, for very different reasons. Beyond the mechanisms of scalability that each exploits, the desire to put big data into clouds so it all operates as one fully elastic supercomputing platform overlooks the added complexity that results from this convergence. Complexity impedes scalability.

Scalability 101

Stated simply, the basic concept of scalability defines how well a platform handles a flood of new customers, more friends, or miles of CCTV footage. Achieving scalability requires a deep understanding about what the platform is attempting to accomplish and how it does that. When a river of new users or data reaches flood stage, a platform scales if it can continue to handle the increased load quickly and cost-effectively.

Although the concept of scalability is easy to understand, the strategies and mechanisms used to achieve scalability for a given set of workloads is complex and controversial, often involving philosophical arguments about how to share things—data, resources, time, money—you know, the usual stuff that humans don’t share well. How systems and processes scale isn’t merely a propeller-head computer science topic; scalability applies to barbershops, fast food restaurants, and vast swaths of the global economy. Any business that wants to grow understands why platforms need to scale; understanding how to make them scale is the nasty bit.

The most common notion of scalability comes from outside the data center, in economies of scale. Building cars, burgers, or googleplexes requires a means of production that can cost-effectively meet the growth in demand. In the 20th century, the US economy had an affluent population of consumers who comprised a market so large that if a company’s product developed a national following, it had to be produced at scale.

This tradition of scalable production traces back past Henry Ford, past the War of Independence, and into the ancestral home of the industrial revolution in Europe. During a minor tiff over a collection of colonies, the British Army felt the need to exercise some moral suasion. Their field rifles were serious pieces of engineering—nearly works of art. They were very high quality, both expensive to build and difficult to maintain in the field. The US army didn’t have the engineering or financial resources to build rifles anywhere near that quality, but they knew how to build things simply, cheaply, and with sufficient quality—and these are the tenets of scalable manufacturing.

The only catch to simple, quick, and cheap is quality—not enough quality and the rifles aren’t reliable (will it fire after being dropped in a ditch?), and too much quality means higher costs, resulting in fewer weapons. The British had museum-quality rifles, but if that did not translate into more dead Americans, the quality was wasted and therefore not properly optimized. This is the first lesson of doing things at scale: too much quality impedes scale and too little results in a lousy body count.

Scalability isn’t just about doing things faster; it’s about enabling the growth of a business and being able to juggle the chainsaws of margins, quality, and cost. Pursuit of scalability comes with its own perverse form of disruptive possibilities. Companies that create a great product or service can only build a franchise if things scale, but as their business grows, they often switch to designing for margins instead of market share. This is a great way to drive profit growth until the franchise is threatened.

Choosing margins over markets is a reasonable strategy in tertiary sectors that are mature and not subject to continuous disruption. However, computing technology will remain immature, innovative, and disruptive for some time. It is hard to imagine IT becoming as mature as bridge building, but the Golden Gate bridge was built by the lunatics of their time. Companies that move too far down the margins path can rarely switch back to a market-driven path and eventually collapse under their own organizational weight. The collapse of a company is never fun for those involved, but like the collapse of old stars, the resultant supernova ejects talent out into the market to make possible the next wave of innovative businesses.

Motherboards and Apple Pie

Big data clusters achieve scalability based on pipelining and parallelism, the same assembly line principles found in fast food kitchens. The McDonald brothers applied the assembly line to burgers to both speed up production and reduce the cost of a single burger. Their assembly line consists of several burgers in various states of parallel assembly. In the computer world, this is called pipelining. Pipelining also helps with efficiency by breaking the task of making burgers into a series of simpler steps. Simple steps require less skill; incremental staff could be paid less than British rifle makers (who would, of course, construct a perfect burger). More burgers plus faster plus less cost per burger equals scalability.

In CPU design, breaking down the circuitry pipeline that executes instructions allows that pipeline to run at higher speeds. Whether in a kitchen or a CPU, simple steps make it possible to either reduce the effort required (price) or increase the production (performance) of burgers or instructions. Scalable systems require price and performance to be optimized (not just price and not just performance), all while trying to not let the bacteria seep into the meat left on the counter. Well, that’s the theory.

Once a kitchen is setup to efficiently make a series of burgers, then adding another burger assembly line with another grill and more staff to operate it will double the output of the kitchen. It is important to get the first assembly line as efficient as possible when it’s the example for thousands of new restaurants. Back inside the CPU, it made economic sense for designers to build multiple floating-point functional units because most of the scientific code that was being run by their customers was saturating the CPU with these instructions. Adding another floating point “grill” made the program run twice as fast and customers gladly paid for it.

Scientific and commercial supercomputing clusters are mostly about parallelism with a little bit of pipelining on the side. These clusters consist of hundreds of “grills” all cooking the data in parallel, and this produces extraordinary scalability. Each cluster node performs exactly the same task (which is more complicated than applying mustard and ketchup), but the cluster contains hundreds of identically configured computers all running exactly the same job on their private slices of data.

Being and Nothingness

Big data cluster software like Hadoop breaks up the analysis of a single, massive dataset into hundreds of identical steps (pipelining) and then runs hundreds of copies at once (parallelism). Unlike what happens in a conventional cloud, Hadoop is not trying to cook hundreds of burgers; it’s trying to cook one massive burger. The ability to mince the problem and then charbroil it all together is where the genius lies in commodity supercomputing.

System designers use the term “shared-nothing” to indicate the process of hundreds of nodes working away on the problem while trying not to bother other nodes. Shared-nothing nodes try hard not to share anything. In practice, they do a little sharing, but only enough to propagate the illusion of a single, monolithic supercomputer. On the other hand, shared-everything data architectures emphasize the value gained by having all nodes see a common set of data. Software and hardware mechanisms are required to insure the single view remains correct or coherent but these mechanisms reduce isolation. The need to insure a single, shared view is traded for scalability. There is a class of workloads that does benefit from a single shared view, but in the world of big data, massive scalability is the last thing to be traded away, so shared nothing it is.

When doubled in size, a shared-nothing cluster that scales perfectly operates twice as fast or the job completes in half the time. A shared-nothing cluster never scales perfectly, but getting a cluster to 80% faster when doubled is still considered good scaling. Several platform and workload engineering optimization strategies could be employed to increase the efficiency of a cluster from 80% to 90%. A cluster that scales better remains smaller, which also improves its operational efficiency as well. Hadoop has already been demonstrated to scale to very high node counts (thousands of nodes). But since a 500-node cluster generates more throughput than a 450-node cluster, whether it is 8% faster instead of 11% isn’t as important as Hadoop’s ability to scale beyond thousands of nodes. Very few data processing platforms can achieve that, let alone do it affordably.

Parity Is for Farmers

Hadoop has done for commercial supercomputing what had already been accomplished in scientific supercomputing in the 1990s. Monolithic HPC supercomputers have been around since the 1960s when Seymour Cray designed the Control Data 6600, which is considered one of the first successful supercomputers. He must have worked as a short-order cook in a previous lifetime since his design innovations resemble those from a fast food kitchen. What made the 6600 fast was the pipelining of tasks within the CPU. The steps were broken down to make them easier and cheaper to design and build. After Cray had finished pipelining the CPU circuits, he turned to parallelism by adding extra burger grills designed specifically to increase the number of mathematical or floating-point results that could be calculated in a second.

After Cray left CDC, he took his design one step further in the Cray-1 by noticing that solving hundreds of equations generated thousands of identical instructions (add two floating numbers, divide by a third). The only difference between all these instructions was the values (or operands) used for each instruction. He decided to build a vector-functional unit that would perform 64 sets of calculations at once—burger’s up!

Pipelining and parallelism have been hallmarks of HPC systems like the Cray, but very few corporations could afford a Cray or an IBM mainframe. Decades later, microprocessors became both affordable and capable enough that if you could figure out how to lash a hundred of them together, you might have a shot at creating a poor man’s Cray. In the late 1980s, software technology was developed to break up large monolithic jobs designed to run on a Cray into a hundred smaller jobs. It wasn’t long before some scheduling software skulking around a university campus in the dead of night was infesting a network of idle workstations. If a single workstation was about 1/50th the speed of a Cray, then you only needed 50 workstations to break even. Two hundred workstations were four times faster than the Cray and the price/performance was seriously good, especially since another department probably paid for those workstations.

Starting with early Beowulf clusters in the 1990s, HPC clusters were being constructed from piles of identically configured commodity servers (i.e., pipeline simplicity). Today in 2013, a 1000-node HPC cluster constructed from nodes that have 64 CPU cores and zillions of memory controllers can take on problems that could only be dreamed about in the 1990s.

Google in Reverse

Hadoop software applies the HPC cluster trick to a class of internet-scale computing problems that are considered non-scientific. Hadoop is an evolution of HPC clusters for a class of non-scientific workloads that were plaguing companies with internet-scale, non-scientific datasets. Financial services and healthcare companies also had problems as large as the HPC crowd, but until Hadoop came along, the only way to analyze their big data was with relational databases.

Hadoop evolved directly from commodity scientific supercomputing clusters developed in the 1990s. Hadoop consists of a parallel execution framework called Map/Reduce and Hadoop Distributed File System (HDFS). The file system and scheduling capabilities in Hadoop were primarily designed to operate on, and be tolerant of, unreliable commodity components. A Hadoop cluster could be built on eight laptops pulled from a dumpster and it would work—not exactly at enterprise grade, but it would work. Yahoo! couldn’t initially afford to build 1000-node clusters with anything other than the cheapest sub-components, which is the first rule of assembly lines anyway; always optimize the cost and effort of the steps in the pipeline. Hadoop was designed to operate on terabytes of data spread over thousands of flaky nodes with thousands of flaky drives.

Hadoop makes it possible to build large commercial supercomputing platforms that scale to thousands of nodes and, for the first time, scale affordably. A Hadoop cluster is a couple of orders of magnitude (hundreds of times) cheaper than platforms built on relational technology and, in most cases, the price/performance is several orders of magnitude (thousands of times) better. What happened to the big-iron, expensive HPC business in the early 1990s will now happen to the existing analytics and data warehouse business. The scale and price/performance of Hadoop is significantly disrupting both the economics and nature of how business analytics is conducted.

Isolated Thunderstorms

Everybody understands the value gained from a high-resolution view of consumers’ shopping habits or the possibility of predicting where crimes are likely to take place. This value is realized by analyzing massive piles of data. Hadoop must find a needle, customer, or criminal in a large haystack and do it quickly. Hadoop breaks up one huge haystack into hundreds of small piles of hay. The act of isolating the haystacks into smaller piles creates a scalable, shared-nothing environment so that all the hundreds of piles are searched simultaneously.

Superficially, Hadoop seems to be related to cloud computing because the first cloud implementations came from the same companies attempting internet-scale computing. The isolation benefits that make any shared-nothing platform achieve scalability also make clouds scale. However, clouds and Hadoop achieve isolation through very different mechanisms. The Hadoop isolation mechanism breaks up a single pile of hay into hundreds of piles so each one of those nodes works on its own private pile. Hadoop creates synthesized isolation. But a cloud running a mail application that supports hundreds of millions of users already started out with hundreds of millions of mailboxes or little piles of hay. Conventional clouds have natural isolation. The isolation mechanism results from the initial requirement that each email user is isolated from the others for privacy. Although Hadoop infrastructure looks very similar to the servers running all those mailboxes, Hadoop clusters remain one single, monolithic supercomputer disguised as a collection of cheap, commodity servers.

The Myth of Private Clouds

Cloud computing evolved from grid computing, which evolved from client-server computing, and so on, back to the IBM Selectric. Operating private clouds remains undiscovered country for most enterprises. Grids at this scale are not simply plumbing; they are serious adventures in modern engineering. Enterprises mostly use clouds to consolidate their sprawling IT infrastructure and to reduce costs. Many large enterprises in the 1990s had hundreds or thousands of PCs that needed care and feeding, which mostly meant installing, patching, and making backups. As the workforce became more mobile and laptops replaced desktops, employees’ computers and employers’ sensitive data became more difficult to protect. Early private clouds were set up to move everyone’s “My Documents” folder back into their data center so it could be better secured, backed up, and restored.

When personal computers began to have more computing power than the servers running the back-office accounting systems, most of the spare power went unused by those users who spent their days using MS Office. A single modern workstation in the early 2000s could support 8 or 10 users. This “bare iron” workstation could be carved up into 10 virtual PCs using a hypervisor like VMware’s ESX or Red Hat’s KVM. Each user got about the same experience as having their own laptop gave them, but the company reduced costs.

Conventional clouds consist of thousands of virtual servers, and as long as nothing else is on your server beating the daylights out of it, you’re good to go. The problem with running Hadoop on clouds of virtualized servers is that it beats the crap out of bare-iron servers for a living. Hadoop achieves impressive scalability with shared-nothing isolation techniques, so Hadoop clusters hate to share hardware with anything else. Share no data and share no hardware—Hadoop is shared-nothing.

While saving money is a worthwhile pursuit, conventional cloud architecture is still a complicated exercise in scalable platform engineering. Many private conventional clouds will not be able to support Hadoop because they rely on each individual application to provide its own isolation. Hadoop in a cloud means large, pseudo-monolithic supercomputer clusters lumbering around. Hadoop is not elastic like millions of mailboxes can be. It is a 1000-node supercomputer cluster that thinks (or maybe wishes) it is still a Cray. Conventional clouds designed for mailboxes can be elastic but will not elastically accommodate supercomputing clusters that, as a single unit of service, span 25 racks.

It’s not that Hadoop can’t run on conventional clouds—it can and will run on a cloud built from dumpster laptops, but your SLA mileage will vary a lot. Performance and scalability will be severely compromised by underpowered nodes connected to underpowered external arrays or any other users running guests on the physical servers that contain Hadoop data nodes. A single Hadoop node can easily consume physical servers, even those endowed with 1300+ MB/sec of disk I/O throughput.

A Tiny Data 8-node bare-iron cluster can produce over 10GB/sec. Most clouds share I/O and network resources based on assumptions that guests will never be as demanding as Hadoop data nodes are. Running Hadoop in a conventional cloud will melt down the array because of the way that most SANs are configured and connected to clouds in enterprises. Existing clouds that were designed, built, and optimized for web servers, email, app servers, and Windows desktops simply will not be able to support Hadoop clusters.

The network fabric of a Hadoop cluster is the veins and arteries of HDFS and must be designed for scalable throughput—not attachment and manageability, which is the default topology for conventional clouds and most enterprise networks. If Hadoop is imposed on a conventional cloud, it needs to be a cloud designed to run virtual supercomputers in the first place, not a conventional cloud that has been remodeled, repurposed, and reshaped. Big data clouds have to be designed for big data.

My Other Data Center Is an RV Park

Companies attempting to build clouds or big data platforms that operate at internet-scale are discovering they need to rethink everything they thought they knew about how to design, build, and run data centers. The internet-scale pioneering companies initially purchased servers from vendors like IBM, HP, and Dell, but quickly realized that the value, and therefore pricing, of those servers was driven by features they didn’t need (like graphics chips and RAID cards).

The same features that might be useful for regular customers became completely irrelevant to companies that were going to sling 85,000 servers without ever running a stitch of Windows. Hardware vendors were building general-purpose servers so that their products would appeal to the greatest number of customers. This is a reasonable convention in product design, but like scientific computing users, commercial computing users do not require general-purpose servers to construct purpose-built supercomputing platforms.

At internet-scale, absolutely everything must be optimized, including everything in the computing plant. After learning how to buy or build stripped-down servers, these companies quickly moved on to optimizing the entire notion of a data center, which is just as critical as the server and software to the platform. It was inevitable that the physical plant also had to be deconstructed and optimized.

Today’s data centers can be found in a stack of intermodal containers like those on an ocean freighter. Because a container full of computers only needs power, water for cooling, and a network drop, data centers can also be located on empty lots or abandoned RV parks.

Converge Conventionally

Cloud computing evolved from the need to handle millions of free email users cheaply. Big data evolved from a need to solve an emerging set of supercomputing problems. Grids or clouds or whatever they will be called in the future are simply platforms of high efficiency and scale.

Although clouds are primarily about the operational aspects of a computing plant, learning how to operate a supercomputer is not like operating any traditional IT platform. Managing platforms at this scale is extremely disruptive to the enterprise’s notion of what it means to operate computers. And over time, the care and feeding of thousands of supercomputers will eventually lead us to change the way computers must operate.

The best content for your career. Discover unlimited learning
on demand for around $1/day.