November 6, 2011

Subscribe

What is a “Hadoop”? Explaining Big Data to the C-Suite

Keep hearing about Big Data and Hadoop? Having a hard time explaining what is behind the curtain?

The term “big data” comes from computational sciences to describe scenarios where the volume of the data outstrips the tools to store it or process it.

Three reasons why we are generating data faster than ever: (1) Processes are increasingly automated; (2) Systems are increasingly interconnected; (3) People are increasingly “living” online.

As huge data sets invaded the corporate world there are new tools to help process big data. Corporations have to run analysis on massive data sets to separate the signal from the noisy data. Hadoop is an emerging framework for Web 2.0 and enterprise businesses who are dealing with data deluge challenges – store, process, index, and analyze large amounts of data as part of their business requirements.

So what’s the big deal? The first phase of e-commerce was primarily about cost and enabling transactions. So everyone got really good at this. Then we saw differentiation around convenience… fulfillment excellence (e.g., Amazon Prime) , or relevant recommendations (if you bought this and then you may like this – next best offer).

Then the game shifted as new data mashups became possible based on… seeing who is talking to who in your social network, seeing who you are transacting with via credit-card data, looking at what you are visiting via clickstreams, influenced by ad clickthru, ability to leverage where you are standing via mobile GPS location data and so on.

The differentiation is shifting to turning volumes of data into useful insights to sell more effectively. For instance, E-bay apparently has 9 petabytes of data in their Hadoop and Teradata cluster. With 97 million active buyers and sellers they have 2 Billion page view and 75 billion database calls each day. E-bay like others is racing to put in the analytics infrastructure to (1) collect real-time data; (2) process data as it flows; (3) explore and visualize.

The continuous challenge in Web 2.0 is how to improve site relevance, performance, understand user behavior, and predictive insight to influence decisions.

This is a never ending arms race as each firm tries to become the portal of choice in a fast changing world. Industries – Travel, Retail, Financial Services, Digital Media, Search etc. — that are consumer oriented are all facing similar real-time information dynamics.

Take for instance, the competitive world of travel – airline, hotel, car rental, vacation rental etc.. Every site has to improve at analytics and machine learning as the contextual data is changing by the second – inventory, pricing, customer comments, peer recommendations, political/economic hotspots, natural disasters like earthquakes etc. Without a sophisticated real-time analytics playbook, sites can become less relevant very quickly.

Hadoop has rapidly emerged as a viable platform for Big Data analytics. Many experts believe Hadoop will subsume many of the data warehousing tasks presently done by traditional relational systems. This will be a huge shift in how IT apps are engineered.

Hadoop Quick Overview

So, What is Apache Hadoop ? A scalable fault-tolerant distributed system for data storage and processing (open source under the Apache license).

Selecting the right tool for the right job….. Technically, Hadoop, open-source MapReduce framework (written in Java), can store and process gobs of data across many commodity computers. Use Hadoop when dealing with (1) Structured or Not (Flexibility); (2) Scalability of Storage/Compute; (3) Complex Data Processing.

Traditional relational databases and data warehouse products excel at OLAP and OLTP workloads over structured data. These form the underpinnings of most IT applications. Use relational databases when dealing with (1) Interactive OLAP Analytics; (2) Multistep ACID Transactions and (3) 100% SQL Compliance.

It is becoming increasingly more difficult for classic techniques to support the wide range of use cases and workloads that power the next wave of digital business.

Hadoop is designed to solve a different problem: the fast, reliable analysis of both structured, unstructured and complex data.

As a result, many IT Engineering teams are deploying the Hadoop ecosystem alongside their legacy IT applications, which allows them to combine old data and new data sets in powerful new ways. It also allows them to offload analysis from the data warehouse (and in some cases, pre-process before putting in the data warehouse).

Technically, Hadoop, a Java based framework, consists of two elements: reliable very large, low-cost data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel/distributed data processing framework called MapReduce.

Scenarios for Using Hadoop

When a user types a query, it isn’t practical to exhaustively scan millions of items. Instead it makes sense to create an index and use it to rank items and find the best matches. Hadoop provides a distributed indexing capability.

Hadoop runs on a collection/cluster of commodity, shared-nothing x86 servers. You can add or remove servers in a Hadoop cluster (sizes from 50, 100 to even 2000+ nodes) at will; the system detects and compensates for hardware or system problems on any server. Hadoop is self-healing and fault tolerant. It can deliver data — and can run large-scale, high-performance processing batch jobs — in spite of system changes or failures.

Three distinct scenarios for Hadoop are:

1) Hadoop as an ETL and Filtering Platform – One of the biggest challenges with high volume data sources is extracting valuable signal from lot of noise. Loading large, raw data into a MapReduce platform for initial processing is a good way to go. Hadoop platforms can read in the raw data, apply appropriate filters and logic, and output a structured summary or refined data set. This output (e.g., hourly index refreshes) can be further analyzed or serve as an input to a more traditional analytic environment like SAS. Typically a small % of a raw data feed is required for any business problem. Hadoop becomes a great tool for extracting these pieces.

2) Hadoop as an exploration engine – Once the data is in the MapReduce cluster, using tools to analyze data where it sits makes sense. As the refined output is in a Hadoop cluster, new data can be added to the existing pile without having to re-index all over again. In other words, new data can be added to existing data summaries. Once the data is distilled, it can be loaded into corporate systems so users have wider access to it.

3) Hadoop as an Archive. Most of the historical data doesn’t need to be accessed and kept in a SAN environment. This historical data is usually archived by tape or disk to secondary storage or sent offsite. When this data is needed for analysis, it’s painful and costly to retrieve it and load it back up… so most people don’t bother using historical data for their analytics. With cheap storage in a distributed cluster, lot’s of data can be kept “active” for continuous analysis. Hadoop is efficient…it allows better utilization of hardware by allowing the generation of different index types in one cluster.

The Hadoop Stack

It’s important to differentiate Hadoop from the Hadoop stack. Firms like Cloudera sell a set of capabilities around Hadoop called the Cloudera’s Distribution for Hadoop (CDH).

This is a set of projects and management tools designed to lower the cost and complexity of administration and production support services; this includes 24/7 problem resolution support, consultative support, and support for certified integrations.

The 2014 version of CDH:

The 2011 version of CDH looked like this:

The 2012 version of CDH (Apache) looks like this:

The introduction of Hadoop stack is changing the business intelligence (reporting/analytics/data mining), which has been dominated by very expensive relational databases and data warehouse appliance products.

What is Hadoop good for?

Searching, log processing, recommendation systems, data warehousing, video and image analysis, archiving seem to be the initial uses. One prominent space where Hadoop is playing a big role in is data-driven online websites. The four primary areas include:

Let’s look at a few realworld examples from LinkedIn, CBS Interactive, Explorys and FourSquare. Walt Disney, Wal-mart, General Electric, Nokia, and Bank of America are also applying Hadoop to a variety of tasks including marketing, advertising, and sentiment and risk analysis. IBM used the software as the engine for its Watson computer, which competed with the champions of TV game show Jeopardy.

Hadoop @ LinkedIn

LinkedIn is a massive data hoard whose value is connections. It currently computes more than 100 billion personalized recommendations every week, powering an ever growing assortment of products, including Jobs You May be Interested In, Groups You May Like, News Relevance, and Ad Targeting.

LinkedIn leverages Hadoop to transform raw data to rich features using knowledge aggregated from LinkedIn’s 125 million member base. LinkedIn then uses Lucene to do real-time recommendations, and also Lucene on Hadoop to bridge offline analysis with user-facing services. The streams of user-generated information, referred to as a “social media feeds”, may contain valuable, real-time information on the LinkedIn member opinions, activities, and mood states.

CBS Interactive – Leveraging Hadoop

CBS Interactive is using Hadoop as the web analytics platform, processing one Billion weblogs daily (grown from 250 million events per day) from hundreds of web site properties.

Who is CBS Interactive? They are the online division for CBS, the broadcast network. They are a top 10 global web property and the largest premium online content network. Some of the brands include: CNET, Last.fm, TV.com, CBS Sports, 60 Minutes, to name a few.

CBS Interactive migrated processing from a proprietary platform to Hadoop to crunch web metrics. The goal was to achieve more robustness, fault-tolerance and scalability, and significant reduction of processing time to reach SLA (over six hours reduction so far). To enable this they built an Extraction, Transformation and Loading ETL framework called Lumberjack, built based on python and streaming.

Explorys and Cleveland Clinic

Explorys, founded in 2009 in partnership with the Cleveland Clinic, is one of the largest clinical repositories in the United States with 10 million lives under contract. The Explorys healthcare platform is based upon a massively parallel computing model that enables subscribers to search and analyze patient populations, treatment protocols, and clinical outcomes. With billions of clinical and operational events already curated, Explorys helps healthcare leaders leverage analytics for break-through discovery and the improvement of medicine. HBase and Hadoop are at the center of Explorys. Already ingesting billions of anonymized clinical records, Explorys provides a powerful and HIPAA compliant platform for accelerating discovery.

Hadoop @ Orbitz

Travel – air, hotel, car rentals – is an incredibly competitive space. Take the challenge of hotel ranking. Orbitz .com generates ~1.5 million air searches and ~1 million hotel searches a day in 2011. All this activity generates massive amounts of data – over 500 GB/day of log data. The challenge was expensive and difficult to use existing data infrastructure for storing and processing this data.

Orbitz needed an infrastructure that provides (1) long term storage of large data sets; (2) open access for developers and business analysts; (3) ad-hoc quering of data and rapid deploying of reporting applications. They moved to Hadoop and Hive to provide reliable and scalable storage and processing of data on inexpensive commodity hardware. Hive is an open-source data warehousing solution built on top of Hadoop which allows easy data summarization, adhoc querying and analysis of large datasets stored in Hadoop. Hive simplifies Hadoop data analysis — users can use SQL rather than writing low level custom code. Highlevel queries are compiled into Hadoop Mapreduce jobs.

Hadoop @ Foursquare

foursquare is a mobile + location + social networking startup aimed at letting your friends in almost every country know where you are and figuring out where they are.

As a platform foursquare is now aware of 25+ million venues worldwide, each of which can be described by unique signals about who is coming to these places, when, and for how long. To reward and incent users foursquare allows frequent users to collect points, prize “badges,” and eventually, coupons, for check-ins.

Foursquare is built on enabling better mobile + location + social networking by applying machine learning algorithms to the collective movement patterns of millions of people. The ultimate goal is to build new services which help people better explore and connect with places.

Foursquare engineering employs a variety of machine learning algorithms to distill check-in signals into useful data for app and platform. foursquare is enabled by a social recommendation engine and real-time suggestions based on a person’s social graph.

“With over 500 million check-ins last year and growing, we log a lot of data. We use that data to do a lot of interesting analysis, from finding the most popular local bars in any city, to recommending people you might know. However, until recently, our data was only stored in production databases and log files. Most of the time this was fine, but whenever someone non-technical wanted to do some ad-hoc data exploration, it required them knowing SCALA and being able to query against production databases.

This has become a larger problem as of late, as many of our business development managers, venue specialists, and upper management eggheads need access to the data in order to inform some important decisions. For example, which venues are fakes or duplicates (so we can delete them), what areas of the country are drawn to which kinds of venues (so we can help them promote themselves), and what are the demographics of our users in Belgium (so we can surface useful information)?”

What Data Projects is Hadoop Driving?

Now that we have dispensed with examples and you are eager to get started, what data driven projects do you undertake?

Where do you start? Do you do legacy application retrofits to leverage the Hadoop stack? Do you do large scale data center upgrades to handle the terabytes – petabytes – exabytes load? Do you do a Proof of Concept (PoC) project to investigate technologies?

Hadoop and Big Data Vendor Landscape…

Which vendors do you engage with for what. How do you make sense of the landscape?

Lots of new entrants in this space that are raising the innovation quotient. According to GigaOM:

“Cloudera which is synonymous with Hadoop has raised $76 million since 2009 [Has raised another $65 million in December 2012]. Newcomers MapR and Hortonworks have raised $29 million and $50 million. And that’s just at the distribution layer, which is the foundation of any Hadoop deployment. Up the stack, Datameer, Karmasphere and Hadapt have each raised around $10 million, and then are newer funded companies such as Zettaset, Odiago and Platfora. Accel Partners has started a $100 million big data fund to feed applications utilizing Hadoop and other core big data technologies. If anything, funding around Hadoop should increase in 2012, or at least cover a lot more startups.”

The data growth chart below gives you a sense of how quickly we are creating digital data. A few years ago Terabytes were considered a big deal. Now Exabytes are the new Terabytes. Making sense of large data volumes at real-time speed is where we are heading.

Hadoop based analytic complexity grows as data mining, predictive modeling and advanced statistics become the norm. Usage growth is driving the need for more analytical sophistication.

Hadoop’s framework brings a new set of challenges related to the compute infrastructure and underlined network architectures. As Hadoop graduates from pilots to a mission critical component of the enterprise IT infrastructure, integrating information held in Hadoop and in Enterprise RDBMS becomes imperative.

Finally, adoption of Hadoop in the enterprise will not be an easy journey, and the hardest steps are often the first. Then, they get harder! Weaning the IT organizations off traditional DB and EDW models to use a new approach can be compared to moving the moon out of its orbit with a spatula… but it can be done.

2013-2014 may be the year Hadoop crosses into mainstream IT.

Sources and References

1) Hadoop is a Java-based software framework for distributed processing of data-intensive transformations and analyses.

MapReduce breaks a big data problem into subproblems; distributes them onto tens, hundreds, and even thousands of processing nodes; and then combines the results into a smaller, easy-to-analyze data set. Think of this as an efficient parallel assembly line for data analysis. MapReduce was first presented to the world via a 2004 white paper in which Google. Yahoo re-implemented this technique and open sourced it via the Apache foundation.

HBase is an open-source, distributed, versioned, column-oriented store modeled after Google’s BigTable architecture. HBase scales to billions of rows and millions of columns, while ensuring that write and read performance remain constant.

Hive is a SQL’esque query language for interrogating Apache Hadoop

Pig, a high-level query language for large-scale data processing

ZooKeeper, a toolkit of coordination primitives for building distributed systems

3) Hadoop ecosystem is evolving constantly so this makes it tricky for enterprise IT adoption which tends to like stable proven models with a big maintenance tail.

4) It’s important to understand what Hadoop doesn’t do…. Big Data technology like the Hadoop stack does not deliver insight, however – insights depend on analytics that result from combing the results of things like Hadoop MapReduce jobs with manageable “small data” already in the Data warehouse (DW).

Great article and overview. Appreciate the links to the case studies as well. You should read about a company in Charlotte, NC doing some impressive things regarding big data and hadoop for financial services, Tresata at http://www.tresata.com. Abhishek and Richard are visionaries in the space. (Note: I have no affiliation with the people or the company.)

Ravi…great insight into Big Data and Hadoop. These things are generally considered to be difficult to sell to the senior management considering there reliance on tradition BI using enterprise data warehouses. Your insight on how Big data generates business value is very informative.

Thanks for a marvelous posting! I truly enjoyed reading
it, you might be a great author. I will always bookmark your blog and definitely will come back in the foreseeable future.
I want to encourage continue your great job, have a nice day!

I don’t even understand how I stopped up here, however I thought this publish used to be great. I do not recognise who you’re but definitely you are going to a famous blogger when you aren’t already. Cheers!

Brilliant information about Hadoop which is very useful for Hadoop Online Learners. I actually enjoy this post and the internet site all in all! Your piece of writing is really plainly composed as well as simply understandable. Your current Blog design is awesome as well! Please maintain up the very good job. We all require far more such website owners like you on the net and much fewer spammers. Fantastic mate! Regards, hadoop online training hyderabad

Defining Business Analytics

What is Business Analytics? Business Analytics is the intersection of business and technology, offering new opportunities for a competitive advantage. Business analytics unlocks the predictive potential of data analysis to improve financial performance, strategic management, and operational efficiency.

What is BI? BI is the "computer-based techniques used in spotting, digging-out, and analyzing 'hard' business data, such as sales revenue by products or departments or associated costs and incomes. Objectives of BI implementations include (1) understanding of a firm's internal and external strengths and weaknesses, (2) understanding of the relationship between different data for better decision making, (3) detection of opportunities for innovation, and (4) cost reduction and optimal deployment of resources." (Business Dictionary). Most widely used BI tool is Microsoft Excel.
-----------------------------
What is Big Data? Big data refer to data scenarios that grow so large (petabytes and more) that they become awkward to work with using traditional database management tools. The challenge stems from data volume + flow velocity + noise to signal conversion. Big data is spawning new tools that are mix of significant processing power, parallelism and statistical, machine learning, or pattern recognition techniques
----------------------------

Corporate performance management software and performance management concepts, such as the balanced scorecard, enable organizations to measure business results and track their progress against business goals in order to improve financial performance.
-----------------------------

Data visualization tools, include mashups, executive dashboards, performance scorecards and other data visualization technology, is becoming a major category.
-----------------------------

BI platforms provide a range of capabilities for building analytical applications. Examples are Oracle OBIEE, SAP Business Objects 4.0. There are many choices and combinations of BI platforms, capabilities and use cases as well as many emerging BI technologies such as in memory analytics, interactive visualization and BI integrated search. The idea of standardizing on one supplier for all of one’s BI capabilities is difficult to do. Increasingly, standardization and more about managing a portfolio of tools used for a set of capabilities and use cases.
----------------------------
Data integration tools and architectures in support of BI continue to evolve. Extract-Transfer-Load (ETL) tools make up a big segment of this category in addition to data mapping tools. Organizations must now support a range of delivery styles, latencies, and formats.
----------------------------
BI is about "sense and respond." Analytics is about "anticipate and shape" models.

About

Business Analytics 3.0 blog is meant for decision makers and managers who are trying to make sense of the rapidly changing technology landscape and build next generation solutions. It is aimed at helping business decision makers navigate the "Raw Data -> Aggregate Data -> Intelligence -> Insight -> Decisions" chain.