Hadoop for business: Analytics across industries

In this episode of the O’Reilly Podcast, O’Reilly’s Ben Lorica chats with Ben Sharma, CEO and co-founder of Zaloni, a company that provides enterprise data management solutions for Hadoop. Sharma was one of the first users of Apache Hadoop and has a background in enterprise solutions architecture and data analytics.

Before starting Zaloni, Sharma spent many years as a business consultant and saw that companies across industries were struggling to process, store, and extract value from their data. He worked extensively in telecom, helping equipment vendors deploy large-scale network infrastructures at carriers around the world, and came to see how Hadoop could affect companies' business analytics, not just IT.

In this interview, Lorica and Sharma discuss the early days of Hadoop and how businesses across industries are benefiting from it. They also discuss the evolution of tools in the space and how more companies are moving toward real-time decision-making as streaming tools and real-time data grow.

Making Hadoop have an impact in business analytics

In the early days at Zaloni, Sharma had to make a concerted effort to convince large enterprises that Hadoop was the right solution for their data pipeline. At that time, around 2009, large enterprises were still hesitant to adopt Hadoop because enterprise data warehouses already offered many of the features they were comfortable using. Sharma describes one early project:

One of our first customers had a job that would run in a traditional Oracle environment for 36 to 40 hours during a quarter end process. What they were trying to do is get information from their sales data about the equipment renewal. This is an equipment company — they make network equipment and they sell a lot of it. Their equipment has a warranty, and the warranties come up for renewal.

What they wanted to do is, during a quarter end process, go through all those warranties they’ve collected — 30 million records or so — and identify which customers are coming up for renewal so their sales team can pursue those customer opportunities and get more revenue. It takes them 36 to 40 hours to process this job during a quarter end process. What happens is this runs on their traditional Oracle system, which is also their Oracle ERP system that is doing order entry for all the orders that are coming in. At the end of the quarter, the sales team is trying to get in more orders and at the same time, this job is running, which is slowing down the whole system.

…

We were able to take that same workload and run it on a very simple Hadoop cluster and run that job in two hours or so. That opened their eyes. … Now they do not slow down their order entry process, they can enable their sales teams to be more agile and be able to get these reports on a more regular basis instead of during the quarter end. That is business value that they were able to show to their line of business, and they were able to quickly get that project on-boarded in Hadoop as one of their first use cases.

Metadata and the validation of on-boarding data

Metadata is becoming increasingly important in data processing. Practitioners working with data are finding it useful for many things, especially lineage and provenance, which help them understand what others have done to the data downstream. Being able to see what has been done to your data enables you to create a repeatable process. Sharma describes Zaloni's approach:

We start from managed data ingestion, bringing data into the Hadoop environment, then we’re able to capture metadata as the data is coming in and able to define business and technical metadata along the way.

…

We saw a need for metadata from early on in terms of not just the schema, if you will — because schema is needed as you process the data and is a given — but also the business metadata, so you can define some of the business attributes that are needed on the data, and also the operational metadata so you can maintain the provenance or the lineage of the data.

…

The other aspect we were seeing more and more is that in a traditional data warehouse environment or a database environment, you had a lot of constraints and other things that you can define on the tables as you’re onboarding the data so that you can reject bad data. As you know, Hadoop is schema-less, where these checks are not done as you ingest the data … Being able to do that validation after you ingest the data, in a Hadoop-friendly way, and still being able to get some measure of data quality was also important because that would drive a lot of the decisions in terms of analytics that were built out of the data. In a lot of cases, we were already seeing our customers starting to onboard data from external third parties where they needed to have these data quality checks, or they were getting data from machines and non-human-entered data, which also needed data quality checks because configurations and other things could be different.

Those are also aspects that were part of what we were trying to put together as a reusable platform for data management.
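
As a rough illustration of the post-ingest validation Sharma describes, the Python sketch below checks records that have already landed, separates the ones that fail, and reports a simple quality ratio. It is a minimal sketch under stated assumptions: the record fields (`device_id`, `reading`, `reported_on`), the three rules, and the `validate` helper are all hypothetical and are not drawn from Zaloni's platform or from any Hadoop API.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Callable, Iterable, List, Tuple


def is_number(value) -> bool:
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def is_iso_date(value) -> bool:
    """Return True if the value is a YYYY-MM-DD date string."""
    try:
        date.fromisoformat(value)
        return True
    except (TypeError, ValueError):
        return False


@dataclass
class Rule:
    name: str
    check: Callable[[dict], bool]


@dataclass
class ValidationReport:
    total: int = 0
    valid: List[dict] = field(default_factory=list)
    # Each rejected entry pairs the record with the names of the rules it failed.
    rejected: List[Tuple[dict, List[str]]] = field(default_factory=list)

    @property
    def quality(self) -> float:
        """Fraction of records that passed every rule."""
        return len(self.valid) / self.total if self.total else 1.0


# Hypothetical rules standing in for the constraints a warehouse table would enforce.
RULES = [
    Rule("device_id present", lambda r: bool((r.get("device_id") or "").strip())),
    Rule("reading is numeric", lambda r: is_number(r.get("reading"))),
    Rule("reported_on is ISO date", lambda r: is_iso_date(r.get("reported_on"))),
]


def validate(records: Iterable[dict], rules: List[Rule]) -> ValidationReport:
    """Apply every rule to every already-ingested record."""
    report = ValidationReport()
    for record in records:
        report.total += 1
        failed = [rule.name for rule in rules if not rule.check(record)]
        if failed:
            report.rejected.append((record, failed))
        else:
            report.valid.append(record)
    return report


# Made-up sample records, as if read back after ingestion.
ingested = [
    {"device_id": "p-001", "reading": "41.5", "reported_on": "2016-03-01"},
    {"device_id": "", "reading": "39.0", "reported_on": "2016-03-01"},
    {"device_id": "p-003", "reading": "n/a", "reported_on": "03/01/2016"},
    {"device_id": "p-004", "reading": "40.2", "reported_on": "2016-03-02"},
]

report = validate(ingested, RULES)
print(f"{len(report.valid)} of {report.total} records passed")
for record, failed in report.rejected:
    print("rejected:", record, "failed:", failed)
```

Keeping the rejected records alongside the reasons they failed, rather than discarding them, is one way to preserve the kind of operational record that supports lineage. A real deployment would more likely run such checks as a distributed job over the ingested files than as a loop in a single process.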

Real-time decision-making

In his experience working with large-scale enterprises, Sharma has found that the window for making business decisions is more compressed than ever. Decision makers are turning to streaming data to make decisions faster. Sharma gives us an example:

If you look at how companies are transitioning, from their data journey perspective, a lot of the data that is being generated is no longer transactional data generated by a human doing some action. A lot of the data is being generated by sensors or machines that are being deployed in their customers’ environments.

…

I was talking to a customer yesterday who makes printers, and they are putting sensors in the printer heads so they know if anything goes wrong — they are getting 20 to 50 data points in an hour or sometimes more, and they have hundreds of thousands of these high-end printers deployed all over the world that are sending them data. They need to be able to collect and react to this data so they can provide a better uptime of these high-end printers to their customers.

From libraries to applications

As components in the big data ecosystem continue to mature, Sharma is seeing growing interest in vertical applications:

We definitely see the need for being able to build applications rapidly based on proven platforms. … We see that a lot of our customers who are using Hadoop and the higher level frameworks on top of Hadoop are very focused on how to build a data-centric application, how to build a data product on top of this data platform. That is more of the focus of the customers that we’re working with right now.

You can listen to our entire interview in our SoundCloud playlist.