Making Hadoop Part of the Enterprise Data Ecosystem: A Spotlight Q&A with Alex Gorelik of Informatica

BeyeNETWORK Spotlights focus on news, events and products in the business intelligence ecosystem that are poised to have a significant impact on the industry as a whole; on the enterprises that rely on business intelligence, analytics, performance management, data warehousing and/or data governance products to understand and act on the vital information that can be gleaned from their data; or on the providers of these mission-critical products.

Presented as Q&A-style articles, these interviews conducted by the BeyeNETWORK present the behind-the-scenes view that you won’t read in press releases.

This BeyeNETWORK spotlight features Ron Powell's interview with Alex Gorelik, Senior Vice President of Research & Development at Informatica. Ron and Alex discuss how Informatica supports enterprises that adopt Hadoop.

Alex, we hear a lot today about Hadoop and big data. Based on your experience, what is the current state of Hadoop adoption in enterprises?

Alex Gorelik: Ron, Hadoop is a great solution for large-scale data analysis. You get all the complex functionality that you need to run very large-scale parallel clusters – fault tolerance, administration, and data processing – and you get MapReduce, which helps you create data processing jobs that can scale out to a massive number of nodes. Therefore, it's a very attractive economic model. It provides low-cost, very powerful computing, and it's also in production at Facebook, Yahoo and other large IT shops. We are seeing a very high rate of adoption, and many of our customers have pilot implementations. They have advanced technology groups or special advanced analytics groups that are mining the clusters and getting wonderful results. They're trying new use cases, and they're trying existing use cases. Where they are getting stuck is with mainstreaming it because with Hadoop you don't get the ETL capability, data governance, metadata, or data quality and profiling – all the things that we've been building for the last 20 years for the data ecosystem.

That actually leads into my next question. How has the world of big data and Hadoop affected Informatica and its product portfolio?

Alex Gorelik: Ron, it's been wonderful for us. There's a whole class of new big data applications that require data. These applications require data integration, data quality, and all the other capabilities that we bring to the table. We're working very hard to support Hadoop across the portfolio and get all of our products to be useful for those use cases. Big data is usually defined as increased volume, velocity, and variety of data. For the volume, we are obviously already a very large-scale integration platform, and we're pushing our processing into Hadoop to take advantage of the massively parallel scale. For the variety, we open sourced HParser, which is our graphical parsing engine that lets you define parsing without having to write Java code. That enables the use of the different varieties of data – XML, JSON, binary files, sequential files, and so on. For the velocity, we have Ultra-Messaging – a very large-scale messaging platform to feed data to Hadoop and throughout the systems. For some distributions, we are able to stream data onto the cluster. Different distributions have different capabilities.

Alex, how do you see the role of the data integration and ETL developer changing as a result of big data?

Alex Gorelik: Ron, just as the need for analytics is increasing, the need for ETL developers is also increasing. Now there's a new role in companies – data scientists. I went to one of the job websites just to see how popular it is, and I found 14,000 postings for data scientists. Basically, a data scientist is a cross between a business analyst, a data analyst, a data mining expert, and a statistician. These people understand the business, and they can take the datasets and run mathematical models, regression analysis, and so on to get business insight. This data scientist role is really popularized by Zynga, LinkedIn and Climate, all the new companies that basically take analytics and monetize it. Zynga monetizes data for gaming, Climate monetizes weather data for insurance, and LinkedIn monetizes data through advertising and recruiting. Effectively, they're data analytics companies.

The data scientists need data, and data engineers are needed to create trusted data from the different sources and make it available for data scientists. This is effectively the role of an ETL developer today.

At Informatica, we are focusing on taking the skills that ETL developers possess and making them relevant for Hadoop with our technology. Basically, people who understand data and what it takes to make data trusted and available can now use their Informatica skills to do this for Hadoop. They become data engineers who can help with all the new use cases.

Alex, one concern I have regarding Hadoop is that it's another data source. Will enterprises just be creating more silos similar to what happened with independent data marts for data warehousing?

Alex Gorelik: I don't think so, Ron. When I talk to customers, a lot of the people responsible for Hadoop clusters and a lot of the people writing advanced analytics have data warehousing backgrounds. They fought these wars in the past so they don't want to have a silo and invest in completely different capabilities for Hadoop. There are some vendors who are building Hadoop-specific functionality and management. We are seeing that IT shops don't want to do this. They want to have the same tools and the same levels of data governance across everything.

Recently when I was talking to a CIO, he said, “Alex, government regulations say that your data should be trusted and secure. They don't provide an exception if the data is in Hadoop.” You must have the same compliance and all the same capabilities that provide lineage, data quality, and data assurance in Hadoop that you need outside of it. There is recognition in the industry that you can't afford to have a whole new stack of tools and a whole new island of information that doesn't comply with the rest of the ecosystem.

Can you talk about some of the challenges that companies are facing as they try to incorporate Hadoop into their existing infrastructures?

Alex Gorelik: There are two main challenges that I've seen. One is around integration, which we just talked about. Hadoop is part of the ecosystem. You want to be able to apply the same data quality rules, the same data security protection, the same lifecycle management, archival and so on to Hadoop as you do outside of Hadoop.

A lot of the use cases we see involve getting data out of different sources into Hadoop, processing it, doing the analytics, and then taking it out and loading it into other systems, warehouses, in-memory, analytical systems, and so on. For that, you need lineage all the way across. Without that you don’t have compliance. You can’t do impact analysis to see what a change in one of the systems does to all of your other systems. You need to be able to orchestrate and see when things fail. You need to be able to tell if a load fails from one of the systems to ensure that the data can be trusted. You need to be able to profile the data and make sure you're getting good data all the way through. Basically, having this holistic approach to data flows is one of the challenges.

The other challenge is skill set. I mentioned some vendors are developing Hadoop-specific tools, but we really don’t believe IT shops are going to adopt completely different tooling and train their people on multiple systems. For that reason, we are developing Hadoop support for all of our solutions so that our users can develop new mappings that can run either inside or outside of Hadoop, orchestrate across all the steps inside and outside of Hadoop, and provide the same level of data quality, lineage, and impact analysis across the whole chain.

That sounds really good. For Hadoop to be enterprise ready, it must have the ability to support all enterprise data requirements.

Alex Gorelik: I absolutely agree, and these requirements are the same inside and outside of Hadoop. I think when you are doing pilots, you can relax it a little bit. You can have physical security, for example, for your Hadoop cluster as some of our customers do. But if you need to integrate back and forth across the whole ecosystem, it must meet enterprise standards. Hadoop is a very rapidly changing system. There are different systems that take priority. It is an open source community-developed system where sometimes projects come into focus and sometimes people move to other projects. A lot of our customers express concern about placing their bets on this open source community-developed system. If they code in, for example, Hive, which is a SQL interface to Hadoop files, and then that project doesn’t take off, and Pig, which is more of a data flow language, takes off, then all their investments might be lost. With Informatica, our value proposition is if you develop the logic, you can run it outside Hadoop or inside Hadoop, and we're going to support the new projects as well as the existing projects in Hadoop so that your investment is not lost.

Are there other ways that Informatica is helping customers deal with the rise of big data and its complexity?

Alex Gorelik: At Informatica, we're very focused on what we call the return on data, which is basically helping people increase the value of data and decrease the associated cost. Hadoop is actually a wonderful opportunity to increase the return on data by enabling a whole new set of use cases. It’s the same data, but now you can use it for a lot more and get a lot more value out of it by doing large-scale analytics. At the same time, it provides extensive compute power so it decreases the cost of data. We have a lot of Hadoop supporting capabilities in production. For example, PowerExchange for Hadoop allows our customers to extract data from all the systems that we support and load it into Hadoop as well as take data out of Hadoop and load it again into all the systems we support. We have the largest portfolio of connectors, and they all become available for the Hadoop use cases.

We have HParser, which is a visual development environment that allows the customers to create parsers without having to program. There is a community edition. We open sourced it, and it's going to be part of MapR distributions and, hopefully, Apache at some point. HParser allows customers to process complex files, XML, JSON, and sequential files as well as binary files. A lot of data is compressed or binary so they can visually parse it out and create something that data scientists and analytical packages can use.

We will be announcing Hadoop support for the rest of our products shortly. This includes running any Informatica mapping, profiling, and data quality rules on Hadoop itself. Our customers will be able to use their existing skill sets, assets, mappings and quality rules that they've developed for the data and use the scalability of Hadoop to run those without having to change the logic.

Alex, Release 9.5 is a big release for Informatica from the big data perspective. What do you feel is the biggest part of that announcement?

Alex Gorelik: As I mentioned, we're announcing support for Hadoop across the portfolio for big data – from pushing our mappings to the cluster, to loading data into the cluster and from the cluster, to being able to archive Hadoop data, to being able to profile and do data quality on Hadoop data, to being able to stream data using replication and Ultra-Messaging. Some of this functionality is already available in production, some will be with 9.5 and some will be in beta or pilot phases. Basically, we're embracing Hadoop in a big way because we see a lot of value for our customers from Hadoop, and we want to help to mainstream it. That's really our main focus. We want our customers to be able to get Hadoop out of advanced projects and pilots and let it be part of the data ecosystem, enable all the new use cases and increase the return on data.

Thank you, Alex, for providing our readers with an understanding of Informatica’s focus on big data and your commitment to helping them succeed with Hadoop.

Ron Powell

Ron, an independent analyst and consultant, has an extensive technology background in business intelligence, analytics and data warehousing. In 2005, Ron founded the BeyeNETWORK, which was acquired by Tech Target in 2010. Prior to the founding of the BeyeNETWORK, Ron was cofounder, publisher and editorial director of DM Review (now Information Management). Ron also has a wealth of consulting expertise in business intelligence, business management and marketing. He may be contacted by email at rpowell@powellinteractivemedia.com.