Perspectives The Informatica Blog

John Haddad

This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.

Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high value data by reallocating less frequently used data to an alternative platform. It also aggregates data from previously untapped sources to give you a more complete picture of data.

So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?

Requirements for Hadoop in an Enterprise Data Hub

Let’s look at characteristics required to meet your EDH needs for a production environment:

Enterprise-grade

Interoperability

Multi-tenancy

Security

Operational

You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:

Consolidated Enterprise Data Hub

Enterprise-Grade

Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.

Interoperability

Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.

Multi-Tenancy

You should be able to logically divide clusters to support different use cases, job types, user groups, and administrators as needed. To avoid a complex, multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities to simplify the architecture. This reduces the risk of error and avoids duplicating data and effort.

Security

Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.

Operational

Any large-scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps to simplify the overall EDH architecture.

Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.

Informatica has extended its leadership in data integration and data quality to Hadoop with our Big Data Edition to address all of these Big Data challenges.

The biggest challenge companies face is finding and retaining the skilled resources needed to staff their Big Data projects. One large global bank started its first Big Data project with 5 Java developers, but as its Big Data initiative gained momentum it needed to hire 25 more Java developers that year. The bank quickly realized that while it had scaled its infrastructure to store and process massive volumes of data, it could not scale the skills needed to implement its Big Data projects. The research mentioned earlier indicates that 80% of the work in a Big Data project relates to data integration and data quality. With Informatica you can staff Big Data projects with readily available Informatica developers instead of an army of developers hand-coding in Java and other Hadoop programming languages. In addition, we’ve proven to our customers that Informatica developers are up to 5 times more productive on Hadoop than developers who hand-code, and they don’t need to know how to program on Hadoop. A large Fortune 100 global manufacturer needed to hire 40 data scientists for its Big Data initiative. Do you really want these hard-to-find and expensive resources spending 80% of their time integrating and preparing data?

Another key challenge is that it takes too long to deploy Big Data projects to production. One of our Big Data Media and Entertainment customers told me, prior to purchasing the Informatica Big Data Edition, that most of his Big Data projects had failed. Naturally, I asked him why they had failed. His response was, “We have these hot-shot Java developers with a good idea which they prove out in our sandbox environment. But then when it comes time to deploy it to production they have to re-work a lot of code to make it perform and scale, make it highly available 24×7, have robust error-handling, and integrate with the rest of our production infrastructure. In addition, it is very difficult to maintain as things change. This results in project delays and cost overruns.” With Informatica, you can automate the entire data integration and data quality pipeline; everything you build in the development sandbox environment can be immediately and automatically deployed and scheduled for production as enterprise ready. Performance, scalability, and reliability are handled through configuration parameters, without the rebuilding and rework that are typical with hand-coding. Informatica also makes it easier to reuse existing work and maintain Big Data projects as things change. The Big Data Edition is built on Vibe, our virtual data machine, and provides near-universal connectivity so that you can quickly onboard new types of data of any volume and at any speed.

Big Data technologies are emerging and evolving extremely fast. This in turn becomes a barrier to innovation, since these technologies evolve much too quickly for most organizations to adopt before the next big thing comes along. What if you place the wrong technology bet and find that it is obsolete before you have barely gotten started? Hadoop is gaining tremendous adoption, but it has evolved alongside other big data technologies, and there are literally hundreds of open source projects and commercial vendors in the Big Data landscape. Informatica is built on the Vibe virtual data machine, which means that everything you built yesterday and build today can be deployed on the major big data technologies of tomorrow. Today it is five flavors of Hadoop, but tomorrow it could be Hadoop and other technology platforms. One of our Big Data Edition customers stated after purchasing the product that “Informatica Big Data Edition with Vibe is our insurance policy to insulate our Big Data projects from changing technologies.” In fact, existing Informatica customers can take PowerCenter mappings they built years ago, import them into the Big Data Edition, and run them on Hadoop, in many cases with minimal changes and effort.

Another complaint from the business is that Big Data projects fail to deliver the expected value. In a recent survey (1), 86% of marketers said they could generate more revenue if they had a more complete picture of customers. We all know that the cost of selling a product to an existing customer is only about 10 percent of the cost of selling the same product to a new customer. But it’s not easy to cross-sell and up-sell to existing customers. Customer Relationship Management (CRM) initiatives help to address these challenges, but they too often fail to deliver the expected business value. The impact is low marketing ROI, poor customer experience, customer churn, and missed sales opportunities. By using Informatica’s Big Data Edition with Master Data Management (MDM) to enrich customer master data with Big Data insights, you can create a single, complete view of customers that yields tremendous results. We call this real-time customer analytics. Informatica’s solution improves total customer experience by turning Big Data into actionable information so you can proactively engage with customers in real time. For example, this solution enables customer service to know which customers are likely to churn in the next two weeks so they can take the next best action, and it enables sales and marketing to determine next best offers based on customer online behavior to increase cross-sell and up-sell conversions.

Chief Data Officers and their analytics team find it difficult to make Big Data fit-for-purpose, assess trust, and ensure security. According to the business consulting firm Booz Allen Hamilton, “At some organizations, analysts may spend as much as 80 percent of their time preparing the data, leaving just 20 percent for conducting actual analysis” (2). This is not an efficient or effective way to use highly skilled and expensive data science and data management resource skills. They should be spending most of their time analyzing data and discovering valuable insights. The result of all this is project delays, cost overruns, and missed opportunities. The Informatica Intelligent Data platform supports a managed data lake as a single place to manage the supply and demand of data and converts raw big data into fit-for-purpose, trusted, and secure information. Think of this as a Big Data supply chain to collect, refine, govern, deliver, and manage your data assets so your analytics team can easily find, access, integrate and trust your data in a secure and automated fashion.

If you are embarking on a Big Data journey I encourage you to contact Informatica for a Big Data readiness assessment to ensure your success and avoid the pitfalls of the top 5 Big Data challenges.

(1) Gleanster survey of 100 senior-level marketers, titled “Lifecycle Engagement: Imperatives for Midsize and Large Companies.” Sponsored by YesMail.

In the recent past, we were constrained by many limitations around data. Now, we are only limited by our imagination. By using more data and more types of data, we can fundamentally transform our organizations, and our world. These transformations bring great opportunity, but also come with challenges. To that end, here is my take on the top 5 big data challenges organizations face today:

1) It is difficult to find and retain resource skills to staff Big Data projects

The biggest challenge by far that we see with Big Data is that it is difficult to find and retain the resource skills to staff Big Data projects. The fact that “Its expertise is scarce and expensive” is the #1 concern about using Big Data, according to an Information Week survey of 541 business technology professionals (1). And according to Gartner, by 2015 only a third of the 4.4 million big data related jobs will be filled (5).

2) It takes too long to deploy Big Data projects from ‘proof-of-concept’ to production

At Hadoop Summit in June 2014, one of the largest Big Data conferences in the world, Gartner stated in their keynote that only about 30% of Hadoop implementations are in production (4). This observation highlights the second challenge which is that it takes too long to deploy Big Data projects from the ‘proof-of-concept’ phase into production.

3) Big data technologies are evolving too quickly to adapt

With the related market projected to grow from $28.5 billion in 2014 to $50.1 billion in 2015 according to Wikibon (6), Big Data technologies are emerging and evolving extremely fast. This in turn becomes a barrier to innovation since these technologies evolve much too quickly for most organizations to adopt before the next big thing comes along.

4) Big Data projects fail to deliver the expected value

Too many Big Data projects start off as science experiments and fail to deliver the expected value, primarily because of inaccurate scope. They underestimate what it takes to integrate, operationalize, and deliver actionable information at production scale. According to an Infochimps survey of 300 IT professionals, “55% of big data projects don’t get completed and many others fall short of their objectives” (3).

5) It is difficult to make Big Data fit-for-purpose, assess trust, and ensure security

Uncertainty is inherent to Big Data when dealing with a wide variety of large data sets coming from external data sources such as social, mobile, and sensor devices. Therefore, organizations often struggle to make their data fit-for-purpose, assess the level of trust, and ensure data-level security. According to Gartner, “Business leaders recognize that big data can help deliver better business results through valuable insights. Without an understanding of the trust implicit in the big data (and applying information trust models), organizations may be taking risks that undermine the value they seek.” (2)

(2) Big Data Governance From Truth to Trust, Gartner Research Note, July 2013

(3) “CIOs & Big Data: What Your IT Team Wants You to Know,” – Infochimps conducted its survey of 300 IT staffers with assistance from enterprise software community site SSWUG.ORG. http://visual.ly/cios-big-data

A hundred years from now people will look back at this period of time and refer to it as the Data Dark Ages. A time when the possibilities were endless but, due to siloed data fiefdoms and polluted data sets, the data science warlords and their minions experienced an insatiable hunger for data and rampant misinformation driving them to the brink of madness. The minions spent endless hours in the dark dungeons preparing data from raw and untreated data sources for their data science overseers. The world’s most vexing problems were solvable if only the people had abundant access to clean and safe data to drive their analytic engines.

Legend held that a wizard in the land of Informatica possessed the magic of a virtual data machine called Vibe where a legion of data engineers built an intelligent data platform to provide a limitless supply of clean, safe, secure, and reliable data. While many had tried to build their own data platforms only those who acquired the Informatica Intelligent Data Platform powered by Vibe were able to create true value and meaning from all types of data.

As word spread about Informatica Vibe and the Intelligent Data Platform, data scientists and analysts sought its magic so they could have greater predictive power over the future. The platform could feed any type of data of any volume into a data lake where Vibe, no matter the underlying technology, prepared and managed the data, and provisioned data to the masses hungry for actionable and reliable information.

An analytics renaissance soon emerged as more organizations adopted the Informatica Intelligent Data Platform where data was freely yet securely shared, integrated and cleansed at will, matched and correlated in real-time. The data prep minions were set free and data scientists were able to spend the majority of their time discovering true value and meaning through big data analytics. The pace of innovation accelerated and humanity enjoyed a new era of peace and prosperity.

Data Warehouse Optimization (DWO) is becoming a popular term that describes how an organization optimizes their data storage and processing for cost and performance while data volumes continue to grow from an ever increasing variety of data sources.

Data warehouses are reaching their capacity much too quickly as the demand for more data and more types of data is forcing IT organizations into very costly upgrades. Further compounding the problem is that many organizations don’t have a strategy for managing the lifecycle of their data. It is not uncommon for much of the data in a data warehouse to be unused or infrequently used, or for too much compute capacity to be consumed by extract-load-transform (ELT) processing. This is sometimes the result of business requests for one-off reports that are no longer used, or of staging raw data in the data warehouse. A large global bank’s data warehouse was exploding with 200TB of data, forcing the bank to consider an upgrade that would cost $20 million. The bank discovered that much of the data was no longer being used and could be archived to lower cost storage, thereby avoiding the upgrade and saving millions. This same bank continues to retire data monthly, resulting in ongoing savings of $2-3 million annually. A large healthcare insurance company discovered that fewer than 2% of its ELT scripts were consuming 65% of its data warehouse CPU capacity. This company is now looking at Hadoop as a staging platform to offload the storage of raw data and ELT processing, freeing up its data warehouse to support the hundreds of concurrent business users. A global media & entertainment company saw its data increase by 20x per year and the associated costs increase 3x within 6 months as it on-boarded more data such as web clickstream data from thousands of web sites and in-game telemetry data.

In this era of big data, not all data is created equal with most raw data originating from machine log files, social media, or years of original transaction data considered to be of lower value – at least until it has been prepared and refined for analysis. This raw data should be staged in Hadoop to reduce storage and data preparation costs while the data warehouse capacity should be reserved for refined, curated and frequently used datasets. Therefore, it’s time to consider optimizing your data warehouse environment to lower costs, increase capacity, optimize performance, and establish an infrastructure that can support growing data volumes from a variety of data sources. Informatica has a complete solution available for data warehouse optimization.

The first step in the optimization process as illustrated in Figure 1 below is to identify inactive and infrequently used data and ELT performance bottlenecks in the data warehouse. Step 2 is to offload the data and ELT processing identified in step 1 to Hadoop. PowerCenter customers have the advantage of Vibe which allows them to map once and deploy anywhere so that ELT processing executed through PowerCenter pushdown capabilities can be converted to ETL processing on Hadoop as part of a simple configuration step during deployment. Most raw data, such as original transaction data, log files (e.g. Internet clickstream), social media, sensor device, and machine data should be staged in Hadoop as noted in step 3. Informatica provides near-universal connectivity to all types of data so that you can load data directly into Hadoop. You can even replicate entire schemas and files into Hadoop, capture just the changes, and stream millions of transactions per second into Hadoop such as machine data. The Informatica PowerCenter Big Data Edition makes every PowerCenter developer a Hadoop developer without having to learn Hadoop so that all ETL, data integration and data quality can be executed natively on Hadoop using readily available resource skills while increasing productivity up to 5x over hand-coding. Informatica also provides data discovery and profiling tools on Hadoop to help data science teams collaborate and understand their data. The final step is to move the resulting high value and frequently used data sets prepared and refined on Hadoop into the data warehouse that supports your enterprise BI and analytics applications.
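The first step, identifying inactive data and ELT bottlenecks, can be roughed out in a few lines of Python. This is only a sketch under stated assumptions and is not part of any Informatica product: the table names, job names, access dates, CPU figures, the 180-day idle threshold, and the 65% CPU share are all hypothetical, and a real assessment would draw these numbers from the data warehouse's own query and workload logs.

```python
from datetime import date


def find_inactive_tables(last_access, today, max_idle_days=180):
    """Return names of tables not queried within max_idle_days, sorted by name."""
    return sorted(
        name
        for name, accessed in last_access.items()
        if (today - accessed).days > max_idle_days
    )


def top_cpu_jobs(cpu_seconds, share=0.65):
    """Return the heaviest jobs whose combined CPU time first reaches `share` of the total."""
    total = sum(cpu_seconds.values())
    selected, running = [], 0.0
    # Walk jobs from heaviest to lightest until the target share is covered.
    for job, secs in sorted(cpu_seconds.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        selected.append(job)
        running += secs
    return selected


# Hypothetical usage statistics (assumed for illustration only).
last_access = {
    "sales_2009": date(2012, 1, 15),
    "sales_current": date(2013, 11, 1),
    "web_clicks_raw": date(2013, 3, 2),
}
cpu_seconds = {
    "elt_sessionize": 7000,
    "elt_dedupe": 1500,
    "elt_load_dim": 1000,
    "elt_rollup": 500,
}

inactive = find_inactive_tables(last_access, today=date(2013, 11, 10))
heavy = top_cpu_jobs(cpu_seconds)
print("Archive or offload candidates:", inactive)
print("ELT jobs to consider moving to Hadoop:", heavy)
```

With this sample data, the two stale tables are flagged as offload candidates and a single job accounts for the targeted share of CPU, which mirrors the pattern described above in which a small fraction of ELT scripts consumes most of the warehouse capacity.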

To get started, Informatica has teamed up with Cloudera to deliver a reference architecture for data warehouse optimization so organizations can lower infrastructure and operational costs, optimize performance and scalability, and ensure enterprise-ready deployments that meet business SLAs. To learn more, please join the webinar A Big Data Reference Architecture for Data Warehouse Optimization on Tuesday November 19 at 8:00am PST.

The power of big data means you can access and analyze all of your data. Several of the world’s top companies and government agencies use MongoDB for applications today, which means more and more data is being stored in MongoDB. Informatica ensures that you can access and integrate all of this vital data with other enterprise data.

To make it easy for developers to get data into MongoDB, Informatica provides a visual development environment with near-universal connectivity, including MongoDB, along with pre-built parsers and transforms to ingest data into MongoDB. With Informatica and MongoDB, developers waste no time accessing and preparing the data necessary to build their big data applications at scale. Informatica can access all types of data from traditional relational databases, legacy mainframes, enterprise applications such as ERP and CRM, cloud applications, social data, machine data, and industry standards data. Once the data is accessed, you can integrate and transform it into the native MongoDB JSON document format. For example, a large insurance company is moving massive amounts of policy information from a dozen relational data sources, transforming it from relational to hierarchical JSON documents, and populating MongoDB.
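To make the relational-to-hierarchical transformation concrete, here is a minimal Python sketch of the general idea. It is not how Informatica performs the mapping, and it does not connect to MongoDB; the `policies` and `coverages` tables, their column names, and the sample values are all hypothetical. It simply shows how child rows that would be joined in a relational model can be nested inside one JSON document per parent row.

```python
import json
from collections import defaultdict


def rows_to_documents(policies, coverages):
    """Nest each policy's coverage rows inside a single document per policy."""
    # Group child rows by their foreign key so each policy can embed its own list.
    by_policy = defaultdict(list)
    for cov in coverages:
        by_policy[cov["policy_id"]].append(
            {"type": cov["type"], "limit": cov["limit"]}
        )
    return [
        {
            "_id": p["policy_id"],  # MongoDB uses "_id" as the document key
            "holder": p["holder"],
            "coverages": by_policy[p["policy_id"]],
        }
        for p in policies
    ]


# Hypothetical relational rows (assumed for illustration only).
policies = [
    {"policy_id": 101, "holder": "A. Rivera"},
    {"policy_id": 102, "holder": "B. Chen"},
]
coverages = [
    {"policy_id": 101, "type": "collision", "limit": 50000},
    {"policy_id": 101, "type": "liability", "limit": 100000},
    {"policy_id": 102, "type": "liability", "limit": 250000},
]

documents = rows_to_documents(policies, coverages)
print(json.dumps(documents[0], indent=2))
```

Embedding the coverages inside the policy document means an application can read a complete policy in one lookup instead of joining two tables, which is the main reason for reshaping relational data before loading it into a document store.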

Informatica ensures that MongoDB does not become another data silo in your enterprise information management infrastructure. Now with Informatica, companies can unlock the data in MongoDB for downstream analytics to improve decision making and business operations. Using the Informatica PowerCenter Big Data Edition with the PowerExchange for MongoDB adapter, you can access data in MongoDB, parse the JSON-based documents, and then transform the data and combine it with other information for big data analytics, all without having to write a single line of code. Informatica + MongoDB is a powerful combination that increases developer productivity up to 5x so you can build and deploy big data applications much faster.

The hype around big data is certainly top of mind with executives at most companies today but what I am really seeing are companies finally making the connection between innovation and data. Data as a corporate asset is now getting the respect it deserves in terms of a business strategy to introduce new innovative products and services and improve business operations. The most advanced companies have C-level executives responsible for delivering top and bottom line results by managing their data assets to their maximum potential. The Chief Data Officer and Chief Analytics Officer own this responsibility and report directly to the CEO.

Has big data entered the “trough of disillusionment?” That’s what I’ve heard recently. As with many hyped-up technology trends, the trough can be deep and long as project failures accumulate, or, for ‘hot’ trends that evolve and mature quickly, it can be shallow and short, leading to broader and rapid adoption. Is the big data hype failing to deliver on its promise of increased revenue and competitive advantage for companies that leverage big data to introduce new products and services and improve business operations? Why is it that some big data projects fail to deliver on their promise? Svetlana Sicular, Research Director, Gartner points out in her blog Big Data is Falling into the Trough of Disillusionment that, “These [advanced client] organizations have fascinating ideas, but they are disappointed with a difficulty of figuring out reliable solutions.” There are several reasons why big data projects may fail to deliver on their promise:

In a recent webinar, Mark Smith, CEO at Ventana Research and David Lyle, vice president, Product Strategy at Informatica discussed: “Building the Business Case and Establishing the Fundamentals for Big Data Projects.” Mark pointed out that the second biggest barrier that impedes improving big data initiatives is that the “business case is not strong enough.” The first and third barriers respectively, were “lack of resources” and “no budget” which are also related to having a strong business case. In this context, Dave provided a simple formula from which to build the business case:

With the rise in popularity of the elusive and expensive data scientist, it’s very sad that once a data science team is assembled (at a very high recurring cost to the company, I may add) they spend most of their time doing work they weren’t really hired to do in the first place. That’s right! It turns out that data scientists spend only about 20% of their time doing real analysis, that is, the work they were trained to do. How is the other 80% of their time spent?