Posts Tagged 'Hadoop'

Companies are producing massive amounts of data—otherwise known as big data. There are many options available to manage big data and the analytics associated with it. One of the more popular options is Apache Hadoop, an open source software framework designed to scale up and down quickly with a high degree of fault tolerance. Hadoop lets organizations gather and examine large amounts of structured and unstructured data.

In the past, high CAPEX and deployment costs made big data and Hadoop clusters cost prohibitive. Cloud providers, like IBM Cloud, have made it possible to break through the cost barriers. The cloud model, with its utility-type billing and usage charges, makes it possible to build big data clusters, use them for a specific project, then tear them down. IBM Cloud is a great solution for this type of scenario and makes sense for those who require short-term or project-based Hadoop clusters. Hadoop on IBM Cloud allows organizations to respond faster to changing business needs and requirements without the upfront CAPEX.

What makes Hadoop on IBM Cloud so compelling are the components that are available in the IBM Cloud offering. Customers have the ability to choose and use the same type of components and standards that they would use in their own data centers. These components include bare metal servers, unmetered private networks, and enterprise-grade block and object storage. IBM Cloud also offers GPUs for the most processor-intensive big data workloads. Customers don’t have to settle for less when deploying their Hadoop clusters in IBM Cloud.

Hadoop on IBM Cloud supports multiple data centers in different regions across the globe. The diagram below shows the layout of Hadoop clusters across multiple IBM Cloud data centers.

Editor’s Note: Does your brain switch off when you hear industry-speak words like “innovation,” “transformation,” “leading edge,” “disruptive,” and “paradigm shift”? Go on, go ahead and admit it. Ours do, too. That’s why we’re launching the future.ready() series—consisting of blogs, podcasts, webinars, and Twitter chats—with content created by developers, for developers. Nothing fluffy, nothing buzzy. With the future.ready() series, we aim to equip you with tools and knowledge that you can use—not just talk and tweet about.

For the first edition, I’ve invited Frank Ketelaars, an expert in the high-volume data space, to walk us through seven things to check off when starting a big data development project.

I have worked on multiple high volume projects in industries that include banking, telecommunications, manufacturing, life sciences, and government, and in roles including architect, big data developer, and streaming analytics specialist. Based on my experience, here’s a checklist I put together that should give developers a good start. Did I miss anything? Join me on the Twitter chat or webinar to share your experience, ask questions, and discuss further. (See details below.)

1. Team up with a person who has a budget and a problem you can solve.

For a successful big data project, you need to solve a business problem that’s keeping somebody awake at night. If there isn’t a business problem and a business owner—ideally one with a budget—your project won’t get implemented. Experimentation is important when learning any new technology. But before you invest a lot of time in your big data platform, find your sponsor. To do so, you’ll need to talk to everyone, including IT, business users, and management. Remember that the technical advantages of analytics at scale might not immediately translate into business value.

2. Get your systems ready to collect the data.

With additional data sources such as devices, vehicles, and sensors connected to networks and generating data, the variety of information and transport mechanisms has grown dramatically, posing new challenges for collecting and interpreting data.

Big data often comes from sources outside the business. External data comes at you in a variety of formats (including XML, JSON, and binary), and through a variety of APIs. In 2016, you might think that everyone is on REST and JSON, but think again: SOAP still exists! The variety of the data is the primary technical driver behind big data investments, according to a survey of 402 business and IT professionals by management consultancy NewVantage Partners. From one day to the next, the API might change or a source might become unavailable.

Maybe one day we’ll see more standardization, but it won’t happen any time soon. For now, developers must plan to spend time checking for changes in APIs and data formats, and be ready to respond quickly to avoid service interruptions. And they should expect the unexpected.
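To make that concrete, here is a minimal Python sketch of defensive ingestion, assuming hypothetical field names and payload shapes: it sniffs whether a payload arrived as JSON or XML and logs a warning when expected fields go missing, so format drift gets noticed early instead of failing silently.

```python
import json
import logging
import xml.etree.ElementTree as ET

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingest")

# Hypothetical schema: the fields downstream jobs expect to find.
EXPECTED_FIELDS = {"id", "timestamp", "value"}

def parse_payload(raw: bytes) -> dict:
    """Best-effort parse of an external payload that may arrive as JSON or XML."""
    text = raw.decode("utf-8", errors="replace").strip()
    if text.startswith("{"):
        record = json.loads(text)
    elif text.startswith("<"):
        root = ET.fromstring(text)
        record = {child.tag: child.text for child in root}
    else:
        raise ValueError("unrecognized payload format")

    missing = EXPECTED_FIELDS - record.keys()
    if missing:
        # The upstream API may have changed shape; flag it rather than fail silently.
        log.warning("payload missing expected fields: %s", sorted(missing))
    return record

if __name__ == "__main__":
    print(parse_payload(b'{"id": 1, "timestamp": "2016-05-01T12:00:00Z", "value": 42}'))
    print(parse_payload(b"<reading><id>2</id><value>17</value></reading>"))
```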

3. Make sure you have the right to use that data.

Governance is a business challenge, but it’s going to touch developers more than ever before—from the very start of the project. Much of the data they will be handling is unstructured, such as text records from a call center. That makes it hard to work out what’s confidential, what needs to be masked, and what can be shared freely with external developers. Data will need to be structured before it can be analyzed, but part of that process includes working out where the sensitive data is, and putting measures in place to ensure it is adequately protected throughout its lifecycle.
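As a rough illustration of that masking step, here is a minimal Python sketch that redacts obvious PII patterns from free-text records before they land in the cluster. The patterns are simplistic placeholders; real rules should come from your data-protection and governance teams.

```python
import re

# Hypothetical patterns: mask obvious PII in free-text call-center records.
PATTERNS = {
    "card":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\+?\d[\d -]{7,14}\d\b"),
}

def mask_pii(text: str) -> str:
    """Replace anything matching a known PII pattern with a redaction token."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text

if __name__ == "__main__":
    record = "Customer j.doe@example.com called about card 4111 1111 1111 1111."
    print(mask_pii(record))
```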

Developers need to work closely with the business to ensure that they can keep data safe, and provide end users with a guarantee that the right data is being analyzed and that its provenance can be trusted. Part of that process will be about finding somebody who will take ownership of the data and attest to its quality.

Not all of these tools and technologies will be right for you, but they hint at one way the developer’s core competency must change. Big data will require developers to be polyglots, conversant in perhaps five languages, who specialize in learning new tools and languages fast—not deep experts in one or two languages.

Nota bene: MapReduce and Pig are among the highest-paid technology skills in the US, and other big data skills are likely to be highly sought after as demand for them grows. Scala is a relatively new functional programming language for data preparation and analysis, and I predict it will be in high demand in the near future.
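If you have never written a MapReduce job, here is what the model boils down to, sketched in Python in the Hadoop Streaming style with the canonical word-count example. On a real cluster the mapper and reducer would run as separate scripts fed to the streaming jar; this version runs locally for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Mapper: emit (word, 1) for every word in every input line.
def mapper(lines):
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

# Reducer: group the sorted pairs by word and sum the counts.
def reducer(pairs):
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    sample = ["big data needs big clusters", "big data moves to the data"]
    for word, total in reducer(mapper(sample)):
        print(f"{word}\t{total}")
```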

5. Forget “off-the-shelf.” Experiment and set up a big data solution that fits your needs.

You can think of big data analytics tools like Hadoop as a car. You want to go to the showroom, pay, get in, and drive away. Instead, you’re given the wheels, doors, windows, chassis, engine, steering wheel, and a big bag of nuts and bolts. It’s your job to assemble it.

When experimenting with concepts and technologies to solve a certain business problem, also think about successful deployment in the organization. The project does not stop after the proof of concept.

6. Secure resources for changes and updates.

Apache Hadoop and Apache Spark are still evolving rapidly. It is inevitable that the behavior of components will change over time, and some may be deprecated shortly after their initial release. Implementing new releases will be painful, and developers will need an overview of the big data infrastructure to ensure that, as components change, their big data projects continue to perform as expected.

The developer team must plan time for updates and deprecated features, and a coordinated approach will be essential for keeping on top of the changes.
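One cheap habit that helps: have each job announce the platform version it runs on, so an unplanned upgrade is noticed before it breaks something downstream. Here is a minimal sketch, assuming a PySpark job; the "tested" version numbers are placeholders.

```python
from pyspark.sql import SparkSession

# Versions the team has actually validated against (placeholder values).
TESTED_SPARK_VERSIONS = {"2.0.2", "2.1.0"}

spark = SparkSession.builder.appName("version-guard").getOrCreate()

if spark.version not in TESTED_SPARK_VERSIONS:
    # Don't abort; log loudly so deprecations and behavior changes get
    # investigated before they silently break downstream jobs.
    print(f"WARNING: running on untested Spark {spark.version}; "
          f"tested versions are {sorted(TESTED_SPARK_VERSIONS)}")
```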

My preferred definition of big data (and there are many – Forbes found 12) is this: "Big data is when you can no longer afford to bring the data to the processing, and you have to do the processing where the data is."

In traditional database and analytics applications, you get the data, load it onto your reporting server, process it, and post the results to the database.

With big data, you have terabytes of data, which might reside in different places—and which might not even be yours to move. Getting it to the processor is impractical. Big data technologies like Hadoop are based on the concept of data locality—doing the processing where the data resides.
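A quick back-of-the-envelope calculation shows why. The sketch below assumes an ideal, uncontended 1 Gbps link; real transfers over shared networks are slower.

```python
# Rough illustration of why shipping big data to the processing is impractical.
# The figures are back-of-the-envelope assumptions, not benchmarks.

def transfer_hours(terabytes: float, gigabits_per_sec: float) -> float:
    bits = terabytes * 8 * 10**12               # decimal terabytes to bits
    return bits / (gigabits_per_sec * 10**9) / 3600

for tb in (1, 10, 100):
    print(f"{tb:>4} TB over a 1 Gbps link: ~{transfer_hours(tb, 1):.1f} hours")
```

Even under those generous assumptions, 100 TB takes more than nine days to move.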

You can run Hadoop in a virtualized environment. Virtual servers don’t have local data, though, so the time taken to transport data between the SAN or other storage device and the server hurts the application’s performance. Noisy neighbors, unpredictable server speeds and contested network connections can have a significant impact on performance in a virtualized environment. As a result, it’s difficult to offer service level agreements (SLAs) to end users, which makes it hard for them to depend on your big data implementations.

The answer is to use bare metal servers on demand, which enable you to predict and guarantee the level of performance your application can achieve, so you can offer an SLA with confidence. Clusters can be set up quickly, so you can get your project moving fast. Because performance is predictable and consistent, it’s possible to offer SLAs to business owners that will encourage them to invest in the big data project and rely on it for making business decisions.

How can I learn more?

Join me in the Twitter chat and webinar (details below) to discuss how you’re addressing big data or have your questions answered by me and my guests.

Frank Ketelaars has been Big Data Technical Leader in Europe for IBM since August 2013. As an architect, big data developer, and streaming analytics specialist, he has worked on multiple high volume projects in banking, telecommunications, manufacturing, life sciences and government. He is a specialist in Hadoop and real-time analytical processing.

We invite each of our featured SoftLayer Tech Marketplace Partners to contribute a guest post to the SoftLayer Blog, and this week, we're happy to welcome Yaniv Mor from Xplenty. Xplenty is a cloud-based, code-free Hadoop-as-a-Service platform that allows you to easily create data workflows and provision, monitor, and scale clusters. Their goal is to eliminate the complexity of Hadoop to make it accessible and cost-effective for everyone.

Simplifying Hadoop

Apache Hadoop, open source software developed by Doug Cutting, is the most popular storage and processing platform for big data. Because Hadoop can accommodate structured data, semi-structured data, and unstructured data, it is the storage architecture of choice for some of the Internet's largest and most data-rich sites. Industry giants such as Yahoo! and Facebook have been using Hadoop for years to store and deliver information while gathering insights from customer behavior and internal business processes, and their obvious success with the platform has helped drive broad adoption and popularity all the way down to small businesses and startups.

Specific use cases vary among industries, but similarities exist. Many companies leverage Hadoop to gather information about their clientele. With Hadoop, a company can process huge amounts of data to examine past and present behaviors. With that information, customers can be presented with personally tailored recommendations, and the business can glean deep insights from the trends and outliers in its customer base. As a result, customers are more likely to make repeat purchases, and companies are able to predict trends and possible risks, allowing them to visualize and prepare for a number of business scenarios.

Another compelling use case for Hadoop is its ability to analyze and report on multi-faceted marketing and advertising campaigns. By drilling down into the guts of a campaign, users can see exactly what worked and what didn't. Marketers and advertisers can direct their resources to the campaigns that worked and let the ineffective ones fall by the wayside.

On the internal side, businesses are using Hadoop to better understand their own information. Data systems at financial companies use it to detect fraud anomalies by comparing transaction details. If you've ever made a credit card purchase in another state or country but the purchase didn't go through, your bank's system probably flagged the transaction for a representative to investigate. Other companies analyze data collected from their networks to monitor activity and diagnose bottlenecks and other issues that have a negative impact.
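As a toy illustration of that fraud rule, here is a minimal Python sketch that flags a card transaction whose country does not appear in the customer's recent history. Real fraud systems combine many such signals at far larger scale; the data and the ten-transaction window here are made up.

```python
from collections import defaultdict

# Per-customer history of transaction countries (toy, in-memory version).
history = defaultdict(list)

def check(customer_id: str, country: str) -> bool:
    """Return True if the transaction looks anomalous for this customer."""
    recent = history[customer_id][-10:]
    suspicious = bool(recent) and country not in recent
    history[customer_id].append(country)
    return suspicious

if __name__ == "__main__":
    for cust, country in [("c1", "US"), ("c1", "US"), ("c1", "BR"), ("c1", "US")]:
        flag = "FLAG" if check(cust, country) else "ok"
        print(cust, country, flag)
```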

The challenge with leveraging Hadoop's broad potential is that a company generally needs dedicated technical resources to allocate toward building and maintaining the solution — from manpower to finances to infrastructure. Hadoop is difficult to program and requires a very specific skill set that few possess. If a company doesn't have the personnel for the job, it will need to fork over some serious cash to get a system built and maintained. This can significantly hinder the progress of the data and business intelligence teams, and by extension, the progress of the company. That's why we decided to create Xplenty.

Xplenty is a code-free Hadoop-as-a-Service platform that allows data and BI users to process their big data stored on the SoftLayer cloud without having to acquire any special skills. Xplenty removes the need to divert precious resources away from the business at hand: its platform has a graphical user interface that enables the data and BI teams to build data flows without ever having to write a line of code. The benefit is twofold. First, business intelligence analysts can quickly build data flows that would typically take weeks or more to program and debug, and data users can easily insert Xplenty into their data stack to handle processing needs. Second, since the IT department doesn't have to do any programming, it can tackle more pressing issues, bottlenecks are avoided, and life goes on without a hitch.

Xplenty was created specifically for the cloud, and SoftLayer is a major player in this space, so it was a natural fit for us to partner up to provide a SoftLayer-specific offering that will perform even better for customers already using SoftLayer infrastructure. We only work with providers with the best and most stable infrastructure, and SoftLayer is definitely at the top of the list.

If you want to try Hadoop on Xplenty, jump over to our SoftLayer sign up page, enter your details, and test drive the platform with a free 30-day trial!

This guest blog series highlights companies in SoftLayer's Technology Partners Marketplace. These Partners have built their businesses on the SoftLayer Platform, and we're excited for them to tell their stories. New Partners will be added to the Marketplace each month, so stay tuned for many more to come.

So there I was after work today, sitting in my favorite watering hole drinking my Jagerbomb, when Caira, my bartender, asked what was on my mind. I told her that I had been working with clouds and elephants all day at work, and neither of those things is little. She laughed and asked if I had stopped anywhere to get a drink prior to her bar. I replied no, I'm serious: I had to make some large clouds and a stampede of elephants work together.

I then explained to her what Hadoop was. Hadoop is a popular open source implementation of Google's MapReduce. It allows transformation and extensive analysis of large data sets across thousands of nodes while processing petabytes of data, and it is used by websites such as Yahoo!, Facebook, and China's leading search engine, Baidu. I explained what cloud computing was (multiple computing nodes working together), hence my reference to the clouds, and how Hadoop was named after a stuffed elephant that belonged to the child of one of its creators, Doug Cutting. Now she doesn't think I'm quite as crazy.