An Engineer's Story from the Trenches

The explosion of big data in recent years has locked the cloud computing giants in a constant race to deliver best-of-breed cloud big data solutions for business needs. Amazon Web Services (AWS) has long been known for its scalability, agility, and affordability, while Microsoft Azure has risen to compete as a powerful, versatile enterprise-class cloud platform.

I’ve recently learned that Bob Hummel, our expert in cloud deployments for big data, has just migrated a customer from AWS to Microsoft Azure. “Great, here’s my chance to get a deep dive into this topic,” I thought to myself. At last, I’m here today in Marley’s Brewery and Grill with Bob, after a long day of lumberjacking at his cabin, eager to hear his story.

Paul Nelson: Let’s start with AWS. How long would you say it took to create a production Cloudera big data system on AWS? Say, from scratch to production, not including development time.

Bob Hummel: About three weeks. I had done a number of Hadoop open source deployments prior to deploying Cloudera on AWS. But with AWS, I needed to address some issues like IP address resolution, clock synchronization, opening port numbers, and so on, to ensure the system runs smoothly.

Paul: So you saw challenges with IP address resolution, clock synchronization, and port numbers. Can you describe these in just a little more detail?

Bob: OK. With IP address resolution, the /etc/sysconfig/network file was set up automatically by Amazon with the hostname set to “localhost” rather than the machine’s domain name. This meant that processes in Hadoop couldn’t find each other. The domain names eventually needed to be fully qualified everywhere, including in /etc/hosts.
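[Ed. Note: For illustration, here is roughly what the corrected files look like on each node. The hostnames and addresses below are placeholders, not the actual cluster's.]

```
# /etc/hosts -- the same fully qualified entries on every node
10.0.0.11   node1.cluster.internal   node1
10.0.0.12   node2.cluster.internal   node2

# /etc/sysconfig/network -- the FQDN, not "localhost"
HOSTNAME=node1.cluster.internal
```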

The issue with the ports was that it took a while to identify the full list of port numbers to open so that Hadoop processes could talk to each other.
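[Ed. Note: As a sketch of what that work ends up looking like, the loop below prints iptables rules for a few common Cloudera/Hadoop default ports. These particular port numbers and the iptables approach are illustrative only; the authoritative per-service list is in the Cloudera documentation and varies by version.]

```shell
# Print (dry-run) firewall rules for a few common default ports:
# 8020 NameNode RPC, 50010 DataNode, 50070 NameNode web UI,
# 8032 YARN ResourceManager, 7180 Cloudera Manager.
for port in 8020 50010 50070 8032 7180; do
    echo "iptables -A INPUT -p tcp --dport $port -j ACCEPT"
done
```

Piping the output through a root shell would apply the rules; printing them first keeps this a dry run.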

Clock synchronization was a minor concern, together with some other minor things here and there.

To summarize, you need to go through the Hadoop installation instructions carefully to make sure everything, including SELinux security settings, is configured properly in advance.
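[Ed. Note: A minimal sketch of that kind of pre-flight check, assuming a RHEL-style host. The helper function and hostnames are purely illustrative; in practice you would also verify SELinux with `getenforce` and confirm the ntpd service is running.]

```shell
# Reject hostnames that Hadoop can't use: "localhost" or anything unqualified.
check_fqdn() {
    case "$1" in
        localhost*|'') return 1 ;;   # the auto-generated default
        *.*)           return 0 ;;   # contains a dot: fully qualified
        *)             return 1 ;;   # bare short name
    esac
}

check_fqdn "$(hostname -f 2>/dev/null || hostname)" \
    && echo "hostname looks fully qualified" \
    || echo "fix HOSTNAME in /etc/sysconfig/network and /etc/hosts"
```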

Paul: What types of machines did you end up with in AWS?

Bob: I believe they were M3 2XLarge, 32GB of RAM, 8 cores, with Amazon Elastic Block Storage (EBS) for the Hadoop Distributed File System (HDFS). [Ed. Note: HDFS is a highly fault-tolerant distributed file system designed to run on commodity hardware.]

Paul: Now, how long did it take to migrate the big data application from AWS to Microsoft Azure?

Bob: Oh gosh, it’s been so long I can barely remember. About nine months, I think.

Paul: Nine months? Are you serious? Why did it take so long?

Bob: One of the bottlenecks was that the hard disks on Azure are quite a bit slower than on AWS. Additionally, there were frequent high spikes in disk latency, causing the Cloudera nodes to lock up, which then required a reboot. For example, if a Cloudera data node tried to access the disk and Azure took as long as minutes to fulfill that request, the machine would be ‘locked.’ When a machine was ‘locked’ like that, Cloudera Manager would assume it had crashed and start a new instance, leaving the old process behind as an orphaned process.

Paul: How did you finally resolve the problem?

Bob: Unexpectedly! We were required to upgrade to the latest version of Linux Integration Services so that iostat (input/output statistics) would give us the logs that Microsoft Support wanted to see. After the upgrade, the disk latency problem was gone. The upgrade apparently included a number of performance improvements to how Linux communicates with Microsoft’s Hyper-V virtualization environment. Those changes finally solved the problem.
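[Ed. Note: iostat ships with the sysstat package; the `await` column (average wait per I/O request, in milliseconds) is the one that spikes during episodes like the lock-ups described above. The invocation below is just one common form.]

```
# Extended per-device statistics, sampled every 5 seconds, 3 reports
iostat -dx 5 3
```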

On the other hand, switching to Azure Premium Storage would have resolved that as well, but the service was not available in our required region (at the time it was offered only in West US, East US 2, West Europe, East China, Southeast Asia, and West Japan). In particular, our servers had to run in the North Europe region due to legal requirements on which countries could host our customer’s data. You can get more details from Cloudera’s reference architecture for Azure.

Paul: Were there other challenges that caused the long delay in deploying Cloudera on Azure?

Bob: There were network implications when we originally created a single storage account for the whole cluster. So we reorganized the storage (one storage account per host) and increased the number of disks per host from four to ten. We also upped the size of the disks to 1TB each and switched from geo-redundant to locally-redundant storage.

Another challenge was the lack of automation and a user-friendly interface. The Azure Web Portal can be slow and tedious. You have to dig into automating everything using PowerShell to ease the process of setting up an entire cluster.
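[Ed. Note: Bob's team scripted the setup in PowerShell. To give the flavor of automating it, here is a dry-run sketch that generates commands for the classic cross-platform `azure` CLI of that era; the machine names, image, and flags are placeholders from memory, so treat the exact syntax as an assumption to verify against current documentation.]

```shell
# Generate (not execute) one classic "azure vm create" command per node.
IMAGE="OpenLogic-CentOS-66"      # hypothetical image name
for i in 1 2 3 4; do
    echo "azure vm create hadoop-node$i $IMAGE clouduser --vm-size Large --ssh 220$i"
done
```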

Paul: How about handling the networking, IP addresses, and domain names with groups of machines?

Bob: Oh yeah, I remember. In AWS, all machines are individually accessible. In Azure, machines are grouped into a “cloud service” and they all respond to the same domain name but different ports.

Paul: How do you access different ports on the same machine?

Bob: The different ports are only for external access to individual machines. Internally, the machines talk to each other like they normally do. Anyway, this was quite different than AWS and took a little getting used to. Since we have different clusters of machines, such as Hadoop, Solr, Aspire, etc., Microsoft encouraged us to group each cluster of machines into a separate Cloud Service, by machine type, which controls the communications between these clusters.

Paul: What are your thoughts on deploying Cloudera to AWS vs. Azure?

Bob: What I really like about Azure is that it’s easy to set up our own internal network with subnets and other configurations for each different cluster, and then map the IPs of our choice to each machine, as well as our own domain names using our own DNS server. We set up our internal domain names to look exactly the same as the external domain names, which enabled the Cloudera web interfaces (Cloudera Manager, Job Tracker Interface, Resource Manager, etc.) to work properly.
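[Ed. Note: In DNS terms, “internal names matching external names” means the internal zone serves the same names clients use from outside. The zone entries below are placeholders, not the customer’s real records.]

```
; Internal DNS zone fragment: the same names clients use externally
; resolve internally, so links in the Cloudera web UIs work everywhere.
cm.example.com.      IN A    10.0.1.10    ; Cloudera Manager
node1.example.com.   IN A    10.0.1.11
node2.example.com.   IN A    10.0.1.12
```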

On AWS, all of the web links inside the Cloudera UIs use the internal machine names, and you have to map them on your client /etc/hosts file to get them to work. Although, now that I think of it, there may have been a way to set up a domain name server on AWS too (like we did on Azure), so perhaps it was just because we didn't think of it.

Paul: Any other comments on this cloud migration?

Bob: [Thinks] We’ve hit all of the high points. Oh, we also talked to a Microsoft engineer who had been working with Cloudera, and she gave us a list of memory settings which tuned Hadoop for the particular machines we were using. That improved performance of our MapReduce jobs by about 30%.

Paul: What types of machines did you end up with on Azure?

Bob: D14, 128GB of RAM, 32 cores.

Paul: Were there fewer machines needed in Azure, since they’re so much bigger than the 32GB RAM, 8 core machines built on AWS?

Bob: No. Our cluster has four nodes. You need at least four data nodes so you can still have 3x replication if a node goes down. It’s nice that Azure is more generous with memory than AWS.

Paul: Great! Sounds like this AWS-to-Azure thing was quite a learning experience.

Bob: Cloudera was going through the learning experience at the same time that I was. It was both challenging and rewarding being an early adopter of the Azure platform for big data.

Paul: Any last thoughts on Azure as a big data platform?

Bob: I would definitely recommend following Cloudera’s reference architecture for Microsoft Azure deployments to the letter, as there is less margin for error in Azure than in AWS. In general, Azure works fine if you have a thorough preparation and migration plan. Azure is in its early days, and I’m sure it will emerge as a key player in cloud infrastructure for big data over time.

Paul: Thank you so much for your time! And for letting me eat the rest of your chicken. Next time, let’s do this interview between two ferns.