Hadoop Installation: Bare metal vs Cloud

Choosing the right analytics platform and provider comes down to how to store, manage and analyze massive amounts of data safely, effectively and affordably. A traditional on-premise Hadoop platform is quite expensive because a physical platform requires large numbers of servers, an equally big facility to house them, and a great deal of power to keep them running. In addition, on-premise Hadoop platforms need on-site IT teams to ensure smooth and glitch-free operations. In contrast, cloud storage requires no on-site hardware or support. Companies that implement Hadoop in the cloud also have the benefit of purchasing access to a fully scalable storage and analytics platform while paying only for what they use.

Traditionally, on-premise clusters have been a popular deployment option for Hadoop, with a thought towards avoiding I/O overhead in virtualized environments. Hadoop in an on-premise configuration gives businesses complete control over their Hadoop cluster and their data. Even with the proliferation of the cloud, some companies, particularly organizations facing heavy security and compliance regulations, may prefer to keep everything in-house. On-premise Hadoop also avoids the complexity of vendor SLAs and the potential for vendor lock-in.

On-Premise vs. Cloud:

On-premise platforms come with hard limits on storage capacity and performance because of their physical nature. As data demands increase, more physical servers must be added to the cluster, which is time-consuming and costly. A cloud platform is far more scalable: organizations can access effectively unlimited storage space on demand, and thousands of virtual servers can be spun up in a matter of minutes as and when needed. Here again, businesses pay only for the capacity they actually use to meet increased data demands.

With analytics platforms, productivity has become a function of data accessibility. A drawback of on-premise platforms is that data cannot always be accessed quickly and easily. With a cloud-based Hadoop platform, however, data can be accessed anytime from anywhere, including on smartphones and tablets, using an Internet connection. Greater and faster access to data results in increased productivity.

Organizations take calculated risks with cloud providers based on the provider’s level of security and responsiveness to technical issues. Though rare, a cloud provider can have downtime that affects a business’s ability to run operations or meet customers’ demand for queries.

Elasticity:

Elasticity means adapting to the workload in real time by adding or removing resources as needed. Hadoop on the cloud is exceptionally elastic because cloud instances can be started or stopped on demand. In fact, an entire cluster can be spun up for a job and discarded once the job is finished. On-premise Hadoop, by comparison, is not elastic at all: clusters can’t grow or contract quickly with changing demand. Even if extra machines are kept on the side for peak hours, they sit idle the rest of the time and still cost a lot of money.

Hadoop on the cloud, on the other hand, typically has a much lower total cost of ownership (TCO) than on-premise.

Cost:

Substantial money can be saved on staff and technical support, and by not having to maintain a data center. This leaves more cash to buy resources. The elasticity of Hadoop on the cloud also helps save money, since only the resources actually used are paid for.
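The pay-for-what-you-use argument can be illustrated with a little arithmetic. The sketch below compares a cluster permanently sized for peak load with one that is billed only for the nodes needed each hour. The demand profile and the price per node-hour are hypothetical values chosen for illustration, and the same rate is assumed for both cases, which is a simplification.

```python
from typing import Sequence


def fixed_cluster_cost(peak_nodes: int, hours: int, rate: float) -> float:
    """Cost of keeping a peak-sized cluster running for every hour."""
    return peak_nodes * hours * rate


def elastic_cluster_cost(hourly_demand: Sequence[int], rate: float) -> float:
    """Cost when each hour is billed only for the nodes actually needed."""
    return sum(hourly_demand) * rate


# Hypothetical day: 4 nodes for 20 quiet hours, 20 nodes for 4 peak hours.
demand = [4] * 20 + [20] * 4
RATE = 0.50  # assumed price per node-hour, illustrative only

fixed = fixed_cluster_cost(max(demand), len(demand), RATE)
elastic = elastic_cluster_cost(demand, RATE)
print(f"fixed: {fixed:.2f}  elastic: {elastic:.2f}")
```

With these assumed numbers the peak-sized cluster costs 240.00 per day against 80.00 for the elastic one. The gap narrows as the workload gets steadier, which is one reason clusters that run 24/7 are often kept on-premise.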

Performance:

A common assumption is that Apache Hadoop runs slowly in virtual environments because of its intensive I/O operations. However, cloud-based setups with the same TCO as a bare-metal setup can perform better when running Hadoop with real-world applications. A one-to-one comparison on the same hardware, though more costly, would probably make on-premise the winner thanks to faster network access and data locality. Also, because the entire infrastructure is virtualized in the cloud, incompatibility issues may arise at times and pose serious challenges to the smooth running of services.

Availability:

Availability for on-premise Hadoop depends on the quality of hardware and workforce available. There are no guarantees. High availability for the NameNode depends on whether a Hadoop 2.x or later release (the generation that introduced YARN) is used, where such a feature is available, or on workarounds prepared for older versions.

Availability of Hadoop on the cloud depends on which provider is used and on its service level agreement. AWS promises at least 99.95% availability for EC2, while GCP promises 99.9% availability for Cloud Storage and BigQuery. High availability for the cluster itself depends on the service provider’s implementation, so it may or may not be offered. But even if the cluster isn’t working, it’s always possible to add more machines or launch a new cluster on the cloud.
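To make these percentages concrete, the following sketch converts an availability figure into the downtime it still permits. Spreading the allowance over a full year is an assumption made for illustration. Providers define their own measurement periods in the SLA, so the actual terms should be checked there.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Maximum yearly downtime that still meets the given availability."""
    return (1 - availability_percent / 100) * HOURS_PER_YEAR


for pct in (99.95, 99.9):
    print(f"{pct}% -> {allowed_downtime_hours(pct):.2f} hours per year")
```

On that basis, 99.95% allows about 4.38 hours of downtime per year and 99.9% allows about 8.76 hours.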

Durability:

On-premise durability depends on HDFS. As per a statistical model for HDFS data durability (https://issues.apache.org/jira/browse/HDFS-2535), the probability of losing a block of data (64MB by default) on a large 4,000-node cluster (16 PB total storage, 250,736,598 block replicas) is 5.7×10⁻⁷ in the next 24 hours and 2.1×10⁻⁴ in the next year. However, for most clusters, which contain only a small number of instances, the probability of losing data can be much higher.
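The two figures quoted for the large cluster are consistent with each other, as a quick check shows. The sketch below compounds the 24-hour probability over 365 days. Treating each day as independent is an assumption made here for the check, not necessarily how the cited model was built.

```python
def probability_over_days(p_daily: float, days: int) -> float:
    """Chance of at least one loss across `days` independent days."""
    return 1 - (1 - p_daily) ** days


p_day = 5.7e-7  # 24-hour figure quoted above
p_year = probability_over_days(p_day, 365)
print(f"{p_year:.2e}")
```

The result is roughly 2.08×10⁻⁴, which rounds to the one-year figure quoted above.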

Hadoop on the cloud durability depends on how the data is stored. If HDFS is used, it is not much different from on-premise. However, other storage options such as S3 may tip the scale: S3 is designed for 99.999999999% durability of objects per year, meaning that with 10,000 objects stored, a single object could be expected to be lost once every 10,000,000 years on average. That’s pretty durable.
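The S3 figure can be worked through step by step. The sketch below uses exact fractions rather than floating point, because subtracting eleven nines from 1 in floating point introduces visible rounding error. The variable names are illustrative.

```python
from fractions import Fraction

# Eleven nines of durability per object per year, as an exact fraction.
durability = Fraction("99.999999999") / 100
objects_stored = 10_000

annual_loss_per_object = 1 - durability                # 1 in 10^11
expected_losses_per_year = objects_stored * annual_loss_per_object
years_per_lost_object = 1 / expected_losses_per_year

print(f"{int(years_per_lost_object):,} years")
```

This prints 10,000,000 years: the expected time between single-object losses when 10,000 objects are stored.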

Security:

When it comes to on-premise security, Hadoop platforms fare well. After all, sensitive data can safely be kept behind the corporate firewall. Deploying Hadoop securely on-premise means utilizing Hadoop features such as Kerberos authentication and setting the right file permissions. The idea of storing sensitive information offsite with a cloud provider can be quite unnerving for business executives. However, today’s cloud service providers typically apply modern security measures, such as built-in encryption, to protect data in transit and at rest.

With Hadoop on the cloud, it depends on the service provider’s implementation, so it’s essential to check their security policy. Amazon Web Services does provide security features such as virtual private clouds, encryption, security groups, and more, so Hadoop on the cloud can be secure if implemented correctly.

Customizability:

Hadoop on the cloud comes with certain limitations on which distribution or Hadoop project is available and whether or not custom code can be written.

Go where the data lives:

For big data workloads, what matters most is where the data being processed lives. If a social platform is cloud-based, it makes sense for the analytics platform to be cloud-based as well. If the data already exists in-house, it is more economical and efficient to process it on in-house data center servers. This helps minimize networking charges and reduce access times and time-to-analysis.
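A rough estimate shows why moving the data to the compute is often the expensive direction. The sketch below computes how long a bulk transfer would take over a dedicated link. The 100 TB data set and the 1 Gbps link are hypothetical, and the calculation assumes the link is fully used and ignores protocol overhead.

```python
def transfer_days(data_terabytes: float, link_gbps: float) -> float:
    """Days needed to move the data over a fully used link (decimal units)."""
    bits = data_terabytes * 1e12 * 8       # terabytes -> bits
    seconds = bits / (link_gbps * 1e9)     # gigabits per second -> bits per second
    return seconds / 86_400                # seconds -> days


print(f"{transfer_days(100, 1):.1f} days")
```

Under these assumptions the transfer takes a little over nine days, before any network charges are counted, so processing the data where it already lives avoids both the delay and the cost.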

Summary:

In choosing between an on-premise and a cloud-based Hadoop platform, both IT and business need to ensure that the selected solution works best from both a technical and a business standpoint.

Many large organizations prefer running Hadoop on-premise, especially those that use their Hadoop clusters 24/7, but they still need to customize it and have the right budget and workforce.