Innovation in data processing and machine learning technology

Updating Cloud Dataproc for faster speeds and more resiliency

Friday, January 26, 2018

By James Malone, Cloud Dataproc Product Manager

When you run critical Apache Hadoop, Apache Spark, or Apache Hive applications, you probably don’t want a single point of failure to put them at risk. But in traditional Spark and Hadoop clusters, the single master node can be exactly that: a single point of failure. Although VM failure in Google Cloud Dataproc is unlikely due to features like live migration for virtual machines, we’ve heard from some enterprise customers that high availability is a must for critical applications. That’s why high availability (HA) for both Apache Hadoop YARN and Apache Hadoop HDFS is now generally available in Cloud Dataproc. Cloud Dataproc HA helps eliminate worries about a single point of failure for critical workloads.

By default, Cloud Dataproc clusters use one “master” node in a cluster. However, high availability in Cloud Dataproc lets you run an odd number (3) of master nodes for redundancy. While the failure of a single master node is highly unlikely in Cloud Dataproc, high availability provides a mechanism to tolerate the failure of a master node so cluster operation is uninterrupted. Although HDFS is supported in high availability, we still highly recommend you use Google Cloud Storage for storing your data. Cloud Dataproc clusters provide a Cloud Storage connector for Hadoop that is easy to use and provides major benefits over HDFS, including multiple storage classes, high durability, and performance. In most cases, Cloud Dataproc uses HDFS for temporary data while jobs run to minimize network traffic and to increase performance.
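To illustrate why an odd number of master nodes is a natural choice, here is a small sketch of majority-vote arithmetic. It assumes the masters coordinate by strict majority, as quorum-based services such as Apache ZooKeeper ensembles do. The function name is hypothetical and this is illustrative arithmetic only, not Cloud Dataproc code.

```python
def tolerated_failures(num_masters: int) -> int:
    """Return how many nodes can fail while a strict majority still remains.

    Assumption: coordination requires a strict majority of nodes to be alive.
    """
    if num_masters < 1:
        raise ValueError("num_masters must be at least 1")
    majority = num_masters // 2 + 1  # smallest group that is more than half
    return num_masters - majority


# Compare a single master, three masters, and four masters.
for n in (1, 3, 4):
    print(n, "masters can tolerate", tolerated_failures(n), "failure(s)")
```

Under this assumption, three masters tolerate one failure, and a fourth master would add cost without tolerating any additional failures.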

High availability in Cloud Dataproc is implemented using the native high availability functionality in Apache Hadoop, with Apache ZooKeeper handling the election of master nodes. You can enable high availability when you create a cluster, using either the Google Cloud Console or the Google Cloud SDK. For example, in the Cloud SDK, enabling high availability is as simple as specifying that a cluster should have three master nodes with the --num-masters flag.
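As a minimal sketch, the snippet below assembles a Cloud SDK command that requests three master nodes. The cluster name and the helper function are hypothetical, and the snippet only prints the command rather than running it. Depending on your SDK version and configuration, additional arguments (such as a region) may be required.

```python
import shlex
from typing import List


def build_ha_create_command(cluster_name: str) -> List[str]:
    """Assemble a gcloud command that creates a cluster with three masters."""
    return [
        "gcloud", "dataproc", "clusters", "create", cluster_name,
        "--num-masters=3",  # three master nodes enables high availability
    ]


# "my-ha-cluster" is a placeholder name; substitute your own.
command = build_ha_create_command("my-ha-cluster")

# Print the command for review before running it in a shell.
print(shlex.join(command))
```

Building the command as a list keeps each argument separate, so it can be passed to a process runner later without any shell quoting concerns.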

In addition to increasing reliability, we now also offer an option for greater performance on Cloud Dataproc. As a beta feature, you can now create clusters that use SSD persistent disks (PD-SSD). SSD persistent disks are designed for workloads with high rates of random IOPS, which may provide large benefits to some Spark and Hadoop workloads. For example, PD-SSD may be ideal for applications that read and write data frequently. You can use PD-SSD on Cloud Dataproc with master, worker, and preemptible nodes. For more information, review the Cloud Dataproc PD-SSD documentation.

You can choose PD-SSD when creating clusters in the Google Cloud Console, or by passing these optional arguments when creating clusters with the Google Cloud SDK:

--master-boot-disk-type

--worker-boot-disk-type

--preemptible-worker-boot-disk-type
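The sketch below shows one way these arguments might be combined into a single Cloud SDK command. It makes two assumptions beyond what is listed above: the disk type value is pd-ssd, and the command runs under the gcloud beta group because the feature is described as beta. The cluster name and helper function are hypothetical, and the command is only printed, not executed.

```python
import shlex
from typing import List


def build_pd_ssd_create_command(cluster_name: str, use_beta: bool = True) -> List[str]:
    """Assemble a gcloud command that requests PD-SSD boot disks on all node types."""
    # Assumption: the beta command group is needed while PD-SSD support is in beta.
    prefix = ["gcloud", "beta"] if use_beta else ["gcloud"]
    disk_flags = [
        "--master-boot-disk-type=pd-ssd",
        "--worker-boot-disk-type=pd-ssd",
        # Only has an effect if the cluster has preemptible workers.
        "--preemptible-worker-boot-disk-type=pd-ssd",
    ]
    return prefix + ["dataproc", "clusters", "create", cluster_name] + disk_flags


# "my-ssd-cluster" is a placeholder name; substitute your own.
ssd_command = build_pd_ssd_create_command("my-ssd-cluster")

# Print the command for review before running it in a shell.
print(shlex.join(ssd_command))
```

Each argument is optional and independent, so you can drop any of the three flags to keep the default disk type for that node type.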

We hope these updates to Cloud Dataproc provide you with even better resilience and higher performance. For more information about Cloud Dataproc, check out the Cloud Dataproc documentation. You can also use the google-cloud-dataproc tag on Stack Overflow for help or useful tips.