Cloud Dataproc

A faster, easier, more cost-effective way to run Apache Spark and Apache Hadoop

Cloud-native Apache Hadoop & Apache Spark

Cloud Dataproc is a fast, easy-to-use, fully managed cloud service for running
Apache Spark and Apache Hadoop clusters in a simpler, more cost-efficient way.
Operations that used to take hours or days
take seconds or minutes instead, and you pay only for the resources you use
(with per-second billing). Cloud Dataproc also easily integrates with other
Google Cloud Platform (GCP) services, giving you a powerful and complete platform
for data processing, analytics and machine learning.

Fast & Scalable Data Processing

Create Cloud Dataproc clusters quickly and resize them at
any time—from three to hundreds of nodes—so you
don't have to worry about your data pipelines outgrowing your clusters.
With each cluster action taking less than 90 seconds on average,
you have more time to focus on insights, with less time lost to
infrastructure.

Affordable Pricing

Following Google Cloud Platform pricing principles, Cloud Dataproc
has a low-cost, easy-to-understand price structure
based on actual use, measured by the second. Cloud
Dataproc clusters can also include lower-cost preemptible instances,
giving you powerful clusters at an even lower total cost.

Open Source Ecosystem

The Spark and Hadoop ecosystem provides tools, libraries,
and documentation that you can leverage with Cloud Dataproc.
Because Cloud Dataproc offers frequently updated, native versions of Spark,
Hadoop, Pig, and Hive, you can get started without needing to
learn new tools or APIs, and you can move existing projects or
ETL pipelines without redevelopment.

Cloud Dataproc Features

Google Cloud Dataproc is a managed Apache Spark and Apache Hadoop service that is fast, easy to use, and low cost.

Automated Cluster Management

Managed deployment, logging, and monitoring let you focus
on your data, not on your cluster. Your clusters will
be stable, scalable, and speedy.

Resizable Clusters

Clusters can be created and scaled quickly
with a variety of virtual machine types, disk sizes, node counts,
and networking options.
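
As a rough illustration of the options listed above, the sketch below assembles the command lines one might pass to the gcloud CLI to create a cluster and later resize it. It only builds and prints the commands; it does not run them. The cluster name, region, machine type, and disk size are placeholder assumptions, and the helper function names are hypothetical, not part of any Google library.

```python
import shlex


def build_create_command(cluster, region, workers, machine_type, disk_size_gb):
    """Return the argument list for `gcloud dataproc clusters create`.

    Hypothetical helper: all values passed in are illustrative placeholders.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", cluster,
        f"--region={region}",
        f"--num-workers={workers}",                    # number of worker nodes
        f"--worker-machine-type={machine_type}",       # virtual machine type
        f"--worker-boot-disk-size={disk_size_gb}GB",   # disk size per worker
    ]


def build_resize_command(cluster, region, workers):
    """Return the argument list for resizing an existing cluster."""
    return [
        "gcloud", "dataproc", "clusters", "update", cluster,
        f"--region={region}",
        f"--num-workers={workers}",
    ]


if __name__ == "__main__":
    # Placeholder values chosen for illustration only.
    create = build_create_command("example-cluster", "us-central1", 3, "n1-standard-4", 500)
    resize = build_resize_command("example-cluster", "us-central1", 10)
    print(shlex.join(create))
    print(shlex.join(resize))
```

Keeping the command as a list of arguments rather than a single string avoids quoting problems if it is later handed to a process runner.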

Cloud Dataflow vs. Cloud Dataproc: Which should you use?

Cloud Dataproc and Cloud Dataflow can both be used for data processing,
and there’s overlap in their batch and streaming capabilities. How do you decide which
product is a better fit for your environment?

Cloud Dataproc

Cloud Dataproc is good for environments dependent on specific components of the Apache big data ecosystem:

Tools/packages

Pipelines

Skill sets of existing resources

Cloud Dataflow

Cloud Dataflow is typically the preferred option for greenfield environments:

If you pay in a currency other than USD, the prices listed in your currency on
Cloud Platform SKUs apply.

1 Google Cloud Dataproc incurs a small incremental fee per virtual
CPU in the Compute Engine instances used in your cluster while the cluster is operational. Additional
resources used by Cloud Dataproc, such as a Compute Engine network, BigQuery,
Cloud Bigtable, and others, are billed as they are consumed.
For detailed pricing information, please view the
pricing guide.
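
To make the per-vCPU, per-second model above concrete, here is a minimal sketch of the arithmetic. The rate used in the example is a made-up placeholder, not a published price, and the function name is hypothetical. The result covers only the incremental Cloud Dataproc fee; Compute Engine and other resources are billed separately, as noted above.

```python
def dataproc_fee(vcpus_per_node, nodes, seconds, rate_per_vcpu_hour):
    """Estimate the incremental Cloud Dataproc fee for one cluster run.

    Billing is assumed to be proportional to vCPU-seconds, converted to
    vCPU-hours. `rate_per_vcpu_hour` must come from the pricing guide; any
    value used here is a placeholder.
    """
    if min(vcpus_per_node, nodes, seconds, rate_per_vcpu_hour) < 0:
        raise ValueError("all inputs must be non-negative")
    vcpu_hours = vcpus_per_node * nodes * seconds / 3600
    return vcpu_hours * rate_per_vcpu_hour


if __name__ == "__main__":
    # 5 nodes with 4 vCPUs each, running for 30 minutes (1800 seconds),
    # at a hypothetical rate of 0.01 per vCPU-hour.
    fee = dataproc_fee(vcpus_per_node=4, nodes=5, seconds=1800, rate_per_vcpu_hour=0.01)
    print(f"Estimated Dataproc fee: {fee:.4f}")
```

In this example the cluster accrues 4 x 5 x 0.5 = 10 vCPU-hours, so the fee is 10 times whatever the actual per-vCPU hourly rate is.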