Cloud Dataflow

Faster development, easier management

Cloud Dataflow is a fully managed service for transforming and enriching data
in stream (real time) and
batch (historical) modes with equal reliability and expressiveness -- no more
complex workarounds or compromises needed. And with its serverless approach to
resource provisioning and management, you have access to virtually limitless
capacity to solve your biggest data processing challenges, while paying only
for what you use.

Accelerate development for batch & streaming

Cloud Dataflow supports fast, simplified pipeline development via
expressive Java and Python APIs in the
Apache Beam SDK,
which provides a rich set of windowing and session analysis primitives as well as an
ecosystem of source and sink connectors. Plus, Beam’s unique, unified
development model lets you reuse more code across
streaming and batch pipelines.
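
To make the windowing idea concrete, here is a minimal, Beam-free sketch of how fixed (tumbling) windows group timestamped events. The event data and 60-second window size are illustrative only; this is the concept behind a Beam windowing primitive, not the Beam SDK itself.

```python
from collections import defaultdict

def assign_fixed_windows(events, window_size):
    """Group (timestamp, value) events into fixed (tumbling) windows.

    Mirrors the idea behind fixed windowing in Beam: each event falls
    into exactly one window of `window_size` seconds, keyed by the
    window's start time.
    """
    windows = defaultdict(list)
    for timestamp, value in events:
        window_start = (timestamp // window_size) * window_size
        windows[window_start].append(value)
    return dict(windows)

# Hypothetical click events: (epoch seconds, user id)
events = [(0, "a"), (12, "b"), (61, "a"), (65, "c"), (130, "b")]

# 60-second tumbling windows
print(assign_fixed_windows(events, 60))
# {0: ['a', 'b'], 60: ['a', 'c'], 120: ['b']}
```

In a real Beam pipeline the same grouping is expressed declaratively with a windowing transform, and the identical code runs over both a bounded (batch) and an unbounded (streaming) source -- that is the reuse the unified model buys you.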

Simplify operations & management

GCP’s serverless approach removes operational overhead: performance,
scaling, availability, security, and compliance are handled automatically,
so users can focus on programming instead of managing server clusters.
Integration with Stackdriver, GCP’s unified
logging and monitoring solution, lets you monitor and troubleshoot your
pipelines as they are running. Rich visualization, logging, and advanced
alerting help you identify and respond to potential issues.

Build on a foundation for machine learning

Use Cloud Dataflow as a convenient integration point to bring predictive
analytics to fraud detection, real-time personalization and similar use cases
by adding TensorFlow-based Cloud Machine Learning models
and APIs to your data processing pipelines.

Cloud Dataflow vs. Cloud Dataproc: Which should you use?

Cloud Dataproc and
Cloud Dataflow can both be used for data processing,
and there’s overlap in their batch and streaming capabilities. How do you decide which
product is a better fit for your environment?

Cloud Dataproc

Cloud Dataproc is good for environments dependent on specific components of the Apache big data ecosystem:

- Tools/packages

- Pipelines

- Skill sets of existing resources

Cloud Dataflow

Cloud Dataflow is typically the preferred option for greenfield environments.

User-friendly pricing

Cloud Dataflow jobs are billed in per-second increments, based on the actual use of Cloud
Dataflow batch or streaming workers. Jobs that consume additional GCP resources
-- such as Cloud Storage or Cloud Pub/Sub -- are billed for each such service according to that service’s pricing.
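
As a rough illustration of what per-second billing means, the sketch below computes a worker cost from an hourly rate. The rate and job shape are hypothetical, not Cloud Dataflow’s actual prices.

```python
def worker_cost(seconds, num_workers, hourly_rate):
    """Per-second billing: each worker accrues cost only for the
    seconds it actually runs, at a (hypothetical) hourly rate."""
    per_second_rate = hourly_rate / 3600
    return seconds * num_workers * per_second_rate

# e.g. 3 batch workers running for 10 minutes at a made-up $0.36/hour
cost = worker_cost(seconds=600, num_workers=3, hourly_rate=0.36)
print(round(cost, 4))  # 0.18
```

The point of per-second granularity is that a job finishing at 10 minutes costs exactly 10 minutes of worker time, rather than being rounded up to a larger billing unit.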

Note: Cloud Dataflow Shuffle is currently available for batch pipelines in the us-central1 (Iowa) and europe-west1 (Belgium) regions only. It will become available in other regions in the future.

Note: Cloud Dataflow Streaming Engine uses the Streaming Data Processed pricing unit. Streaming Engine is currently available in beta for streaming pipelines in the us-central1 (Iowa) and europe-west1 (Belgium) regions only. It will become available in other regions in the future.