Cloud Dataflow pricing

This page describes pricing for Cloud Dataflow. To see the pricing for other
products, read the Pricing documentation.

Pricing overview

Although pricing rates are based on the hour, Cloud Dataflow service usage
is billed in per-second increments, on a per-job basis. Usage is stated in
hours (for example, 30 minutes is 0.5 hours) in order to apply hourly pricing
to second-by-second use. Workers and jobs may consume resources as described
in the following sections.
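
As a minimal illustration of this conversion, the sketch below turns
per-second usage into fractional hours; the hourly rate shown is a
hypothetical placeholder, not a published Cloud Dataflow price:

    # Illustration only: convert per-second usage to fractional hours.
    # The hourly rate is a hypothetical placeholder, not a published price.
    usage_seconds = 1800                  # a worker resource used for 30 minutes
    usage_hours = usage_seconds / 3600.0  # 0.5 hours
    hourly_rate = 0.05                    # hypothetical $ per resource-hour
    print(f"{usage_hours} hours -> ${usage_hours * hourly_rate:.4f}")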

Workers and worker resources

Each Cloud Dataflow job uses at least one Cloud Dataflow worker.
The Cloud Dataflow service provides two worker types: batch and
streaming. There are separate service charges for batch and streaming workers.

Cloud Dataflow workers consume the following resources, each billed
on a per-second basis: vCPU, memory, and storage (Persistent Disk).

Batch and streaming workers are specialized resources that use
Compute Engine. However, a Cloud Dataflow job does not generate separate
Compute Engine billing for the Compute Engine resources managed by
the Cloud Dataflow service; instead, Cloud Dataflow service charges
cover the use of those Compute Engine resources.

You can override the default worker count for a job. If you are using
autoscaling, you can
specify the maximum number of workers to be allocated to a job. Workers and
their associated resources are then added and removed automatically as
autoscaling acts.

In addition, you can use
pipeline options to override the default resource settings (machine type,
disk type, and disk size) that are allocated to each worker.
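
As a sketch, assuming the Apache Beam Python SDK: the option names below
(num_workers, max_num_workers, machine_type, disk_size_gb) reflect that SDK
and should be verified against your SDK version, and the project ID and
bucket are hypothetical:

    # Sketch: overriding worker count and worker resource settings when
    # launching a pipeline on Cloud Dataflow with the Beam Python SDK.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-bucket/temp",  # hypothetical staging bucket
        num_workers=2,                        # initial worker count
        max_num_workers=10,                   # autoscaling ceiling
        machine_type="n1-standard-4",         # machine type override
        disk_size_gb=100,                     # per-worker disk size override
    )

    with beam.Pipeline(options=options) as pipeline:
        (pipeline | beam.Create([1, 2, 3]) | beam.Map(print))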

Cloud Dataflow also provides an optional, highly scalable feature,
Cloud Dataflow Shuffle, which is available only for batch pipelines and
shuffles data outside of workers. Shuffle is billed by the volume of data
processed. You can instruct Cloud Dataflow to use Shuffle by specifying
the Shuffle pipeline parameter.
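
For example, with the Beam Python SDK this opt-in has historically been
passed as an experiment flag; the flag shown below is an assumption to be
checked against the documentation for your SDK version, and the project ID
and bucket are hypothetical:

    # Sketch: opting a batch pipeline into service-based Cloud Dataflow Shuffle.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                  # hypothetical project ID
        region="us-central1",                  # a region where Shuffle is available
        temp_location="gs://my-bucket/temp",   # hypothetical staging bucket
        experiments=["shuffle_mode=service"],  # assumed Shuffle opt-in flag
    )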

Similar to Shuffle, the Cloud Dataflow Streaming Engine moves
streaming shuffle and state processing out of the worker VMs and into the
Cloud Dataflow service backend. You instruct Cloud Dataflow to
use the Streaming Engine for your streaming pipelines by specifying the
Streaming Engine pipeline parameter.
Streaming Engine usage is billed by the volume of streaming data processed,
which depends on the volume of data ingested into your streaming pipeline and
the complexity and number of pipeline stages. Examples of what counts as a
byte processed include input flows from data sources, data flowing from one
fused pipeline stage to another, data persisted in user-defined state or used
for windowing, and output messages to data sinks such as Cloud Pub/Sub or
BigQuery.
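
A minimal sketch, assuming the Beam Python SDK's enable_streaming_engine
option (verify the flag name for your SDK version; the project ID and bucket
are hypothetical):

    # Sketch: opting a streaming pipeline into the Streaming Engine.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # hypothetical project ID
        region="us-central1",                 # a region with Streaming Engine
        temp_location="gs://my-bucket/temp",  # hypothetical staging bucket
        streaming=True,                       # run as a streaming pipeline
        enable_streaming_engine=True,         # assumed Streaming Engine opt-in
    )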

Cloud Dataflow also provides an option with discounted CPU and memory
pricing for batch processing. Flexible Resource Scheduling (FlexRS) combines
regular and preemptible VMs in a single Cloud Dataflow worker pool,
giving users access to cheaper
processing resources. FlexRS also delays the execution of a batch
Cloud Dataflow job within a 6-hour window to identify the best point in
time to start the job based on available resources. While Cloud Dataflow
uses a combination of workers to execute a FlexRS job, you are billed a uniform
discounted rate compared to regular Cloud Dataflow prices, regardless of
the worker type. You instruct Cloud Dataflow to use FlexRS for your
autoscaled batch pipelines by specifying the
FlexRS parameter.
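
A minimal sketch, assuming the Beam Python SDK's flexrs_goal option (verify
the parameter name for your SDK version; the project ID and bucket are
hypothetical):

    # Sketch: submitting an autoscaled batch job with FlexRS.
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                 # hypothetical project ID
        region="us-central1",
        temp_location="gs://my-bucket/temp",  # hypothetical staging bucket
        flexrs_goal="COST_OPTIMIZED",         # assumed FlexRS opt-in parameter
        max_num_workers=10,                   # FlexRS jobs use autoscaling
    )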

Additional job resources

In addition to worker resource usage, a job might consume additional
resources, each billed at its own pricing, including but not limited to
other services such as Cloud Storage, Cloud Pub/Sub, and BigQuery.

Cloud Dataflow Shuffle is currently available for batch pipelines in the following regions:

us-central1 (Iowa)

europe-west1 (Belgium)

europe-west4 (Netherlands)

asia-northeast1 (Tokyo)

It will become available in other regions in the future.

Cloud Dataflow Streaming Engine uses the Streaming Data Processed pricing unit. Streaming Engine is currently available in the following regions:

us-central1 (Iowa)

europe-west1 (Belgium)

asia-northeast1 (Tokyo)

europe-west4 (Netherlands)

It will become available in other regions in the future.

Prior to May 3, 2018, Cloud Dataflow Shuffle was billed by the amount of
data shuffled multiplied by the time that data was held in Shuffle's memory;
the price was $0.0216 per gigabyte per hour. As of May 3, 2018, Shuffle is
priced exclusively by the amount of data that the Cloud Dataflow service
infrastructure reads and writes while shuffling your dataset; the pricing unit
is gigabytes, with the time dependency removed from billing consideration.
Users with large or very large datasets should expect significant reductions
in their total Shuffle costs.
To further encourage the adoption of service-based Shuffle, the first
5 Terabytes of Shuffle Data Processed are charged at rates reduced by 50%.
For example, if your pipeline results in 1 TB of actual
Shuffle Data Processed, you are charged only for 50% of that data volume
(0.5 TB). If your pipeline results in 10 TB of actual Shuffle Data Processed,
you are charged for 7.5 TB, because the first 5 TB of that volume are charged at
50% reduced rates.
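
This discount rule can be expressed directly; the sketch below reproduces the
two examples from the preceding paragraph:

    # Chargeable Shuffle Data Processed: the first 5 TB of actual data is
    # charged at a 50% reduced rate, which is equivalent to counting half
    # of that volume; anything beyond 5 TB is counted in full.
    def chargeable_shuffle_tb(actual_tb):
        discounted = min(actual_tb, 5.0) * 0.5  # first 5 TB counted at 50%
        full = max(actual_tb - 5.0, 0.0)        # remainder counted in full
        return discounted + full

    assert chargeable_shuffle_tb(1.0) == 0.5    # 1 TB actual -> 0.5 TB chargeable
    assert chargeable_shuffle_tb(10.0) == 7.5   # 10 TB actual -> 7.5 TB chargeable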

Viewing usage

You can view the total vCPU, memory, and Persistent Disk resources associated
with a job either in the Google Cloud Platform Console or with the
gcloud command-line tool. You
can track both the actual and chargeable Shuffle Data Processed and Streaming Data Processed metrics in the
Cloud Dataflow Monitoring Interface.
You can use the actual Shuffle Data Processed to evaluate the performance of
your pipeline and the chargeable Shuffle Data Processed to determine the costs
of the Cloud Dataflow job. For Streaming Data Processed, the actual and
chargeable metrics are identical.