Deploying a Pipeline

After you construct and test your Apache Beam pipeline, you can use the Cloud Dataflow
managed service to deploy and execute it. Once on the Cloud Dataflow service, your pipeline
code becomes a Cloud Dataflow job.

You can control some aspects of how the Cloud Dataflow service runs your job
by setting execution parameters in your pipeline
code. For example, the execution parameters specify whether the steps of your pipeline run on
worker virtual machines, on the Cloud Dataflow service backend, or locally.
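For example, with the Apache Beam SDK for Java, execution parameters are typically parsed from
command-line arguments into a DataflowPipelineOptions object. The following is a minimal sketch,
not a complete program; the project ID and staging bucket are placeholders:

    import org.apache.beam.runners.dataflow.DataflowRunner;
    import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    // Parse execution parameters from args and target the Cloud Dataflow service.
    DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
        .withValidation()
        .as(DataflowPipelineOptions.class);
    options.setRunner(DataflowRunner.class);               // run on the managed service
    options.setProject("my-project-id");                   // placeholder
    options.setStagingLocation("gs://my-bucket/staging");  // placeholder

    Pipeline p = Pipeline.create(options);
    // ... apply transforms ...
    p.run();  // submits the job to the Cloud Dataflow service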

In addition to managing GCP resources, the Cloud Dataflow service
automatically performs and optimizes many aspects of distributed parallel processing. These include:

Optimization. Cloud Dataflow uses your pipeline code to create an
execution graph that represents your pipeline's PCollections and transforms, and
optimizes the graph for the most efficient performance and resource usage. Cloud Dataflow also
automatically optimizes potentially costly operations, such as data aggregations.

Automatic Tuning features. The Cloud Dataflow service includes several features that provide
on-the-fly adjustment of resource allocation and data partitioning, such as Autoscaling and
Dynamic Work Rebalancing. These features help the Cloud Dataflow service execute your job as quickly
and efficiently as possible.

Pipeline Lifecycle: From Pipeline Code to Cloud Dataflow Job

When you run your Cloud Dataflow program, Cloud Dataflow creates an execution
graph from the code that constructs your Pipeline object, including all of the
transforms and their associated processing functions (such as DoFns). This phase is
called Graph Construction Time. During graph construction, Cloud Dataflow checks
for various errors and ensures that your pipeline graph doesn't contain any illegal
operations. The execution graph is translated into JSON format, and the JSON execution graph is
transmitted to the Cloud Dataflow service endpoint.

Note: Graph construction also happens when you execute your pipeline locally,
but the graph is not translated to JSON or transmitted to the service. Instead, the graph is run
locally on the same machine where you launched your Cloud Dataflow program. See the documentation on
configuring for local
execution for more details.

The Cloud Dataflow service then validates the JSON execution graph. When the graph is validated,
it becomes a job on the Cloud Dataflow service. You'll be able to see your job, its execution
graph, status, and log information by using the
Cloud Dataflow Monitoring Interface.

Python

The Cloud Dataflow service sends a response to the machine where you ran your Cloud Dataflow
program. This response is encapsulated in the object DataflowPipelineResult, which
contains your Cloud Dataflow job's job_id. You can use the job_id to monitor,
track, and troubleshoot your job using the
Cloud Dataflow Monitoring Interface and the
Cloud Dataflow Command-line Interface.

Execution Graph

Cloud Dataflow builds a graph of steps that represents your pipeline, based on the transforms and
data you used when you constructed your Pipeline object. This is the pipeline
execution graph.

The WordCount example, included
with the Apache Beam SDKs, contains a series of transforms to read, extract, count, format,
and write the individual words in a collection of text, along with an occurrence count for each
word. The following diagram shows how the transforms in the WordCount pipeline are expanded into
an execution graph:

Figure 1: WordCount Example Execution Graph

The order of steps in the execution graph often differs from the order in which you specified
your transforms when you constructed the pipeline. This is because the Cloud Dataflow service
performs various optimizations and fusions on the execution graph before it runs on managed
cloud resources. The Cloud Dataflow service respects data dependencies when executing your pipeline;
however, steps without data dependencies between them may be executed in any order.

Parallelization and Distribution

The Cloud Dataflow service automatically parallelizes and distributes the processing logic in your
pipeline to the workers you've allotted to perform your job. Cloud Dataflow uses the abstractions in the
programming model to represent parallel processing
functions; for example, your ParDo transforms cause Cloud Dataflow
to automatically distribute your processing code (represented by DoFns) to multiple
workers to be run in parallel.

Structuring your User Code

You can think of your DoFn code as small, independent entities: there can
potentially be many instances running on different machines, each with no knowledge of the others.
As such, pure functions (functions that do not depend on hidden or external state, that
have no observable side effects, and are deterministic) are ideal code for the parallel and
distributed nature of DoFns.
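For example, a DoFn along the lines of WordCount's word extraction is a pure function of its
input element, so any number of instances can safely process elements in parallel (a minimal
sketch):

    // A pure DoFn: output depends only on the input element; no shared or hidden state.
    static class ExtractWordsFn extends DoFn<String, String> {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String word : c.element().split("[^\\p{L}']+")) {
          if (!word.isEmpty()) {
            c.output(word);  // deterministic, no observable side effects
          }
        }
      }
    }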

The pure function model is not strictly rigid, however; state information or external
initialization data can be valid for DoFn and other function objects, so long as your
code does not depend on things that the Cloud Dataflow service does not guarantee. When structuring your
ParDo transforms and creating your DoFns, keep the following guidelines
in mind:

The Cloud Dataflow service guarantees that every element in your input PCollection is
processed by a DoFn instance exactly once.

The Cloud Dataflow service does not guarantee how many times a DoFn will be
invoked.

The Cloud Dataflow service does not guarantee exactly how the distributed elements are
grouped—that is, it does not guarantee which (if any) elements are processed
together.

The Cloud Dataflow service does not guarantee the exact number of DoFn instances that
will be created over the course of a pipeline.

The Cloud Dataflow service is fault-tolerant, and may retry your code multiple times in the case
of worker issues. The Cloud Dataflow service may create backup copies of your code, and can have
issues with manual side effects (such as if your code relies upon or creates temporary files
with non-unique names).

The Cloud Dataflow service serializes element processing per DoFn instance.
Your code does not need to be strictly thread-safe; however, any state shared between multiple
DoFn instances must be thread-safe.

Error and Exception Handling

Your pipeline may throw exceptions while processing data. Some of these errors are transient
(e.g., temporary difficulty accessing an external service), but some are permanent, such as errors
caused by corrupt or unparseable input data, or null pointers during computation.

Cloud Dataflow processes elements in arbitrary bundles and retries the complete bundle when an
error is thrown for any element in that bundle. When running in batch mode, bundles that include
a failing item are retried four times; the pipeline fails completely when a single bundle has
failed four times. When running in streaming mode, a bundle that includes a failing item is
retried indefinitely, which may cause your pipeline to permanently stall.

Note: When processing in batch mode, you might see a large number of individual failures
before a pipeline job fails completely (which happens when any given bundle fails after four
retry attempts). For example, if your pipeline attempts to process 100 bundles,
Cloud Dataflow could theoretically generate several hundred individual failures before a single
bundle fails four times and the job exits.
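One common way to keep permanently failing elements from exhausting bundle retries is a
dead-letter pattern: catch the exception in your DoFn and route the bad element to a secondary
output for later inspection. The sketch below uses the Beam Java SDK's multiple-output tags;
input is a hypothetical PCollection<String> and parseRecord is a hypothetical function that
throws on corrupt input:

    // Route elements that fail permanently to a dead-letter output instead of rethrowing.
    final TupleTag<String> parsedTag = new TupleTag<String>() {};
    final TupleTag<String> failedTag = new TupleTag<String>() {};

    PCollectionTuple results = input.apply("ParseRecords",
        ParDo.of(new DoFn<String, String>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            try {
              c.output(parseRecord(c.element()));  // hypothetical parse; may throw
            } catch (Exception e) {
              // Swallow the permanent failure so the bundle is not retried;
              // emit the raw element for offline debugging instead.
              c.output(failedTag, c.element());
            }
          }
        }).withOutputTags(parsedTag, TupleTagList.of(failedTag)));

    PCollection<String> parsed = results.get(parsedTag);
    PCollection<String> failed = results.get(failedTag);  // e.g., write out for review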

Fusion Optimization

Once the JSON form of your pipeline's execution graph has been validated, the Cloud Dataflow service
may modify the graph to perform optimizations. Such optimizations can include fusing multiple
steps or transforms in your pipeline's execution graph into single steps. Fusing steps prevents
the Cloud Dataflow service from needing to materialize every intermediate PCollection in
your pipeline, which can be costly in terms of memory and processing overhead.

While all the transforms you've specified in your pipeline construction are executed on the
service, they may be executed in a different order, or as part of a larger fused transform to
ensure the most efficient execution of your pipeline. The Cloud Dataflow service will respect data
dependencies between the steps in the execution graph, but otherwise steps may be executed in any
order.

Fusion Example

The following diagram shows how the execution graph from the
WordCount example included with
the Apache Beam SDK for Java might be optimized and fused by the Cloud Dataflow
service for efficient execution:

Figure 2: WordCount Example Optimized Execution Graph

Preventing Fusion

There are a few cases in your pipeline where you may want to prevent the Cloud Dataflow service from
performing fusion optimizations. These are cases in which the Cloud Dataflow service might incorrectly
guess the optimal way to fuse operations in the pipeline, which could limit the Cloud Dataflow service's
ability to make use of all available workers.

For example, one case in which fusion can limit Cloud Dataflow's ability to optimize worker usage is a
"high fan-out" ParDo. In such an operation, you might have an input collection with
relatively few elements, but the ParDo produces an output with hundreds or thousands
of times as many elements, followed by another ParDo. If the Cloud Dataflow service fuses
these ParDo operations together, parallelism in this step is limited to at most the
number of items in the input collection, even though the intermediate PCollection
contains many more elements.

You can prevent such a fusion by adding an operation to your pipeline that forces the Cloud Dataflow
service to materialize your intermediate PCollection. Consider using one of
the following operations:

You can insert a GroupByKey and ungroup after your first ParDo, as sketched after this
list. The Cloud Dataflow service never fuses ParDo operations across an aggregation.

You can pass your intermediate PCollection as a
side input
to another ParDo. The Cloud Dataflow service always materializes side inputs.
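The following is a minimal sketch of the first option for a PCollection<String>; the
collection name and the arbitrary keying scheme are illustrative:

    // Break fusion after a high fan-out ParDo by keying, grouping, and ungrouping,
    // which forces the service to materialize the intermediate PCollection.
    PCollection<String> materialized = highFanOutOutput
        .apply("AddArbitraryKey", WithKeys.of(
            new SerializableFunction<String, Integer>() {
              @Override
              public Integer apply(String value) {
                return value.hashCode() % 512;  // arbitrary keys spread elements across groups
              }
            }))
        .apply(GroupByKey.<Integer, String>create())
        .apply(Values.<Iterable<String>>create())
        .apply(Flatten.<String>iterables());  // "ungroup" back to individual elements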

Combine Optimization

Aggregation operations are an important concept in large-scale data processing. Aggregation
brings together data that's conceptually far apart, making it extremely useful for correlating.
The Cloud Dataflow programming model represents
aggregation operations as the GroupByKey, CoGroupByKey, and
Combine transforms.

Cloud Dataflow's aggregation operations combine data across the entire data set, including data that
may be spread across multiple workers. During such aggregation operations, it's often most
efficient to combine as much data locally as possible before combining data across instances. When
you apply a GroupByKey or other aggregating transform, the Cloud Dataflow service
automatically performs partial combining locally before the main grouping operation.

Note: Because the Cloud Dataflow service automatically performs partial
local combining, it is strongly recommended that you do not attempt to make this optimization
by hand in your pipeline code.

When performing partial or multi-level combining, the Cloud Dataflow service makes different decisions
based on whether your pipeline is working with batch or streaming data. For bounded data, the
service favors efficiency and will perform as much local combining as possible. For unbounded
data, the service favors lower latency, and may not perform partial combining (as it may increase
latency).
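In practice, this means you should express aggregations with the built-in combining transforms
and let the service decide where partial combining happens. A minimal sketch, assuming a keyed
input collection named keyedValues:

    // The service automatically lifts partial sums onto each worker before the
    // main grouping step; no hand-rolled pre-aggregation is needed.
    PCollection<KV<String, Long>> totals = keyedValues.apply(Sum.<String>longsPerKey());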

Autotuning Features

The Cloud Dataflow service contains several autotuning features that can further
dynamically optimize your Cloud Dataflow job while it is running. These features include
Autoscaling and
Dynamic Work Rebalancing.

Autoscaling

With autoscaling enabled, the Cloud Dataflow service automatically chooses the appropriate number of
worker instances required to run your job. The Cloud Dataflow service may also dynamically re-allocate
more workers or fewer workers during runtime to account for the characteristics of your job.
Certain parts of your pipeline may be computationally heavier than others, and the Cloud Dataflow
service may automatically spin up additional workers during these phases of your job (and shut
them down when they're no longer needed).

Java: SDK 2.x

Autoscaling is enabled by default on all batch Cloud Dataflow jobs. You can disable autoscaling by
explicitly specifying the option
--autoscalingAlgorithm=NONE when you run your pipeline; if so, note that the Cloud Dataflow
service sets the number of workers based on the --numWorkers option, which defaults
to 3.

If your Cloud Dataflow job uses an earlier version of the SDK, you can enable autoscaling by
specifying the option
--autoscalingAlgorithm=THROUGHPUT_BASED when you run your pipeline.

Note: With autoscaling enabled, the Cloud Dataflow service does not allow
user control of the exact number of worker instances allocated to your job. You may still
cap the number of workers by
specifying the
--maxNumWorkers option when you run your pipeline.
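For example, a batch job with autoscaling disabled and a fixed worker count might be launched
with options like the first line below, while the second shows an autoscaled job with a worker
cap (values are placeholders):

    --runner=DataflowRunner --autoscalingAlgorithm=NONE --numWorkers=5

    --runner=DataflowRunner --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=20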

Python

Autoscaling is enabled by default on all batch Cloud Dataflow jobs created using the
Apache Beam SDK for Python version 0.5.1 or higher. You can disable autoscaling by explicitly
specifying the option
--autoscaling_algorithm=NONE when you run your pipeline; if so, note that the Cloud Dataflow
service sets the number of workers based on the --num_workers option, which defaults
to 3.

If your Cloud Dataflow job uses an earlier version of the SDK, you can enable autoscaling by
specifying the option
--autoscaling_algorithm=THROUGHPUT_BASED when you run your pipeline.

Note: With autoscaling enabled, the Cloud Dataflow service does not allow
user control of the exact number of worker instances allocated to your job. You may still
cap the number of workers by
specifying the
--max_num_workers option when you run your pipeline.

Java: SDK 1.x

Autoscaling is enabled by default on all batch Cloud Dataflow jobs created using the Cloud Dataflow SDK for
Java version 1.6.0 or higher. You can disable autoscaling by explicitly
specifying the option
--autoscalingAlgorithm=NONE when you run your pipeline; if so, note that the Cloud Dataflow
service sets the number of workers based on the --numWorkers option, which defaults
to 3.

If your Cloud Dataflow job uses an earlier version of the SDK, you can enable autoscaling by
specifying the option
--autoscalingAlgorithm=THROUGHPUT_BASED when you run your pipeline.

Note: With autoscaling enabled, the Cloud Dataflow service does not allow
user control of the exact number of worker instances allocated to your job. You may still
cap the number of workers by
specifying the
--maxNumWorkers option when you run your pipeline.

Batch Autoscaling

For bounded data in batch mode, Cloud Dataflow automatically chooses the number of workers based
on both the amount of work in each stage of your pipeline and the current throughput at that
stage.

If your pipeline uses a custom data source that you've implemented, there
are a few methods you can implement that provide more information to the
Cloud Dataflow service's autoscaling algorithm and potentially improve
performance:

Java: SDK 2.x

In your BoundedSource subclass, implement the method
getEstimatedSizeBytes. The Cloud Dataflow service uses getEstimatedSizeBytes
when calculating the initial number of workers to use for your pipeline.

In your BoundedReader subclass, implement the method
getFractionConsumed. The Cloud Dataflow service uses getFractionConsumed to
track read progress and converge on the correct number of workers to use during a read.
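The following abbreviated sketch shows where these hooks live; the record type and byte
accounting are illustrative, and the other required source and reader methods are elided:

    // Sketch: autoscaling hooks on a hypothetical bounded source and its reader.
    class MyRecordSource extends BoundedSource<String> {
      private final long totalBytes;  // assumed known when the source is constructed

      @Override
      public long getEstimatedSizeBytes(PipelineOptions options) {
        return totalBytes;  // lets the service pick the initial number of workers
      }
      // split(...), createReader(...), and coder methods elided for brevity.
    }

    class MyRecordReader extends BoundedSource.BoundedReader<String> {
      private final long totalBytes;
      private long bytesReadSoFar;

      @Override
      public Double getFractionConsumed() {
        // Progress signal the service uses to converge on worker count mid-read.
        return totalBytes == 0 ? 1.0 : (double) bytesReadSoFar / totalBytes;
      }
      // start(), advance(), getCurrent(), getCurrentSource(), close() elided.
    }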

Python

In your BoundedSource subclass, implement the method
estimate_size. The Cloud Dataflow service uses estimate_size
when calculating the initial number of workers to use for your pipeline.

In your RangeTracker subclass, implement the method
fraction_consumed. The Cloud Dataflow service uses fraction_consumed to
track read progress and converge on the correct number of workers to use during a read.

Java: SDK 1.x

In your BoundedSource subclass, implement the method
getEstimatedSizeBytes. The Cloud Dataflow service uses getEstimatedSizeBytes
when calculating the initial number of workers to use for your pipeline.

In your BoundedReader subclass, implement the method
getFractionConsumed. The Cloud Dataflow service uses getFractionConsumed to
track read progress and converge on the correct number of workers to use during a read.

Streaming Autoscaling

Java: SDK 2.x

Beta

This is a beta release of Streaming Autoscaling. This feature might be changed in
backward-incompatible ways and is not subject to any SLA or deprecation policy.

Streaming autoscaling allows the Cloud Dataflow service to adaptively change the number of workers used
to execute your streaming pipeline in response to changes in load and resource utilization.
Streaming autoscaling is a free feature and is designed to reduce the costs of the resources
used when executing streaming pipelines.

Without autoscaling, you would choose a fixed number of workers (by specifying
--numWorkers) to execute your pipeline. As the input workload varies over time, this
number can become either too high or too low. Provisioning too many workers results in unnecessary
extra cost, and provisioning too few workers results in higher latency for processed data. By
enabling autoscaling, resources are used only as they are needed.

To make scaling decisions, autoscaling relies on several signals that assess how busy workers are
and whether they can keep up with the input stream. Key signals include CPU utilization,
throughput, and backlog. The objective is to minimize backlog while maximizing
worker utilization and throughput, and quickly react to spikes in load. By enabling autoscaling,
you don't have to choose between provisioning for peak load and fresh results. Workers are added
as CPU utilization and backlog increase and are removed as these metrics come down. This way,
you’re paying only for what you need, and the job is processed as efficiently as possible.

If your pipeline uses a custom unbounded source, it is essential that the source informs
the Cloud Dataflow service about backlog. Backlog is an estimate of the input in bytes that has not
yet been processed by the source. To inform the service about backlog, implement either one of
the two following methods in your UnboundedReader class:

getSplitBacklogBytes() - Backlog for the current split of the source. The service
aggregates backlog across all the splits.

getTotalBacklogBytes() - The global backlog across all the splits. In some cases
the backlog is not available for each split and can only be calculated across all the splits.
Only the first split (split id ‘0’) needs to provide total backlog.
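For example, an UnboundedReader over a partitioned source might report per-split backlog as in
this sketch; the byte accounting is illustrative, and the other required reader methods are
elided:

    // Sketch: report per-split backlog so the service can scale streaming workers.
    class MyUnboundedReader extends UnboundedSource.UnboundedReader<String> {
      private long unreadBytesEstimate = -1;  // updated as data is fetched and consumed

      @Override
      public long getSplitBacklogBytes() {
        // Estimated bytes not yet processed by this split; the service
        // aggregates this value across all splits of the source.
        return unreadBytesEstimate >= 0 ? unreadBytesEstimate : BACKLOG_UNKNOWN;
      }
      // start(), advance(), getWatermark(), getCheckpointMark(), etc. elided.
    }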

Enable Streaming Autoscaling

When autoscaling is enabled, the service can scale between N/15 and N workers during the
execution of a pipeline, where N is the value of --maxNumWorkers. For example, if your pipeline
needs 3 or 4 workers in steady state, you could set --maxNumWorkers=15 and the pipeline would
automatically scale between 1 and 15 workers.
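To enable streaming autoscaling in this beta, you would typically launch the pipeline with
options along these lines (the cap is a placeholder):

    --autoscalingAlgorithm=THROUGHPUT_BASED --maxNumWorkers=15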

Streaming pipelines are deployed with a fixed pool of
persistent disks, equal in number to
--maxNumWorkers. Take this into account when you specify --maxNumWorkers,
and ensure that this value provides a sufficient number of disks for your pipeline.

NOTE: If you've reached a scaling limit and want to raise the --maxNumWorkers,
you'll need to submit a new job with a higher --maxNumWorkers.

If you want to update a streaming autoscaling job, make sure --maxNumWorkers
remains the same (see the section on
manually scaling streaming pipelines).
Note that not specifying the --autoscalingAlgorithm pipeline option in the
Update command disables autoscaling for the updated job.

Currently, PubsubIO is the only source that supports autoscaling on streaming pipelines. All
SDK-provided
sinks are supported. In this Beta release, autoscaling works most smoothly when reading
from Cloud Pub/Sub subscriptions tied to topics published with small batches and when writing to sinks
with low latency. In extreme cases (for example, Cloud Pub/Sub subscriptions with large publishing
batches or sinks with very high latency), autoscaling is known to become coarse-grained. This will
be improved in future releases.

Usage and Pricing

Compute Engine usage is based on the average number of workers, while persistent disk
usage is based on the exact number of --maxNumWorkers. Persistent disks are
redistributed such that each worker gets an equal number of attached disks.

In the example above, where --maxNumWorkers=15, you will
pay for between 1 and 15 Compute Engine instances and exactly 15
persistent disks.

Python

This feature is not yet supported in the Apache Beam SDK for Python.

Java: SDK 1.x

Beta

This is a beta release of Streaming Autoscaling. This feature might be changed in
backward-incompatible ways and is not subject to any SLA or deprecation policy.

Streaming autoscaling allows the Cloud Dataflow service to adaptively change the number of workers used
to execute your streaming pipeline in response to changes in load and resource utilization.
Streaming autoscaling is a free feature and is designed to reduce the costs of the resources
used when executing streaming pipelines.

Without autoscaling, you would choose a fixed number of workers (by specifying
--numWorkers) to execute your pipeline. As the input workload varies over time, this
number can become either too high or too low. Provisioning too many workers results in unnecessary
extra cost, and provisioning too few workers results in higher latency for processed data. By
enabling autoscaling, resources are used only as they are needed.

To make scaling decisions, autoscaling relies on several signals that assess how busy workers are
and whether they can keep up with the input stream. Key signals include CPU utilization,
throughput, and backlog. The objective is to minimize backlog while maximizing
worker utilization and throughput, and quickly react to spikes in load. By enabling autoscaling,
you don't have to choose between provisioning for peak load and fresh results. Workers are added
as CPU utilization and backlog increase and are removed as these metrics come down. This way,
you’re paying only for what you need, and the job is processed as efficiently as possible.

If your pipeline uses a custom unbounded source, it
is essential that the source informs the Cloud Dataflow service about backlog. Backlog is an estimate of
the input in bytes that has not yet been processed by the source. To inform the service about
backlog, implement either one of the two following methods in your UnboundedReader
class:

getSplitBacklogBytes() - Backlog for the current split of the source. The service
aggregates backlog across all the splits.

getTotalBacklogBytes() - The global backlog across all the splits. In some cases
the backlog is not available for each split and can only be calculated across all the splits.
Only the first split (split id ‘0’) needs to provide total backlog.

Enable Streaming Autoscaling

When autoscaling is enabled, the service can scale between N/15 and N workers during the
execution of a pipeline, where N is the value of --maxNumWorkers. For example, if your pipeline
needs 3 or 4 workers in steady state, you could set --maxNumWorkers=15 and the pipeline would
automatically scale between 1 and 15 workers.

Streaming pipelines are deployed with a fixed pool of
persistent disks, equal in number to
--maxNumWorkers. Take this into account when you specify --maxNumWorkers,
and ensure that this value provides a sufficient number of disks for your pipeline.

NOTE: If you've reached a scaling limit and want to raise the --maxNumWorkers,
you'll need to submit a new job with a higher --maxNumWorkers.

If you want to update a streaming autoscaling job, make sure --maxNumWorkers
remains the same (see the section on
manually scaling streaming pipelines).
Note that not specifying the --autoscalingAlgorithm pipeline option in the
Update command disables autoscaling for the updated job.

Currently, PubsubIO is the only source that supports autoscaling on streaming pipelines. All
SDK-provided
sinks are supported. In this Beta release, autoscaling works most smoothly when reading
from Cloud Pub/Sub subscriptions tied to topics published with small batches and when writing to sinks
with low latency. In extreme cases (for example, Cloud Pub/Sub subscriptions with large publishing
batches or sinks with very high latency), autoscaling is known to become coarse-grained. This will
be improved in future releases.

Usage and Pricing

Compute Engine usage is based on the average number of workers, while persistent disk
usage is based on the exact number of --maxNumWorkers. Persistent disks are
redistributed such that each worker gets an equal number of attached disks.

In the example above, where --maxNumWorkers=15, you will
pay for between 1 and 15 Compute Engine instances and exactly 15
persistent disks.

Manually Scaling a Streaming Pipeline

Java: SDK 2.x

Until autoscaling is generally available in streaming mode, there is a workaround you can use to
manually scale the number of workers running your streaming pipeline by using Cloud Dataflow's
Update feature.

If you know you'll want to scale your streaming pipeline during execution, ensure that you set
the following execution parameters when
you start your pipeline:

Set --maxNumWorkers equal to the maximum number of workers you want
available to your pipeline.

Set --numWorkers equal to the initial number of workers you want your
pipeline to use when it starts running.

Once your pipeline is running, you can Update your pipeline and specify a new number of
workers using the --numWorkers parameter. The value you set for the new
--numWorkers must be between N and --maxNumWorkers, where
N is equal to --maxNumWorkers / 15.

Update will replace your running job with a new job, using the new number of workers, while
preserving all state information associated with the previous job.
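In practice, an Update is performed by relaunching your pipeline code with the update options
set; for example (the job name and worker counts are placeholders):

    --runner=DataflowRunner --update --jobName=my-streaming-job \
    --numWorkers=10 --maxNumWorkers=15

The --jobName value must match the name of the running job you intend to replace.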

Note: Your pipeline's maximum scaling range is dependent upon the number of persistent
disks deployed when the pipeline starts. The Cloud Dataflow service deploys one persistent disk per
worker at the maximum number of workers. Deploying extra persistent disks by setting
--maxNumWorkers to a higher value than --numWorkers provides some
benefits to your pipeline—specifically, it allows you the flexibility to scale your pipeline
to a larger number of workers after startup, and may provide
improved
performance. However, your pipeline might also incur additional cost for the extra
persistent disks. Take note of the cost and quota implications of the additional persistent disk
resources when planning your streaming pipeline and setting the scaling range.

Note: It is not possible to change the scaling range of a pipeline by using the Update
feature. If you need to scale further, you'll need to start a new pipeline and specify a higher
value for --maxNumWorkers as the ceiling of your desired scaling range.

Python

This feature is not yet supported in the Apache Beam SDK for Python.

Java: SDK 1.x

Until autoscaling is generally available in streaming mode, there is a workaround you can use to
manually scale the number of workers running your streaming pipeline by using Cloud Dataflow's
Update feature.

If you know you'll want to scale your streaming pipeline during execution, ensure that you set
the following execution parameters when
you start your pipeline:

Set --maxNumWorkers equal to the maximum number of workers you want
available to your pipeline.

Set --numWorkers equal to the initial number of workers you want your
pipeline to use when it starts running.

Once your pipeline is running, you can Update your pipeline and specify a new number of
workers using the --numWorkers parameter. The value you set for the new
--numWorkers must be between N and --maxNumWorkers, where
N is equal to --maxNumWorkers / 15.

Update will replace your running job with a new job, using the new number of workers, while
preserving all state information associated with the previous job.

Note: Your pipeline's maximum scaling range is dependent upon the number of persistent
disks deployed when the pipeline starts. The Cloud Dataflow service deploys one persistent disk per
worker at the maximum number of workers. Deploying extra persistent disks by setting
--maxNumWorkers to a higher value than --numWorkers provides some
benefits to your pipeline—specifically, it allows you the flexibility to scale your pipeline
to a larger number of workers after startup, and may provide
improved
performance. However, your pipeline might also incur additional cost for the extra
persistent disks. Take note of the cost and quota implications of the additional persistent disk
resources when planning your streaming pipeline and setting the scaling range.

Note: It is not possible to change the scaling range of a pipeline by using the Update
feature. If you need to scale further, you'll need to start a new pipeline and specify a higher
value for --maxNumWorkers as the ceiling of your desired scaling range.

Dynamic Work Rebalancing

The Cloud Dataflow service's Dynamic Work Rebalancing feature allows the service to dynamically
re-partition work based on runtime conditions. These conditions might include:

Imbalances in work assignments

Workers taking longer than expected to finish

Workers finishing faster than expected

The Cloud Dataflow service automatically detects these conditions and can dynamically reassign work to
unused or underused workers to decrease your job's overall processing time.

Limitations

Dynamic Work Rebalancing only happens when the Cloud Dataflow service is processing some input data in
parallel: when reading data from an external input source, when working with a materialized
intermediate PCollection, or when working with the result of an aggregation like
GroupByKey. If a large number of steps in your job are
fused, there are fewer intermediate PCollections in your
job and Dynamic Work Rebalancing will be limited to the number of elements in the source
materialized PCollection. If you want to ensure that Dynamic Work Rebalancing can be
applied to a particular PCollection in your pipeline, you can
prevent fusion in a few different ways to ensure dynamic
parallelism.

Dynamic Work Rebalancing cannot re-parallelize data finer than a single record. If your data
contains individual records that cause large delays in processing time, they may still delay your
job, since Cloud Dataflow cannot subdivide and redistribute an individual "hot" record to multiple
workers.

Java: SDK 2.x

If you've set a fixed number of shards for your pipeline's final output (for example, by writing
data using TextIO.Write.withNumShards), parallelization will be limited based on
the number of shards that you've chosen.
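For example, the first write below fixes the output to 10 shards and caps write parallelism
accordingly, while the second leaves sharding runner-determined (the output path is a
placeholder):

    // Fixed sharding: write parallelism is limited to 10.
    lines.apply(TextIO.write().to("gs://my-bucket/out").withNumShards(10));

    // Runner-determined sharding (the default): no fixed-shard limit.
    lines.apply(TextIO.write().to("gs://my-bucket/out"));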

Python

If you've set a fixed number of shards for your pipeline's final output (for example, by writing
data using beam.io.WriteToText(..., num_shards=...)), Cloud Dataflow will limit
parallelization based on the number of shards that you've chosen.

Java: SDK 1.x

If you've set a fixed number of shards for your pipeline's final output (for example, by writing
data using TextIO.Write.withNumShards), parallelization will be limited based on
the number of shards that you've chosen.

The fixed-shards limitation can be considered temporary, and may be subject to
change in future releases of the Cloud Dataflow service.

Working with Custom Data Sources

Java: SDK 2.x

If your pipeline uses a custom data source that you
provide, you must implement the method splitAtFraction to allow your source to work
with the Dynamic Work Rebalancing feature.

Caution: Using Dynamic Work Rebalancing with custom data sources is an extremely advanced
use case. If you choose to implement splitAtFraction, it is critical that you test
your code extensively and with maximum code coverage.

If you implement splitAtFraction incorrectly, records from your source may appear to
get duplicated or dropped.
See the
API reference information on RangeTracker for help and tips on implementing
splitAtFraction.

Python

If your pipeline uses a custom data source that you
provide, your RangeTracker must implement try_claim, try_split,
position_at_fraction, and fraction_consumed to allow your source to work
with the Dynamic Work Rebalancing feature.

Java: SDK 1.x

If your pipeline uses a custom data source that you
provide, you must implement the method splitAtFraction to allow your source to work
with the Dynamic Work Rebalancing feature.

Caution: Using Dynamic Work Rebalancing with custom data sources is an extremely advanced
use case. If you choose to implement splitAtFraction, it is critical that you test
your code extensively and with maximum code coverage.

If you implement splitAtFraction incorrectly, records from your source may appear to
get duplicated or dropped.
See the
API reference information on RangeTracker for help and tips on implementing
splitAtFraction.

Resource Usage and Management

The Cloud Dataflow service fully manages resources in GCP on a per-job basis. This
includes spinning up and shutting down Compute Engine instances
(occasionally referred to as workers or VMs) and accessing your project's
Cloud Storage buckets for both I/O and temporary file staging.
However, if your pipeline interacts with GCP data storage technologies like
BigQuery and Cloud Pub/Sub, you
must manage the resources and quota for those services.

Cloud Dataflow uses a user-provided location in Cloud Storage
specifically for staging files. This location is under your control, and you should ensure that
it exists for as long as any job is reading from it. You can reuse the
same staging location for multiple job runs, since the SDK's built-in caching can speed up the
start time for your jobs.

Caution: Manually altering Cloud Dataflow-managed Compute Engine resources associated with a
Cloud Dataflow job is an unsupported operation. You should not attempt to manually stop, delete, or
otherwise control the Compute Engine instances that Cloud Dataflow has created to run your job. In
addition, you should not alter any persistent disk resources associated with your Cloud Dataflow
job.

Jobs

You may run up to 25 concurrent Cloud Dataflow jobs per GCP project.

The Cloud Dataflow service is currently limited to processing job requests that are 10MB in size or
smaller. The size of the job request is specifically tied to the JSON representation of your
pipeline; a larger pipeline means a larger request.

To estimate the size of your pipeline's JSON request, run your pipeline with the following
option:

Java: SDK 2.x

--dataflowJobFile=< path to output file >

Python

--dataflow_job_file=< path to output file >

Java: SDK 1.x

--dataflowJobFile=< path to output file >

This command writes a JSON representation of your job to a file. The size of the serialized file
is a good estimate of the size of the request; the actual size will be slightly larger due to some
additional information included in the request.
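For example, with the Java SDK you might launch your main class with the flag and then check the
resulting file's size (the class name and path are placeholders):

    java -cp my-pipeline.jar com.example.MyPipeline \
        --runner=DataflowRunner \
        --dataflowJobFile=/tmp/job.json
    # The size of /tmp/job.json approximates the size of the job request.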

Workers

The Cloud Dataflow service currently allows a maximum of 1000 Compute Engine instances per job.
The default machine type is n1-standard-1 for a batch job, and
n1-standard-4 for streaming; when using the default machine types, the Cloud Dataflow
service can therefore allocate up to 4000 cores per job.

Note: The Cloud Dataflow managed service now deploys Compute Engine virtual machines
associated with Cloud Dataflow jobs using Managed Instance
Groups. A Managed Instance Group creates multiple Compute Engine instances from a common
template and allows you to control and manage them as a group. That way, you don't have to
individually control each instance associated with your pipeline.

You should not attempt to manage or otherwise interact directly with your Compute Engine
Managed Instance Group; the Cloud Dataflow service will take care of that for you. Manually altering
any Compute Engine resources associated with your Cloud Dataflow job is an unsupported operation.

Resource Quota

The Cloud Dataflow service checks to ensure that your GCP project has the Compute Engine
resource quota required to run your job, both to start the job and scale to the maximum number of
worker instances. Your job will fail to start if there is not enough resource quota available.

If your Cloud Dataflow job deploys Compute Engine virtual machines as a Managed Instance Group,
you'll need to ensure that your project satisfies some additional quota requirements.
Specifically, your project will need additional quota for the Managed Instance Group resources
that each concurrent Cloud Dataflow job creates: one Instance Group Manager, one Managed Instance
Group, and one Instance Template per job.
Cloud Dataflow's Autoscaling feature is limited by your project's
available Compute Engine quota. If your job has sufficient quota when it starts, but another job
uses the remainder of your project's available quota, the first job will run but not be able to
fully scale.

However, the Cloud Dataflow service does not manage quota increases for jobs that exceed the
resource quotas in your project. You are responsible for making any necessary requests for
additional resource quota, for which you can use the
Google Cloud Platform Console.

Persistent Disk Resources

The Cloud Dataflow service is currently limited to 15 persistent disks per worker instance when
running a streaming job. Each persistent disk is local to an individual Compute Engine virtual
machine. Your job may not have more workers than persistent disks; a 1:1 ratio between workers and
disks is the minimum resource allotment.

The default size of each persistent disk is 250 GB in batch mode and 400 GB in
streaming mode.

Locations

By default, the Cloud Dataflow service deploys
Compute Engine resources in the us-central1-f zone of the
us-central1 region. You can override this setting by
specifying the
--region parameter. If you need to use a specific zone for your
resources, use the --zone parameter when you create your pipeline.
However, we recommend that you only specify the region, and leave the zone
unspecified. This allows the Cloud Dataflow service to automatically
select the best zone within the region based on the available zone capacity at
the time of the job creation request. For more information, see the
regional endpoints documentation.

Streaming Engine

Beta

This is a beta release of Streaming Engine. This feature might be changed in
backward-incompatible ways and is not subject to any SLA or deprecation policy.

Benefits of Streaming Engine

The Streaming Engine model has the following benefits:

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.
The Streaming Engine works best with smaller worker machine types (n1-standard-2 instead of
n1-standard-4) and does not require Persistent Disk beyond a small worker boot disk,
leading to less resource and quota consumption.

More responsive autoscaling
in response to variations in incoming data volume. The Streaming Engine offers smoother,
more granular scaling of workers.

Improved supportability, since you don’t need to redeploy your pipelines to apply service updates.

Most of the reduction in worker resources comes from offloading the work to the Cloud Dataflow service.
For that reason, there is a charge
associated with the use of the Streaming Engine. However, the total bill for Cloud Dataflow pipelines
using the Streaming Engine is expected to be approximately the same as
the total cost of Cloud Dataflow pipelines that do not use this option.

Using Streaming Engine

Note: Streaming Engine is currently available in beta in
the us-central1 and europe-west1 regions. It will
become available in additional regions in the future.

Java: SDK 2.x

To use the Streaming Engine for your streaming pipelines, specify the parameter
--experiments=enable_streaming_engine.

The Streaming Engine can run jobs in the us-central1
and europe-west1 regions; specify --region=europe-west1 or --region=us-central1,
or set a --zone in one of these two regions. If you specify a region outside
of the two supported regions, Cloud Dataflow will report an error.

The Streaming Engine works best with smaller worker machine types, so we recommend setting
--workerMachineType=n1-standard-2. You can also set --diskSizeGb=30
because the Streaming Engine only needs space for the worker boot image and local logs. These
values are the defaults if you don't set them explicitly.
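Putting these options together, a Streaming Engine job launched in a supported region might use:

    --experiments=enable_streaming_engine \
    --region=europe-west1 \
    --workerMachineType=n1-standard-2 \
    --diskSizeGb=30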

Python

This feature is not yet supported in the Apache Beam SDK for Python.

Java: SDK 1.x

Streaming Engine is not supported in the Cloud Dataflow SDK for Java version 1.x. To use
this feature, you must use Apache Beam SDK for Java 2.8.0 or greater.

Cloud Dataflow Shuffle

Cloud Dataflow Shuffle is the base operation behind Cloud Dataflow transforms such as
GroupByKey, CoGroupByKey, and Combine. The Cloud Dataflow
Shuffle operation partitions and groups data by key in a scalable, efficient, fault-tolerant
manner. Currently, Cloud Dataflow uses a shuffle implementation which runs entirely on worker
virtual machines and consumes worker CPU, memory, and Persistent Disk storage. The new
service-based Cloud Dataflow Shuffle feature, available for batch pipelines only, moves the
shuffle operation out of the worker VMs and into the Cloud Dataflow service backend.

Benefits of Cloud Dataflow Shuffle

The service-based Cloud Dataflow Shuffle has the following benefits:

Faster execution time of batch pipelines for the majority of pipeline job types.

A reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs.

Better autoscaling
since VMs no longer hold any shuffle data and can therefore be scaled down earlier.

Better fault tolerance; an unhealthy VM holding Cloud Dataflow Shuffle
data will not cause the entire job to fail, as would happen if not using
the feature.

Most of the reduction in worker resources comes from offloading the shuffle work to the Cloud Dataflow service.
For that reason, there is a charge associated with the use of Cloud Dataflow Shuffle. However, the total bill for Cloud Dataflow pipelines
using the service-based Cloud Dataflow implementation is expected to be less than or equal
to the cost of Cloud Dataflow pipelines that do not use this option.

For the majority of pipeline job types, Cloud Dataflow Shuffle is expected to execute
faster than the shuffle implementation running on worker VMs. However, the execution times
might vary from run to run. If you are running a pipeline that has important deadlines, we
recommend allocating sufficient buffer time before the deadline. In addition, consider requesting
a bigger quota for Shuffle.

Disk considerations

When using the service-based Cloud Dataflow Shuffle feature, you do not need to attach large
Persistent Disks to your worker VMs. Cloud Dataflow automatically attaches a small 25GB boot
disk. However, due to this small disk size, there are important considerations to be aware of
when using Cloud Dataflow Shuffle.

A worker VM uses part of the 25GB of disk space for the operating system,
binaries, logs, and containers. Jobs that use a significant amount of disk and
exceed the remaining disk capacity may fail when you use Cloud Dataflow Shuffle.

Jobs that use a lot of disk I/O may be slow due to the performance of the small disk.
See the Compute Engine Persistent Disk Performance
page for more information about performance differences between disk
sizes.

If any of these considerations apply to your job, you can use
pipeline options
to specify a larger disk size.

Using Cloud Dataflow Shuffle

Note: Service-based Cloud Dataflow Shuffle is currently available in
the us-central1 (Iowa) region and in the europe-west1 region. It will
become available in additional regions in the future.

Java: SDK 2.x

Service-based Cloud Dataflow Shuffle can be turned on in batch Cloud Dataflow jobs. To turn on the
service-based Cloud Dataflow Shuffle in your batch pipelines, specify the following parameter:
--experiments=shuffle_mode=service.

Do not specify the --zone parameter if you want to use the service-based
Cloud Dataflow Shuffle feature. Instead, use the --region parameter with the
us-central1 or europe-west1 values. Cloud Dataflow will auto-select the
best zone in the us-central1 or europe-west1 region to run the Cloud Dataflow
job in. If you specify a zone outside of the us-central1 or europe-west1
region, along with the --experiments=shuffle_mode=service option, Cloud Dataflow will
report an error.
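For example:

    --experiments=shuffle_mode=service --region=us-central1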

Python

To use Cloud Dataflow Shuffle with the Apache Beam SDK for Python, you
must have version 2.1.0 or higher.

Service-based Cloud Dataflow Shuffle can be turned on in batch Cloud Dataflow jobs. To turn on the
service-based Cloud Dataflow Shuffle in your batch pipelines, specify the following parameter:
--experiments shuffle_mode=service.

Do not specify the --zone parameter if you want to use the service-based
Cloud Dataflow Shuffle feature. Instead, use the --region parameter with the
us-central1 or europe-west1 values. Cloud Dataflow will auto-select the
best zone in the us-central1 or europe-west1 region to run the Cloud Dataflow
job in. If you specify a zone outside of the us-central1 or europe-west1
region, along with the --experiments shuffle_mode=service option, Cloud Dataflow will
report an error.

Java: SDK 1.x

Service-based Cloud Dataflow Shuffle can be turned on in batch Cloud Dataflow jobs. To turn on the
service-based Cloud Dataflow Shuffle in your batch pipelines, specify the following parameter:
--experiments=shuffle_mode=service.

Do not specify the --zone parameter if you want to use the service-based
Cloud Dataflow Shuffle feature. Instead, use the --region parameter with the
us-central1 or europe-west1 values. Cloud Dataflow will auto-select the
best zone in the us-central1 or europe-west1 region to run the Cloud Dataflow
job in. If you specify a zone outside of the us-central1 or europe-west1
region, along with the --experiments=shuffle_mode=service option, Cloud Dataflow will
report an error.