Workflow templates—Overview

Beta

This is a beta release of Workflow Orchestration. This feature might be changed
in backward-incompatible ways and is not subject to any SLA or deprecation
policy. This feature is not intended for real-time usage in critical
applications.

Cloud Dataproc Workflows can be instantiated
directly without creating a WorkflowTemplate by using the
InstantiateInline API.
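
For example, a minimal sketch with the gcloud command-line tool (the
template.yaml file name is a placeholder; instantiate-from-file is assumed
here as the command-line counterpart of the InstantiateInline API):

gcloud beta dataproc workflow-templates instantiate-from-file \
    --file template.yaml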

The Cloud Dataproc
WorkflowTemplates API provides a flexible and easy-to-use mechanism for managing and executing
workflows. Create a workflow template, add one or more jobs to the template,
then instantiate the template. The instantiated template (workflow) will create
the cluster, run the jobs, then delete the cluster when the workflow is finished.
Workflow metadata includes a graph of workflow operations that can help you
monitor and analyze workflow progress and results.

Creating a workflow template does not create a Cloud
Dataproc cluster or create and submit jobs. Clusters and jobs associated with
clusters are only created when a workflow template is instantiated. Further,
you can create a template that does not create a new
cluster but, instead, runs the workflow on an existing cluster (see
Types of workflow templates).

In this document, an instantiated workflow template is
referred to as a "running workflow" or more simply as a "workflow."

Types of workflow templates

A workflow template can specify a managed cluster. The workflow will
create this "ephemeral" cluster to run workflow jobs, and will delete this
cluster when the workflow is finished.

Alternatively, a workflow template can specify one or more existing clusters,
via one or more user labels that were previously applied to the cluster(s).
Cloud Dataproc will select among the matching clusters to use in the workflow
(see Adding a cluster selector to a template). At the end of the workflow, the
clusters are not deleted.

Creating a template

gcloud
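
A minimal sketch (my-workflow is a placeholder template id):

gcloud beta dataproc workflow-templates create my-workflow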

REST API
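
A hedged sketch with curl (the v1beta2 endpoint, project id, and region are
assumptions; workflowTemplates.create takes the new template's id in the
request body):

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{"id": "my-workflow"}' \
    "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/global/workflowTemplates"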

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in
a future Cloud Dataproc release.

Adding a managed cluster to a template

A Cloud Dataproc managed cluster is created at the start of the workflow and
is used to run workflow jobs. At the end of the workflow, the managed
cluster is deleted.

gcloud

Use flags inherited from
gcloud dataproc clusters
create to configure the managed cluster (number of workers, master/worker
machine type, etc.). Cloud Dataproc will add a suffix to the cluster name to
ensure uniqueness.
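
For example, a minimal sketch (the cluster name, zone, and worker count are
placeholders):

gcloud beta dataproc workflow-templates set-managed-cluster my-workflow \
    --cluster-name my-managed-cluster \
    --zone us-central1-a \
    --num-workers 2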

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in
a future Cloud Dataproc release.

Adding a cluster selector to a template

Instead of using a managed cluster whose lifetime depends on the lifetime
of the workflow, you can specify a cluster selector that selects an
existing cluster to run workflow jobs. The selected cluster will not be
deleted at the end of the workflow.

You must specify at least one label when setting a cluster
selector for your workflow. Cloud Dataproc will select a cluster
whose label(s) match the specified selector label(s).
(See Creating and Using Cloud Dataproc Labels for
instructions on how to apply labels to Cloud Dataproc clusters). If more than one
label is passed to the selector, all labels must match before a cluster will
be selected. If more than one cluster matches the specified label(s),
Cloud Dataproc will choose the cluster with the most free YARN memory (one
workflow job can be run on one matching cluster, and another job can be run on a
different matching cluster).

You must also specify a zone. Cloud Dataproc
will run the workflow process in this zone
(see Available regions & zones
to choose a zone). Note that this parameter only affects the location where
the workflow process is run; it does not affect the selection of the cluster
where workflow jobs are run.
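
For example, a minimal sketch with the gcloud command-line tool (the label
key/value, zone, and the --zone flag itself are assumptions; the selector
matches any cluster labeled env=prod):

# --zone sets where the workflow process runs, not which cluster is selected.
gcloud beta dataproc workflow-templates set-cluster-selector my-workflow \
    --cluster-labels env=prod \
    --zone us-central1-a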

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in
a future Cloud Dataproc release.

Adding jobs to a template

All jobs run concurrently unless you specify one or more jobs that must finish
successfully before a job can start. You must provide a step-id
for each job. The id must be unique within the
workflow, but doesn't need to be unique globally.

gcloud

Use the job type and flags inherited from
gcloud dataproc jobs submit
to define the job to add to the template. You can optionally use the
--start-after flag, passing the step-id(s) of one or more other jobs in the
workflow, to have the job start only after those jobs complete.
Example:
Add Hadoop job "foo" to the "my-workflow" template.
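
A minimal sketch of this example (the jar path and job arguments are
placeholders):

gcloud beta dataproc workflow-templates add-job hadoop \
    --workflow-template my-workflow \
    --step-id foo \
    --jar file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    -- wordcount gs://my-bucket/input gs://my-bucket/output

A second job added with --start-after foo would wait for "foo" to finish
successfully before starting.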

REST API

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in
a future Cloud Dataproc release.

Running, monitoring, stopping, and updating a workflow template

The instantiation of a workflow template runs the workflow defined by the
template. Multiple (simultaneous) instantiations of a template are supported.
The workflow checks that all resource names are unique within the workflow
(e.g., no job id collisions). It also resolves cluster resources.

gcloud

gcloud beta dataproc workflow-templates instantiate template-id

REST API
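
A hedged sketch with curl (the v1beta2 endpoint, project id, and region are
assumptions; workflowTemplates.instantiate is called on the template resource):

curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/global/workflowTemplates/my-workflow:instantiate"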

Console

Support for Cloud Dataproc Workflows in the Google Cloud Platform Console will be added in
a future Cloud Dataproc release.

Workflow job failures

A failure in any node in a workflow will cause the workflow to fail.
Cloud Dataproc will seek to mitigate the effect of failures by causing all
concurrently executing jobs to fail and preventing subsequent jobs
from starting.

Updating a template

REST API

Note: As a guard against concurrent modifications, a request to update a
template must specify the current server version in the
workflowTemplate.version field.
To make an update to a template with the REST API:

Call workflowTemplates.get, which returns the current template with the version
field filled in with the current server version.
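
A hedged sketch of this read-modify-write cycle with curl (the v1beta2
endpoint, project id, and region are assumptions; the PUT body is the template
returned by get, with your edits applied and its version field left intact):

# Fetch the current template; the response includes the server version.
curl -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/global/workflowTemplates/my-workflow" \
    > template.json

# Edit template.json as needed (keep the "version" field), then update:
curl -X PUT \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d @template.json \
    "https://dataproc.googleapis.com/v1beta2/projects/my-project/regions/global/workflowTemplates/my-workflow"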