Run the Cloud Shell walkthrough

Understand Python example code

JSON: The Cloud Dataproc API makes extensive use of JSON. JSON objects are used as containers for both request and response data.
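As an illustration of these JSON containers, the following sketch builds a hypothetical request body for submitting a Spark job (the field names follow the Dataproc v1 REST reference; the cluster name, jar path, and arguments are placeholder values, not from this tutorial):

```python
import json

# Hypothetical JSON request body for a Dataproc jobs.submit call.
# Field names follow the v1 REST reference; values are placeholders.
request = {
    "job": {
        "placement": {"clusterName": "example-cluster"},
        "sparkJob": {
            "mainClass": "org.apache.spark.examples.SparkPi",
            "jarFileUris": ["file:///usr/lib/spark/examples/jars/spark-examples.jar"],
            "args": ["1000"],
        },
    }
}

wire_payload = json.dumps(request)   # what the client sends on the wire
decoded = json.loads(wire_payload)   # what the server parses back out
```

The client library handles this serialization for you; the sketch only shows the shape of the data that travels in each direction.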

Application Default Credentials

The Cloud Shell walkthrough in this tutorial
provides authentication by using your GCP project credentials.
When you run code locally, the recommended practice is to use
service account credentials
to authenticate your code. The following example uses
Application Default Credentials,
which check the GOOGLE_APPLICATION_CREDENTIALS
environment variable for the location of the service account key file
on the local machine. These credentials are passed to googleapiclient.discovery.build()
in the get_client() function, which returns a client for the Cloud Dataproc API.
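A minimal sketch of such a get_client() helper might look like the following. The lazy import and the credentials_path() helper are illustrative choices for this sketch, not part of the tutorial's code:

```python
import os

def get_client():
    """Return a client for the Cloud Dataproc API using
    Application Default Credentials."""
    # Imported lazily so this sketch needs google-api-python-client
    # only when the client is actually built.
    from googleapiclient import discovery
    # build() picks up Application Default Credentials, which read the
    # GOOGLE_APPLICATION_CREDENTIALS environment variable to locate the
    # service account key file on the local machine.
    return discovery.build('dataproc', 'v1')

def credentials_path():
    # Illustrative helper: where Application Default Credentials will
    # look for the service account key file.
    return os.environ.get('GOOGLE_APPLICATION_CREDENTIALS')
```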

Create a Cloud Dataproc cluster

You can create a new Cloud Dataproc cluster with the Cloud Dataproc
clusters.create API.

You must specify the following values when creating a cluster:

The project in which the cluster will be created

The name of the cluster

The region to use. If you specify the global region, you must also specify a zone.
If you specify a non-global region and set zone="",
Cloud Dataproc Auto Zone Placement will select a zone for your cluster.

You can also override default cluster settings. For example, you can specify the number of workers
(default = 2), the number of preemptible VMs (default = 0), and network settings (default = the default network). See the clusters.create API for more information.
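Putting the required values and an override together, a clusters.create request body could be assembled as in this sketch (the helper name, project ID, cluster name, and zone are placeholders; the field names follow the v1 REST reference):

```python
# Build a clusters.create request body with the required values and an
# override of the default worker count. All concrete values here are
# placeholders for illustration.
def make_cluster_body(project_id, cluster_name, zone_uri, num_workers=2):
    return {
        "projectId": project_id,          # project in which to create the cluster
        "clusterName": cluster_name,      # name of the cluster
        "config": {
            "gceClusterConfig": {"zoneUri": zone_uri},
            "workerConfig": {"numInstances": num_workers},
        },
    }

body = make_cluster_body("my-project-id", "example-cluster",
                         "us-central1-a", num_workers=4)
```

With a client from get_client(), this body would be passed to the clusters.create call for the target project and region.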

Because the job runs asynchronously, it must finish before its output is displayed.
You can check a job's status to determine whether it has finished.
After the job finishes, you can call projects.regions.jobs.get to get details about the job
and then inspect the job output for further details.
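The status check described above can be sketched as a small polling loop. The state names come from the Dataproc v1 JobStatus reference; the wait_for_job() helper and its get_job callable are illustrative, standing in for a wrapper around projects.regions.jobs.get:

```python
import time

# Terminal lifecycle states from the Dataproc v1 JobStatus reference.
TERMINAL_STATES = {"DONE", "ERROR", "CANCELLED"}

def is_finished(job):
    # jobs.get returns a Job resource whose status.state field holds
    # the job's current lifecycle state.
    return job.get("status", {}).get("state") in TERMINAL_STATES

def wait_for_job(get_job, interval=5):
    """Poll until the job reaches a terminal state.

    `get_job` is any zero-argument callable that returns the latest
    Job resource, e.g. a wrapper around projects.regions.jobs.get.
    """
    while True:
        job = get_job()
        if is_finished(job):
            return job
        time.sleep(interval)
```

Once wait_for_job() returns, the final Job resource can be inspected for the driver output location and other details.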