In the Cloud Console, on the project selector page,
select or create a Cloud project.

Note: If you don't plan to keep the
resources that you create in this procedure, create a project instead of
selecting an existing project. After you finish these steps, you can
delete the project, removing all resources associated with the project.

Set local environment variables. On your local machine, set environment
variables for your Google Cloud project ID and the name of the
Cloud Storage bucket you will use for this tutorial. Also provide the
name and region of a new or existing Dataproc cluster.
You can create a cluster to use in this tutorial in the next step.
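A minimal sketch of these variables (all values shown are placeholders to
replace with your own):

    export PROJECT=project-id        # your Google Cloud project ID
    export BUCKET_NAME=bucket-name   # Cloud Storage bucket for tutorial files
    export CLUSTER=cluster-name      # name of a new or existing Dataproc cluster
    export REGION=us-central1        # example region; use your cluster's region

Create a Dataproc cluster. The following is a sketch of the create command;
the --single-node flag is an assumption to keep the cluster small for a
tutorial, and you can omit it or adjust the options as needed:

    gcloud dataproc clusters create ${CLUSTER} \
        --project=${PROJECT} \
        --region=${REGION} \
        --single-node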

The above command creates a cluster that uses the default
cluster image version.
You can use the --image-version flag to select a different image version
for your cluster. Each image version installs
specific versions of the Spark and Scala library components. If you
prepare the Spark wordcount job
in Java or Scala, you will reference the Spark and Scala versions installed
on your cluster when you prepare the job package.
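For example (2.2 is shown only as a sample version; check the Dataproc
documentation for the available image versions):

    gcloud dataproc clusters create ${CLUSTER} \
        --project=${PROJECT} \
        --region=${REGION} \
        --image-version=2.2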

Java

Copy the pom.xml file to your local machine.
The following pom.xml file specifies the Scala and Spark library
dependencies, which are given a provided scope to
indicate that the Dataproc cluster will provide these
libraries at runtime. The pom.xml file does not specify a
Cloud Storage dependency because the connector implements the standard
HDFS interface: when a Spark job accesses Cloud Storage cluster files
(files with URIs that start with gs://), the system
automatically uses the Cloud Storage connector to access the
files in Cloud Storage.
Check your cluster image version.
Replace the version placeholders in the file with the
Spark and Scala library versions used by your cluster's
image version.
Note that the number appended to the spark-core artifact ID
(for example, spark-core_2.12) is the Scala major.minor version number.
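A minimal sketch of such a pom.xml (the groupId, artifactId, and version
values below are illustrative placeholders, not the tutorial's exact file;
the Scala and Spark versions are examples to replace with your cluster's
versions):

    <?xml version="1.0" encoding="UTF-8"?>
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                                 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>dataproc.codelab</groupId>   <!-- placeholder -->
      <artifactId>word-count</artifactId>   <!-- placeholder -->
      <version>1.0</version>

      <dependencies>
        <!-- Scala library: match your cluster's Scala version. -->
        <dependency>
          <groupId>org.scala-lang</groupId>
          <artifactId>scala-library</artifactId>
          <version>2.12.18</version>   <!-- example placeholder -->
          <scope>provided</scope>
        </dependency>
        <!-- The spark-core_ suffix is the Scala major.minor version. -->
        <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.12</artifactId>
          <version>3.3.2</version>     <!-- example placeholder -->
          <scope>provided</scope>
        </dependency>
      </dependencies>
    </project>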

Scala

Copy the build.sbt file to your local machine.
The following build.sbt file specifies the Scala and Spark library
dependencies, which are given a provided scope to
indicate that the Dataproc cluster will provide these
libraries at runtime. The build.sbt file does not specify a
Cloud Storage dependency because the connector implements the standard
HDFS interface: when a Spark job accesses Cloud Storage cluster files
(files with URIs that start with gs://), the system
automatically uses the Cloud Storage connector to access the
files in Cloud Storage.
Check your cluster image version.
Replace the version placeholders in the file with the
Spark and Scala library versions used by your cluster's
image version.
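A minimal sketch of such a build.sbt (the name and organization values are
illustrative placeholders; the Scala and Spark version strings are examples
to replace with your cluster's versions):

    scalaVersion := "2.12.18"  // example placeholder: match your cluster's Scala version

    name := "word-count"                // placeholder
    organization := "dataproc.codelab"  // placeholder
    version := "1.0"

    libraryDependencies ++= Seq(
      // provided scope: the Dataproc cluster supplies these at runtime.
      "org.scala-lang" % "scala-library" % scalaVersion.value % "provided",
      // %% appends the Scala major.minor suffix (e.g., spark-core_2.12).
      "org.apache.spark" %% "spark-core" % "3.3.2" % "provided"  // example Spark version
    )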

Python

Copy word-count.py to your local machine.
This is a simple Spark job, written in Python with PySpark, that reads text
files from Cloud Storage, performs a word count, and then writes the
results to Cloud Storage as text files.
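A minimal sketch of such a word-count.py (the argument handling, app name,
and example URIs in the comments are assumptions, not the tutorial's exact
file):

    #!/usr/bin/env python
    import sys

    from pyspark.sql import SparkSession

    if len(sys.argv) != 3:
        raise Exception("Exactly 2 arguments are required: <inputUri> <outputUri>")

    input_uri = sys.argv[1]    # e.g., gs://your-bucket/input.txt
    output_uri = sys.argv[2]   # e.g., gs://your-bucket/wordcount-output

    spark = SparkSession.builder.appName("word-count").getOrCreate()

    # Read lines of text, split them into words, and count each word.
    lines = spark.read.text(input_uri).rdd.map(lambda row: row[0])
    words = lines.flatMap(lambda line: line.split())
    word_counts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

    # Write the (word, count) pairs to Cloud Storage as text files.
    word_counts.saveAsTextFile(output_uri)

    spark.stop()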

Cleaning up

After you've finished the Use Cloud Dataproc tutorial, you can clean up the
resources that you created on Google Cloud so they won't take up
quota and you won't be billed for them in the future. The following
sections describe how to delete or turn off these resources.

Deleting the project

The easiest way to eliminate billing is to delete the project that you
created for the tutorial.

To delete the project:

Caution: Deleting a project has the following effects:

Everything in the project is deleted. If you used an existing project for
this tutorial, when you delete it, you also delete any other work you've done in the project.

Custom project IDs are lost.
When you created this project, you might have created a custom project ID that you want to use in
the future. To preserve the URLs that use the project ID, such as an appspot.com
URL, delete selected resources inside the project instead of deleting the whole project.
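One way to delete the project is with the gcloud CLI (a sketch, assuming
your project ID is still in the ${PROJECT} variable set earlier):

    gcloud projects delete ${PROJECT}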

If you plan to explore multiple tutorials and quickstarts, reusing projects can help you avoid
exceeding project quota limits.