Importing data from non-Cloud Spanner databases

This page describes how to prepare Avro files that you exported from
non-Cloud Spanner databases and then import those files into Cloud Spanner. If you
want to import a Cloud Spanner database that you previously exported, see
Importing Cloud Spanner Avro files.

Quota requirements

The quota requirements for import jobs, by GCP service, are
as follows:

Cloud Spanner: You must have enough nodes to support the
amount of data that you are importing. No additional nodes are
required to import a database, though you might need to add more nodes
so that your job finishes in a reasonable amount of time. See
Optimizing jobs for more details.

Cloud Storage: To import, you must have a bucket
containing your previously exported files.
You do not need to set a size for your bucket.

Cloud Dataflow: Import jobs are subject to the same CPU,
disk usage, and IP address
Compute Engine quotas as
other Cloud Dataflow jobs.

Compute Engine: Before running your import job, you must set up
initial quotas for
Compute Engine, which Cloud Dataflow uses. These quotas
represent the maximum number of resources that you allow
Cloud Dataflow to use for your job. Recommended starting values are:

CPUs: 200

In-use IP addresses: 200

Standard persistent disk: 50 TB

Generally, you do not have to make any other adjustments.
Cloud Dataflow provides autoscaling so that you only pay for the
actual resources used during the import. If your job can make use of
more resources, the Cloud Dataflow UI displays a warning icon. The
job should finish even if there is a warning icon.
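If you want to confirm your current quotas before starting a job, you can read them from the Compute Engine API. The following is a minimal sketch using the google-api-python-client library; the project ID and region are placeholders, and the metrics to look at are CPUS, IN_USE_ADDRESSES, and DISKS_TOTAL_GB (standard persistent disk, in GB).

    from googleapiclient import discovery

    # Minimal sketch: read the Compute Engine quotas for the region where the
    # import job will run. "my-project" and "us-central1" are placeholders.
    compute = discovery.build("compute", "v1")
    region = compute.regions().get(
        project="my-project", region="us-central1"
    ).execute()

    for quota in region["quotas"]:
        # CPUS, IN_USE_ADDRESSES, and DISKS_TOTAL_GB correspond to the
        # recommended starting values listed above.
        print(quota["metric"], quota["usage"], quota["limit"])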

Cloud IAM requirements

To import a database, you also need to have Cloud IAM roles with
sufficient permissions to use all of the services involved in an import
job. For information on granting roles and permissions, see
Applying IAM roles.

To import a database, you need the following roles:

At the GCP project level:

Cloud Spanner Viewer

Dataflow Admin

Storage Admin

At the Cloud Spanner database or instance level, or at the
GCP project level:

Cloud Spanner Reader

Cloud Spanner Database Admin (required only for import jobs)
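You can grant project-level roles in the GCP Console or programmatically. As an illustration only, the following sketch adds a Dataflow Admin binding at the project level with the Resource Manager Python client; the project ID and principal are placeholders, and you would repeat the pattern for each role you need.

    from google.cloud import resourcemanager_v3
    from google.iam.v1 import policy_pb2

    # Illustrative sketch: add a project-level role binding. The project ID
    # and member below are placeholders.
    client = resourcemanager_v3.ProjectsClient()
    resource = "projects/my-project"

    policy = client.get_iam_policy(request={"resource": resource})
    policy.bindings.append(
        policy_pb2.Binding(
            role="roles/dataflow.admin",
            members=["user:import-operator@example.com"],
        )
    )
    client.set_iam_policy(request={"resource": resource, "policy": policy})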

Exporting data from a non-Cloud Spanner database to Avro files

The import process brings data in from Avro files located in a
Cloud Storage bucket. You can export data in Avro format from any source,
using any available method to do so.
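For example, the following sketch writes a handful of rows to an Avro file with the fastavro Python library; the table name, schema, and rows are purely illustrative, and any tool that produces valid Avro files works just as well.

    import fastavro

    # Illustrative schema and rows for a hypothetical Singers table.
    schema = fastavro.parse_schema({
        "type": "record",
        "name": "Singers",
        "fields": [
            {"name": "SingerId", "type": "long"},
            {"name": "FirstName", "type": ["null", "string"]},
            {"name": "LastName", "type": ["null", "string"]},
        ],
    })

    rows = [
        {"SingerId": 1, "FirstName": "Marc", "LastName": "Richards"},
        {"SingerId": 2, "FirstName": "Catalina", "LastName": "Smith"},
    ]

    # Write the rows to a local Avro file; upload the file to your
    # Cloud Storage bucket afterwards.
    with open("Singers.avro", "wb") as out:
        fastavro.writer(out, schema, rows)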

Choosing a region for your import job

The region to choose for your import job depends on whether your
Cloud Storage bucket uses a regional or multi-regional configuration.
To avoid network egress charges, choose a region that overlaps
with your Cloud Storage bucket's location.

Regional bucket locations

If your Cloud Storage bucket location is regional,
choose the same region for your import job, if that region is available, to
take advantage of free network usage.

If the same region is not available, egress charges will apply. Refer to the
Cloud Storage network egress pricing to choose a region
that will incur the lowest network egress charges.

Multi-regional bucket locations

If your Cloud Storage bucket location is multi-regional,
choose one of the regions that make up the multi-regional location to take
advantage of free network usage.

If an overlapping region is not available, egress charges will apply. Refer to
the Cloud Storage network egress pricing to choose a
region that will incur the lowest network egress charges.
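If you are not sure how your bucket is configured, you can check its location with the Cloud Storage Python client. A minimal sketch, assuming a placeholder bucket name:

    from google.cloud import storage

    # Minimal sketch: look up the bucket's location before choosing the
    # region for the import job. "my-export-bucket" is a placeholder.
    client = storage.Client()
    bucket = client.get_bucket("my-export-bucket")

    # A regional bucket reports a single region (for example, "US-CENTRAL1");
    # a multi-regional bucket reports a multi-region (for example, "US").
    print(bucket.location, bucket.location_type)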

Viewing or troubleshooting jobs in the Cloud Dataflow UI

After you start an import job, you can view details of the job, including
logs, in the Cloud Dataflow section of the GCP Console.

Viewing Cloud Dataflow job details

To see details for a currently running job:

Navigate to the Database details page for the database.

Click View job details in Cloud Dataflow in the job status
message for your import job.
Optimizing slow running import jobs

If your import job runs slowly, check the following:

Check the Compute Engine quotas: If the Cloud Dataflow UI displays a quota
warning icon, increasing the quotas for CPUs, in-use IP addresses, and
standard persistent disk might shorten the run time of the job, but you
might incur more Compute Engine charges.

Check the Cloud Spanner CPU utilization: If you see that the CPU
utilization for the instance is over 65%, you can
increase the number of nodes in that instance. The extra nodes add more
Cloud Spanner resources and the job should speed up, but you incur more
Cloud Spanner charges.
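If you decide to add nodes for the duration of the import, you can resize the instance programmatically. A minimal sketch with the Cloud Spanner instance admin Python client; the project ID, instance ID, and node count are placeholders, and you can lower the count again once the job finishes.

    from google.cloud import spanner_admin_instance_v1

    # Minimal sketch: raise the node count of an instance while an import
    # job runs. "my-project", "my-instance", and the count are placeholders.
    client = spanner_admin_instance_v1.InstanceAdminClient()
    name = client.instance_path("my-project", "my-instance")

    instance = client.get_instance(name=name)
    instance.node_count = 5  # pick a value that keeps CPU utilization under 65%

    operation = client.update_instance(
        instance=instance,
        field_mask={"paths": ["node_count"]},
    )
    operation.result()  # wait for the resize to complete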

Factors affecting import job performance

Several factors influence the time it takes to complete an import job.

Cloud Spanner database size: Processing more data takes more time
and resources.

Cloud Spanner database schema (including indexes): The number of tables,
the size of the rows, and the number of secondary indexes influence the time it
takes to run an import job.

Data location: Data is transferred between Cloud Spanner and
Cloud Storage using Cloud Dataflow. Ideally all three
components are located in the same region. If the components are not in the same
region, moving the data across regions slows the job down.

Number of Cloud Dataflow workers: By using autoscaling,
Cloud Dataflow chooses the number of workers for the job depending on
the amount of work that needs to be done. The number of workers will, however,
be capped by the quotas for CPUs, in-use IP addresses, and standard persistent
disk. The Cloud Dataflow UI displays a warning icon if it encounters
quota caps. In this situation, progress is slower, but the job should still
complete.

Existing load on Cloud Spanner: An import job adds
significant CPU load on a Cloud Spanner instance.
If the instance already has a substantial existing load, then the job runs more
slowly.

Number of Cloud Spanner nodes: If the CPU utilization for the instance
is over 65%, then the job runs more slowly.
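To check whether an instance is above the 65% guideline, you can read the Cloud Spanner CPU utilization metric from Cloud Monitoring. A minimal sketch with the Cloud Monitoring Python client, using placeholder project and instance IDs:

    import time
    from google.cloud import monitoring_v3

    # Minimal sketch: read the last ten minutes of CPU utilization for a
    # Cloud Spanner instance. "my-project" and "my-instance" are placeholders.
    client = monitoring_v3.MetricServiceClient()
    now = int(time.time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 600}}
    )

    series = client.list_time_series(
        request={
            "name": "projects/my-project",
            "filter": (
                'metric.type="spanner.googleapis.com/instance/cpu/utilization" '
                'AND resource.labels.instance_id="my-instance"'
            ),
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for ts in series:
        for point in ts.points:
            print(point.interval.end_time, point.value.double_value)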