Loading ORC data from Cloud Storage

This page provides an overview of loading ORC data from Cloud Storage into
BigQuery.

ORC is an
open source column-oriented data format that is widely used in the Apache Hadoop
ecosystem.

When you load ORC data from Cloud Storage, you can load the data into a new
table or partition, or you can append to or overwrite an existing table or
partition. When your data is loaded into BigQuery, it is
converted into columnar format for Capacitor
(BigQuery's storage format).

When you load data from Cloud Storage into a BigQuery table,
the dataset that contains the table must be in the same regional or
multi-regional location as the Cloud Storage bucket.

ORC schemas

When you load ORC files into BigQuery, the table schema is
automatically retrieved from the self-describing source data. When
BigQuery retrieves the schema from the source data, the
alphabetically last file is used.

For example, suppose you have the following ORC files in Cloud Storage:

gs://mybucket/00/
a.orc
z.orc
gs://mybucket/01/
b.orc

The following command loads all of the files in a single CLI invocation (as a
comma-separated list), and the schema is derived from mybucket/01/b.orc
(mydataset.mytable below is a placeholder destination table):
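# mydataset.mytable is a placeholder destination table.
bq load \
    --source_format=ORC \
    mydataset.mytable \
    "gs://mybucket/00/*.orc","gs://mybucket/01/*.orc"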

When BigQuery detects the schema, some ORC data types are
converted to BigQuery data types to make them compatible with
BigQuery SQL syntax. All fields in the detected schema are
NULLABLE. For more information, see
ORC conversions.

When you load multiple ORC files that have different schemas, identical
fields (with the same name and same nested level) specified in multiple
schemas must map to the same converted BigQuery data type in
each schema definition.

ORC compression

Compressed ORC files are not supported, but compressed file footers and
stripes are. Supported compression types are Zlib, Snappy, LZO, and LZ4.

Required permissions

When you load data into BigQuery, you need permissions to run a
load job and permissions that allow you to load data into new or existing
BigQuery tables and partitions. If you are loading data from
Cloud Storage, you also need permissions to access the bucket that
contains your data.

BigQuery permissions

At a minimum, the following permissions are required to load data into
BigQuery. These permissions are required if you are loading data
into a new table or partition, or if you are appending or overwriting a table or
partition.

bigquery.tables.create

bigquery.tables.updateData

bigquery.jobs.create

The following predefined Cloud IAM roles include both
bigquery.tables.create and bigquery.tables.updateData permissions:

bigquery.dataEditor

bigquery.dataOwner

bigquery.admin

The following predefined Cloud IAM roles include bigquery.jobs.create
permissions:

bigquery.user

bigquery.jobUser

bigquery.admin

In addition, if a user has bigquery.datasets.create permissions, when that
user creates a dataset, they are granted bigquery.dataOwner access to it.
bigquery.dataOwner access gives the user the ability to create and
update tables in the dataset via a load job.
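For example, a predefined role can be granted at the project level with the
gcloud CLI. The following is a sketch; my-project and the member address are
placeholder values:

# Grant the predefined bigquery.dataEditor role to a user.
gcloud projects add-iam-policy-binding my-project \
    --member="user:loader@example.com" \
    --role="roles/bigquery.dataEditor"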

For more information on Cloud IAM roles and permissions in
BigQuery, see Access control.

Cloud Storage permissions

In order to load data from a Cloud Storage bucket, you must be granted
storage.objects.get permissions. If you are using a URI wildcard,
you must also have storage.objects.list permissions.

The predefined Cloud IAM role storage.objectViewer
can be granted to provide both storage.objects.get and storage.objects.list
permissions.
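For example, the role can be granted on the source bucket with the gsutil
tool. The following is a sketch; mybucket and the member address are
placeholder values:

# Grant storage.objectViewer on the bucket that contains the source data.
gsutil iam ch user:loader@example.com:objectViewer gs://mybucket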

Loading ORC data into a new table

To load ORC data from Cloud Storage into a new BigQuery table:

Console

In the navigation panel, in the Resources section, expand your
project and select a dataset.

On the right side of the window, in the details panel, click
Create table. The process for loading data is the same as the
process for creating an empty table.

On the Create table page, in the Source section:

For Create table from, select Cloud Storage.

In the source field, browse to or enter the Cloud Storage URI.
Note that you cannot include multiple URIs in the
GCP Console, but
wildcards
are supported. The Cloud Storage bucket must be in the same location
as the dataset that contains the table you're creating.

For File format, select ORC.

On the Create table page, in the Destination section:

For Dataset name, choose the appropriate dataset.

Verify that Table type is set to Native table.

In the Table name field, enter the name of the table you're
creating in BigQuery.

In the Schema section, no action is necessary. The schema is
self-described in ORC files.

(Optional) To partition the table, choose your options in the
Partition and cluster settings:

To create a partitioned table,
click No partitioning, select Partition by field and choose a
DATE or TIMESTAMP column. This option is unavailable if your schema
does not include a DATE or TIMESTAMP column.

(Optional) For Partitioning filter, click the Require partition
filter box to require users to include a WHERE clause that specifies the
partitions to query. Requiring a partition filter may reduce cost and
improve performance. For more information, see
Querying partitioned tables.
This option is unavailable if No partitioning is selected.

(Optional) To cluster the
table, in the Clustering order box, enter between one and four field
names. Currently, clustering is supported only for partitioned tables.

(Optional) Click Advanced options.

For Write preference, leave Write if empty selected. This
option creates a new table and loads your data into it.

For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job fails and
an invalid message is returned.

Note: When you load data into an empty table by using the
GCP Console, you cannot add a label, description, table
expiration, or partition expiration.

After the table is created, you can update the table's expiration,
description, and labels, but you cannot add a partition expiration after a
table is created using the GCP Console. For more information, see
Managing tables.

Classic UI

In the navigation panel, hover on a dataset, click the down arrow
icon, and
click Create new table. The process for loading data is the same as the
process for creating an empty table.

On the Create Table page, in the Source Data section:

Click Create from source.

For Location, select Cloud Storage and in the source
field, enter the Cloud Storage URI. Note that you
cannot include multiple URIs in the BigQuery web UI, but wildcards
are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you're creating.

For File format, select ORC.

In the Destination Table section:

For Table name, choose the appropriate dataset, and in the table
name field, enter the name of the table you're creating in
BigQuery.

Verify that Table type is set to Native table.

In the Schema section, no action is necessary. The schema is
self-described in ORC files.

(Optional) In the Options section:

For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job fails and
an invalid message is returned.

For Write preference, leave Write if empty selected. This
option creates a new table and loads your data into it.

To partition the table:

For Partitioning Type, click None and choose Day.

For Partitioning Field:

To create a partitioned table,
choose a DATE or TIMESTAMP column. This option is unavailable if
your schema does not include a DATE or TIMESTAMP column.

Click the Require partition filter box to require users
to include a WHERE clause that specifies the partitions to query.
Requiring a partition filter may reduce cost and improve performance.
For more information, see Querying partitioned tables.
This option is unavailable if Partitioning type is set to
None.

To cluster the
table, in the Clustering fields box, enter between one and four field
names.

Note: When you create an empty table by using the web UI,
you cannot add a label, description, table expiration, or partition
expiration.

After the table is created, you can update the table's expiration,
description, and labels, but you cannot add a partition expiration after a
table is created using the web UI. For more information, see
Managing tables.

CLI

Use the bq load command, specify ORC as the source_format, and include a
Cloud Storage URI.
You can include a single URI, a comma-separated list of URIs, or a URI
containing a wildcard.

(Optional) Supply the --location flag and set the value to your
location.

Other optional flags include:

--max_bad_records: An integer that specifies the maximum number of bad
records allowed before the entire job fails. The default value is 0. At
most, five errors of any type are returned regardless of the
--max_bad_records value.

--time_partitioning_type: Enables time-based partitioning on a table and
sets the partition type. Currently, the only possible value is DAY which
generates one partition per day. This flag is optional when you create a
table partitioned on a DATE or TIMESTAMP column.

--time_partitioning_expiration: An integer that specifies (in seconds)
when a time-based partition should be deleted. The expiration time evaluates
to the partition's UTC date plus the integer value.

--require_partition_filter: When enabled, this option requires users
to include a WHERE clause that specifies the partitions to query.
Requiring a partition filter may reduce cost and improve performance.
For more information, see Querying partitioned tables.

--clustering_fields: A comma-separated list of up to four column names
used to create a clustered table.
This flag can only be used with partitioned tables.

--destination_kms_key: The Cloud KMS key for encryption of the
table data.

The --location flag is optional. For example, if you are using
BigQuery in the Tokyo region, you can set the flag's value to
asia-northeast1. You can set a default value for the location by using the
.bigqueryrc file.
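For example, the following commands are sketches; mydataset.mytable,
mydataset.mytable2, the bucket, and the column names are placeholder values:

# Load a single ORC file into a new table.
bq load \
    --location=asia-northeast1 \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc

# Load files matching a wildcard URI into a partitioned, clustered table
# (mytimestamp and mykey are placeholder column names; the
# --time_partitioning_field flag names the DATE or TIMESTAMP column).
bq load \
    --source_format=ORC \
    --time_partitioning_type=DAY \
    --time_partitioning_field=mytimestamp \
    --clustering_fields=mykey \
    mydataset.mytable2 \
    "gs://mybucket/mydata*.orc"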

API

(Optional) Specify your location in
the location property in the jobReference section of the job resource.

The sourceUris property values must be fully qualified, in the format
gs://bucket/object.
Each URI can contain one '*' wildcard character.

Specify the ORC data format by setting the sourceFormat property to
ORC.
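For example, a load job can be created with a plain HTTP request to
jobs.insert. The following curl sketch uses placeholder values for the
project, dataset, table, source URI, job ID, and location; the
client-generated job ID follows the best practice noted below:

# Insert a load job with jobs.insert (all identifiers are placeholders).
curl -X POST \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -H "Content-Type: application/json" \
    -d '{
      "jobReference": {
        "projectId": "my-project",
        "jobId": "orc_load_0001",
        "location": "US"
      },
      "configuration": {
        "load": {
          "sourceUris": ["gs://mybucket/mydata.orc"],
          "sourceFormat": "ORC",
          "destinationTable": {
            "projectId": "my-project",
            "datasetId": "mydataset",
            "tableId": "mytable"
          }
        }
      }
    }' \
    "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/jobs"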

To check the job status, call
jobs.get(job_id),
where job_id is the ID of the job returned by the initial
request.
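For example (a sketch; the project ID, job ID, and location are the
placeholder values from the request above):

# Poll the job status with jobs.get.
curl \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/jobs/orc_load_0001?location=US"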

If status.state = DONE, the job has completed.

If the status.errorResult property is present, the request failed,
and that object will include information describing what went wrong.
When a request fails, no table is created and no data is loaded.

If status.errorResult is absent, the job finished successfully,
although there might have been some non-fatal errors, such as problems
importing a few rows. Non-fatal errors are listed in the returned job
object's status.errors property.

API notes:

Load jobs are atomic and consistent; if a load job fails, none of the data
is available, and if a load job succeeds, all of the data is available.

As a best practice, generate a unique ID and pass it as
jobReference.jobId when calling jobs.insert to create a load job. This
approach is more robust to network failure because the client can poll or
retry on the known job ID.

Calling jobs.insert on a given job ID is idempotent. You can retry as
many times as you like on the same job ID, and at most one of those
operations will succeed.

Appending to or overwriting a table with ORC data

To append to or overwrite a table with ORC data from Cloud Storage:

Console

In the navigation panel, in the Resources section, expand your
project and select a dataset.

On the right side of the window, in the details panel, click
Create table. The process for appending and overwriting data in a load
job is the same as the process for creating a table in a load job.

On the Create table page, in the Source section:

For Create table from, select Cloud Storage.

In the source field, browse to or
enter the Cloud Storage URI. Note that you cannot
include multiple URIs in the GCP Console, but wildcards
are supported. The Cloud Storage bucket must be in the same location
as the dataset that contains the table you're appending or overwriting.

For File format, select ORC.

On the Create table page, in the Destination section:

For Dataset name, choose the appropriate dataset.

In the Table name field, enter the name of the table you're
appending or overwriting in BigQuery.

Verify that Table type is set to Native table.

In the Schema section, no action is necessary. The schema is
self-described in ORC files.

Note: It is possible to modify the table's schema when you append or
overwrite it. For more information on supported schema changes during a
load operation, see
Modifying table schemas.

For Partition and cluster settings, leave the default values. You
cannot convert a table to a partitioned or clustered table by appending or
overwriting it, and the GCP Console does not support
appending to or overwriting partitioned or clustered tables in a load job.

Click Advanced options.

For Write preference, choose Append to table or Overwrite
table.

For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job fails and
an invalid message is returned.

Classic UI

In the navigation panel, hover on a dataset, click the down arrow
icon, and
click Create new table. The process for appending and overwriting data
in a load job is the same as the process for creating a table in a load job.

On the Create Table page, in the Source Data section:

For Location, select Cloud Storage and in the source
field, enter the Cloud Storage URI. Note that you
cannot include multiple URIs in the web UI, but wildcards
are supported. The Cloud Storage bucket must be in the same
location as the dataset that contains the table you're appending or
overwriting.

For File format, select ORC.

On the Create Table page, in the Destination Table section:

For Table name, choose the appropriate dataset, and in the table
name field, enter the name of the table you're appending or
overwriting.

Verify that Table type is set to Native table.

In the Schema section, no action is necessary. Schema information is
self-described in ORC files.

Note: It is possible to modify the table's schema when you append or
overwrite it. For more information on supported schema changes during a
load operation, see
Modifying table schemas.

In the Options section:

For Number of errors allowed, accept the default value of 0 or
enter the maximum number of rows containing errors that can be ignored.
If the number of rows with errors exceeds this value, the job fails and
an invalid message is returned.

For Write preference, choose Append to table or Overwrite
table.

Leave the default values for Partitioning Type, Partitioning
Field, Require partition filter, and Clustering Fields. You
cannot convert a table to a partitioned or clustered table by appending
or overwriting it, and the web UI does not support appending to
or overwriting partitioned or clustered tables in a load job.

CLI

Enter the bq load command with the --replace flag to overwrite the
table. Use the --noreplace flag to append data to the table. If no flag is
specified, the default is to append data. Supply the --source_format flag
and set it to ORC. Because ORC schemas are automatically retrieved
from the self-describing source data, you do not need to provide a schema
definition.

Note: It is possible to modify the table's schema when you append or
overwrite it. For more information on supported schema changes during a
load operation, see
Modifying table schemas.

(Optional) Supply the --location flag and set the value to your
location.

Other optional flags include:

--max_bad_records: An integer that specifies the maximum number of bad
records allowed before the entire job fails. The default value is 0. At
most, five errors of any type are returned regardless of the
--max_bad_records value.

--destination_kms_key: The Cloud KMS key for encryption of the
table data.
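For example, the following commands are sketches; mydataset.mytable and the
source URI are placeholder values:

# Overwrite the table with the contents of the ORC file.
bq load \
    --replace \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc

# Append the contents of the ORC file to the table.
bq load \
    --noreplace \
    --source_format=ORC \
    mydataset.mytable \
    gs://mybucket/mydata.orc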

ORC conversions

timestamp

TIMESTAMP. ORC supports nanosecond precision, but BigQuery converts
sub-microsecond values to microseconds when the data is read.

decimal

NUMERIC or STRING

NUMERIC types are exact numeric values with 38 digits of precision and
9 decimal digits of scale. See
NUMERIC type
for details. If a decimal type in an ORC schema has a scale of at most 9 and
a precision minus scale of at most 29, it is converted to NUMERIC; otherwise,
it is converted to STRING, and a warning message is returned. For example,
decimal(38,9) converts to NUMERIC (scale 9, precision minus scale 29), while
decimal(38,10) converts to STRING (its scale exceeds 9).

Complex types

map<K,V>

Repeated RECORD. An ORC map<K,V> field is converted to a repeated RECORD
that contains two fields: a key of the same data type as K, and a value of
the same data type as V. Both fields are NULLABLE.

list

Repeated fields. Nested lists and lists of maps are not supported.

union

RECORD. When a union has only one variant, it is converted to a NULLABLE
field. Otherwise, a union is converted to a RECORD with a list of NULLABLE
fields. The NULLABLE fields have suffixes such as field_0, field_1, and so
on. Only one of these fields is assigned a value when the data is read.

Column names

A column name must contain only letters (a-z, A-Z), numbers (0-9), or
underscores (_), and it must start with a letter or underscore. The maximum
column name length is 128 characters. A column name cannot use any of the
following prefixes:

_TABLE_

_FILE_

_PARTITION

Duplicate column names are not allowed even if the case differs. For example,
a column named Column1 is considered identical to a column named column1.
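These rules can be checked mechanically before a load. The following bash
sketch is illustrative only; the function name and sample values are not
part of BigQuery:

# Validate a proposed column name against the character, length, and
# prefix rules above (the case-insensitive duplicate rule is not checked).
valid_column_name() {
  local name="$1"
  # Letters, digits, or underscores only; must start with a letter or
  # underscore; 128 characters maximum.
  [[ "$name" =~ ^[A-Za-z_][A-Za-z0-9_]{0,127}$ ]] || return 1
  # Reserved prefixes are not allowed.
  case "$name" in
    _TABLE_*|_FILE_*|_PARTITION*) return 1 ;;
  esac
  return 0
}

valid_column_name "order_date" && echo "valid"
valid_column_name "_PARTITIONDATE" || echo "invalid: reserved prefix"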