BigQuery for data warehouse practitioners

Updated September 2017

This article explains how to use
BigQuery
as a data warehouse, first mapping common data warehouse
concepts to those in BigQuery, and then describing how to perform
standard data-warehousing tasks in BigQuery.

Service model comparison

The following list maps standard data-warehouse concepts to their equivalents in
BigQuery:

Data warehouse: The BigQuery service replaces the typical hardware
setup for a traditional data warehouse. That is, it serves as a collective home
for all analytical data in an organization.

Data mart: Datasets are collections of tables that can be divided along business
lines or a given analytical domain. Each dataset is tied to a Google Cloud
project.

Tables and views: Tables and views function the same way in BigQuery
as they do in a traditional data warehouse.

Grants: Cloud Identity and Access Management (Cloud IAM) is
used to grant permission to perform specific actions in BigQuery.

Datasets

BigQuery organizes data tables into units called datasets. These datasets
are scoped to your Google Cloud project. When you reference a table from
the command line, in SQL queries, or in code, you refer to it by using the
following construct:

project.dataset.table

These multiple scopes—project, dataset, and table—can help you
structure your information logically.
You can use multiple datasets to separate tables pertaining to different
analytical domains, and you can use project-level scoping to isolate
datasets from each other according to your business needs.
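
For example, a query can refer to a table by its fully qualified name. The
project, dataset, and table names in this sketch are hypothetical:

SELECT order_id, order_total
FROM `myproject.sales.transactions`
LIMIT 10;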

Here is a structural overview of BigQuery:

Provisioning and system sizing

Unlike many traditional RDBMSs, BigQuery doesn't require you to
provision resources before you use it. BigQuery allocates storage and query
resources dynamically based on your usage patterns.

Storage resources are allocated as you consume them and deallocated
as you remove data or drop tables.

Query resources are allocated according to query type and
complexity. Each query uses some number of slots, which are units of
computation that comprise a certain amount of CPU and RAM.

You don't have to make a minimum usage commitment to use
BigQuery. The service allocates and charges for resources based
on your actual usage. By default, all BigQuery customers have
access to 2,000 slots for query operations. You can also reserve a fixed number
of slots for your project. For details about which approach to use, see the
Costs
section.

Note: To start using BigQuery, you create a project to host your
data, and then you enable billing. For instructions, see the
BigQuery Quickstart.

Storage management

Internally, BigQuery stores data in a proprietary columnar format
called
Capacitor,
which has a number of benefits for data warehouse workloads.
BigQuery uses a proprietary format because it can evolve in
tandem with the query engine, which takes advantage of deep knowledge of the
data layout to optimize query execution. BigQuery uses query
access patterns to determine the optimal number of physical shards and how they
are encoded.

The data is physically stored on Google's distributed file system, called
Colossus,
which ensures durability by using erasure encoding
to store redundant chunks of the data on multiple physical disks. Moreover,
the data is replicated to multiple data centers.

You can also run BigQuery queries on data outside of
BigQuery storage, such as data stored in Cloud Storage,
Google Drive, or Bigtable, by using
federated data sources.
However, these sources are not optimized for BigQuery operations, so they
might not perform as well as data stored in BigQuery storage.

Maintenance

BigQuery is a fully-managed service, which means that the
BigQuery engineering team takes care of updates and maintenance
for you. Upgrades shouldn't require downtime or hinder system performance.

Many traditional systems require resource-intensive vacuum processes to run at
various intervals to reshuffle and sort data blocks and recover space.
BigQuery has no equivalent of the vacuum process, because the
storage engine continuously manages and optimizes how data is stored and
replicated. Also, because BigQuery doesn't use indexes on tables,
you don't need to rebuild indexes.

Backup and recovery

BigQuery addresses backup and disaster recovery at the service
level. Also, by maintaining a complete 7-day history of changes against your
tables, BigQuery allows you to query a point-in-time snapshot
of your data by using either
table decorators
or FOR SYSTEM_TIME AS OF in the
FROM clause.
You can easily revert
changes without having to request a recovery from backups.
(When a table is explicitly deleted, its history is flushed after 7 days.)

Managing workflows

This section discusses administrative tasks, such as organizing datasets,
granting permissions, and onboarding work in BigQuery. The
section also discusses how to manage concurrent workloads, monitor the health of
your data warehouse, and audit user access.

Organizing datasets

You can segment datasets into separate projects based on class of data or
business unit, or consolidate them into common projects for simplicity.

You can invite a data analyst to collaborate on an existing dataset in any
limited role that you define. When a data analyst logs in to the
BigQuery web UI,
they see only the datasets that have been shared with them across projects.
The activities that they can perform against each dataset vary based on their
role for that dataset.

Granting permissions

In a traditional RDBMS, you grant permissions to view or modify tables by
creating SQL grants and applying them to a given user within the database
system. In addition, some RDBMSs allow you to grant permissions to users
in an external directory, such as LDAP. The BigQuery model for
managing users and permissions resembles the latter model.

BigQuery provides
predefined roles
for controlling access to resources. You can also create custom
Cloud IAM roles
consisting of your defined set of permissions, and then assign
those roles to users or groups. You can assign a role to a Google email address
or to a
G Suite Group.

An important aspect of operating a data warehouse is allowing shared but
controlled access against the same data to different groups of users. For
example, finance, HR, and marketing departments all access the same tables, but
their levels of access differ. Traditional data warehousing tools make this
possible by enforcing row-level security. You can achieve the same results in
BigQuery by defining
authorized views
and
row-level permissions.
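
For example, an authorized view can enforce row-level access by joining against
an access-control table and filtering on SESSION_USER(). This is a minimal
sketch; the project, dataset, table, and column names are hypothetical:

SELECT sales.*
FROM `myproject.warehouse.sales` AS sales
INNER JOIN `myproject.warehouse.access_control` AS acl
  ON acl.region = sales.region
WHERE acl.user_email = SESSION_USER();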

Onboarding

Traditionally, onboarding new data analysts involved significant
lead time. To enable analysts to run simple queries, you
had to show them where data sources resided and set up ODBC connections and
tools and access rights. Using Google Cloud, you can greatly accelerate
an analyst's time to productivity.

To onboard an analyst on Google Cloud, you
grant access to relevant project(s),
introduce them to the Google Cloud Console and BigQuery web UI,
and share some
queries to help them get acquainted with the data:

The
Cloud Console
provides a centralized view of all assets in your Google Cloud
environment. The most relevant asset to data analysts might be
Cloud Storage buckets,
where they can collaborate on files.

The BigQuery web UI presents the list of datasets that the
analyst has access to. Analysts can perform tasks in the
Cloud Console according to the role you grant them, such as
viewing metadata, previewing data, and executing, saving, and sharing
queries.

Managing workloads and concurrency

BigQuery limits the maximum rate of incoming requests and
enforces appropriate quotas on a per-project basis. Specific policies vary
depending on resource
availability, user profile, service usage history, and other factors. For
details, see the
BigQuery quota policy.

BigQuery offers two types of
query priorities:
interactive and batch. By default, BigQuery runs interactive
queries, which means that the query is executed as soon as possible. Interactive
queries count towards
query quotas.
Batch queries are queued and executed as soon as idle resources are available,
usually within a few minutes.

BigQuery doesn't support fine-grained prioritization of
interactive or batch queries. Given the speed and scale at which
BigQuery operates, many traditional workload issues aren't
applicable. If you need explicit query prioritization, you can
separate your sensitive workloads into a project with an explicit number
of reserved slots. Contact your Google representative to assist in becoming
a flat-rate customer.

Monitoring and auditing

You can monitor BigQuery using
Monitoring,
where various charts and alerts are defined based on
BigQuery metrics.
For example, you can
monitor system throughput using the Query Time metric or visualize query demand
trends based on the Slots Allocated metric. When you need to plan ahead for a
demanding query, you can use the Slots Available metric. To stay proactive about
system health, you can
create alerts
based on thresholds that you define. Monitoring provides a self-service
web-based portal. You can control access to the portal with a
Monitoring Workspace.

BigQuery automatically creates audit logs of user actions.
You can export audit logs to another BigQuery dataset in a batch or as a data
stream and use your preferred analysis tool
to visualize the logs. For details, see
Analyzing audit logs using BigQuery.

Managing data

This section discusses schema design considerations,
denormalization,
how partitioning works, and
methods for loading data into BigQuery. The section concludes
with a look at handling change in the warehouse while maintaining zero
analysis downtime.

Designing schema

Follow these general guidelines to design the optimal schema for
BigQuery:

Denormalize a dimension table that is larger than 10 gigabytes, unless you see
strong evidence that the costs of data manipulation (UPDATE and DELETE
operations) outweigh the benefits of optimal queries.

Keep a dimension table that is smaller than 10 gigabytes normalized, unless
the table rarely goes through UPDATE and DELETE operations.

Take full advantage of nested and repeated fields in denormalized tables.

Denormalization

The conventional method of denormalizing data involves writing a fact,
along with all its dimensions, into a flat table structure. For example, for
sales transactions, you would write each fact to a record,
along with the accompanying dimensions, such as order and customer information.

In contrast, the preferred method for denormalizing data takes advantage of
BigQuery's native support for nested and repeated structures in
JSON or Avro input data. Expressing records using nested and repeated structures
can provide a more natural representation of the underlying data. In the case of
the sales order, the outer part of a JSON structure contains the order and
customer information, and the inner part of the structure contains the individual
line items of the order, which are represented as nested, repeated
elements.

Expressing records by using nested and repeated fields simplifies loading
data from JSON or Avro files. After you've created such a
schema, you can perform SELECT, INSERT, UPDATE, and DELETE operations on
any individual fields by using dot notation, for example, Order.Item.SKU. For
examples, see the
BigQuery documentation.
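
As an additional illustration, a query against a hypothetical denormalized
Orders table with a nested, repeated Item field might look like the following
sketch, which uses dot notation to reach the nested fields:

SELECT
  OrderID,
  item.SKU,
  item.Quantity
FROM
  `myproject.sales.Orders`, UNNEST(Item) AS item
WHERE
  item.SKU = 'ABC-100';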

Advantages of denormalization

BigQuery is essentially an analytical engine. It supports
DML
actions, but it isn't meant to be used as an online transaction processing
(OLTP) store. The discussion about
Changing data
provides guidelines for dealing with changes while maintaining zero analysis
downtime and delivering optimal online analytical processing (OLAP) performance.
While normalized or partially normalized data structures, such as star schema or
snowflake, are suitable for update/delete operations, they aren't optimal for
OLAP workloads. When performing OLAP operations on normalized tables, multiple
tables have to be JOINed to perform the required aggregations.
JOINs
are possible with BigQuery and sometimes recommended on small
tables. However, they are typically not as performant as
denormalized structures.

The following graph compares query performance using JOINs to simple filters in
relation to table size. Query performance shows a much steeper
decay in the presence of JOINs.

Disadvantages of denormalization

Denormalized schemas aren't storage-optimal, but BigQuery's low
cost of storage addresses concerns about storage inefficiency. You can contrast
costs against gains in query speed to see why storage isn't a
significant factor.

One challenge when you work with a denormalized schema is maintaining
data integrity. Depending on the frequency of change and how widespread it is,
maintaining data integrity can require increased machine time and sometimes
human time for testing and verification.

Partitioning tables

BigQuery supports
partitioning tables by date.
You enable partitioning during the table-creation process. BigQuery
creates new date-based partitions automatically, with no need for additional
maintenance. In addition, you can specify an expiration time for data in the
partitions.

New data that is inserted into a partitioned table is written to the raw
partition at the time of insert. To explicitly control which partition the data
is loaded to, your load job can specify a particular date partition.
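
For example, a load job can target a specific date partition by appending a
partition decorator to the table name. This is a sketch with hypothetical
dataset, table, bucket, and schema file names:

bq load --source_format=CSV \
  'mydataset.sales$20170801' \
  gs://mybucket/sales_2017_08_01.csv \
  ./sales_schema.json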

Loading data

Before data can be loaded into BigQuery for analytical workloads,
it is typically stored in a
Cloud Storage product
and in a format that is native to its origin. During early stages of migration
to Google Cloud, the common pattern is to use existing extract, transform,
and load (ETL) tools to transform data into the ideal schema for
BigQuery. After data is transformed, it is transferred to
Cloud Storage as CSV, JSON, or Avro files, and from there loaded into
BigQuery by using
load jobs
or streaming.
Alternatively, you can transfer files to Cloud Storage in the schema
that is native to the existing on-premises data storage, load them into a set of
staging tables in BigQuery, and then transform them into
the ideal schema for BigQuery by using BigQuery
SQL commands. These two approaches are visualized here:

As you expand your footprint in Google Cloud, you will probably capture
your source data directly in
Bigtable,
Datastore,
or Cloud Spanner
and use Dataflow
to ETL data into BigQuery in batches or streams.

Using load jobs

This section assumes that your data is in Cloud Storage as a collection
of files in a supported file format. For more information about each data
format, as well as specific requirements and features to consider when choosing
a format, see
BigQuery data formats.

In addition to CSV, you can also use data files with delimiters other than
commas by using the --field_delimiter flag. For details, see
bq load flags.

BigQuery supports loading gzip
compressed
files. However, loading compressed files isn't as fast as loading uncompressed
files. For time-sensitive scenarios or scenarios in which transferring
uncompressed files to Cloud Storage is bandwidth- or time-constrained,
conduct a quick loading test to see which alternative works best.

Because load jobs are asynchronous, you don't need to maintain a client
connection while the job is being executed. More importantly, load jobs
don't affect your other BigQuery resources.

A load job creates a destination table if one doesn't already exist.

BigQuery determines the data schema as follows:

If your data is in Avro format, which is self-describing,
BigQuery can determine the schema directly.

You can specify a schema explicitly by passing the schema as an argument to the
load job.
Ongoing load jobs can append to the same table using the same procedure as the
initial load, but do not require the schema to be passed with each job.

If your CSV files always contain a header row that needs to be ignored after
the initial load and table creation, you can use the --skip_leading_rows flag
to ignore the row. For details, see
bq load flags.
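
For example, a pipe-delimited file with a single header row could be loaded with
the --field_delimiter and --skip_leading_rows flags. The dataset, table, bucket,
and schema file names in this sketch are hypothetical:

bq load --source_format=CSV \
  --field_delimiter='|' \
  --skip_leading_rows=1 \
  mydataset.mytable \
  gs://mybucket/data.psv \
  ./schema.json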

BigQuery sets daily limits on the number and size of load jobs
that you can perform per project and per table. In addition,
BigQuery sets limits on the sizes of individual load files and
records. For details, see
Quota policy.

Note: Because tables are set to a hard limit of 1,000 load jobs per day,
micro-batching isn't advised. To achieve efficient high-volume or real-time
loading of data, use streaming inserts in place of micro-batching.

You can launch load jobs through the BigQuery web UI. To automate
the process, you can set up a
Cloud Function
that listens to a
Cloud Storage event
associated with new files arriving in a given bucket and launches the
BigQuery load job.
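
The following is a minimal sketch of such a function, using the
google-cloud-bigquery Python client library; the dataset and table names are
hypothetical, and error handling is omitted:

# main.py: background Cloud Function triggered by a Cloud Storage event.
from google.cloud import bigquery

def load_new_file(event, context):
    """Loads a newly arrived CSV file into a BigQuery staging table."""
    client = bigquery.Client()
    uri = 'gs://{}/{}'.format(event['bucket'], event['name'])
    job_config = bigquery.LoadJobConfig()
    job_config.source_format = bigquery.SourceFormat.CSV
    job_config.skip_leading_rows = 1
    load_job = client.load_table_from_uri(
        uri, 'mydataset.staging_table', job_config=job_config)
    load_job.result()  # Waits for the load job to complete.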

Using streaming inserts

As an alternative and complementary approach, you can also stream data
directly into BigQuery. Streamed data is made available
immediately and can be queried alongside existing table data in real time.

For situations that can benefit from real-time information, such as
fraud detection or monitoring system metrics, streaming can be a
significant differentiator. However, unlike load jobs, which are free in
BigQuery, there is a charge
for streaming data. Therefore, it's important to use streaming in situations
where the benefits outweigh the costs.
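
For example, the following sketch streams a few rows by using the
google-cloud-bigquery Python client library; the dataset, table, and field
names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('mydataset.events')  # Fetches the table and its schema.
rows = [
    {'event_id': 'e-1001', 'event_type': 'login', 'event_ts': '2017-09-01 12:00:00'},
    {'event_id': 'e-1002', 'event_type': 'purchase', 'event_ts': '2017-09-01 12:00:05'},
]
errors = client.insert_rows(table, rows)  # Streaming insert (tabledata.insertAll).
if errors:
    print('Encountered errors while streaming rows: {}'.format(errors))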

Handling change

Many data warehouses operate under strict Service Level Agreements (SLAs),
demanding little to no downtime. While Google handles BigQuery's
uptime, you control the availability and responsiveness of your datasets with
your approach to reflecting change in the data.

All table modifications in BigQuery are
ACID
compliant. This applies to DML operations, queries with destination tables, and
load jobs. A table that goes through inserts, updates, and deletes while serving
user queries handles the concurrency gracefully and transitions from one state
to the next in an atomic fashion. Therefore, modifying a table doesn't require
downtime. However, your internal process might require a testing and validation
phase before making newly refreshed data available for analysis. Also, because
DML operations compete with your analytical workloads for slots, you might prefer
to isolate them. For these reasons, you might introduce downtime. This article
uses the term "analysis downtime" to avoid confusion with BigQuery service
downtime.

You can apply most of the old and proven techniques for handling analysis
downtime. This section expands on some of the known challenges and remedies.

Note: Using BigQuery as an OLTP store is considered an
anti-pattern. Because OLTP stores have a high volume of updates and deletes,
they are a mismatch for the data warehouse use case. Use this
flowchart to help you
decide which storage option fits your use case best.

Sliding time window

A traditional data warehouse, unlike a data lake, retains data only for a fixed
amount of time, for example, the last 5 years. On each update cycle, new data is
added to the warehouse and the oldest data rolls off, keeping the duration
fixed. For the most part, this concept was employed to work around the
limitations of older technologies.

BigQuery is built for scale and can scale out as the size of the
warehouse grows, so there is no need to delete older data. By keeping the entire
history, you can deliver more insight on your business. If the storage cost is a
concern, you can take advantage of
BigQuery's long term storage pricing
by archiving older data and
using it for special analysis when the need arises. If you still have good
reasons for dropping older data, you can use BigQuery's native
support for
date-partitioned tables
and
partition expiration.
In other words, BigQuery can automatically delete older data.

Changing schemas

While a data warehouse is designed and developed, it is typical to tweak table
schemas by adding, updating, or dropping columns or even adding or dropping
whole tables. Unless the change is in the form of an added column or table, it
could break saved queries and reports that reference a deleted table, a renamed
column, and so on.

After the data warehouse is in production, such changes go through strict change
control. You might decide to handle minor schema changes during an analysis
downtime, but for the most part, schema changes are scheduled as
version upgrades. You design, develop, and test the upgrade in parallel while
the previous version of the data warehouse is serving the analysis workloads.
You follow the same approach in applying schema changes to a
BigQuery data warehouse.

Slowly changing dimensions

A normalized data schema minimizes the impact of
Slowly Changing Dimensions (SCD)
by isolating the change in the dimension tables. It is generally preferable to
a denormalized schema, where SCD can cause widespread updates to the flat fact
table. However, as discussed in the schema design section, use normalization
carefully for BigQuery.

When it comes to SCD, there is no one-size-fits-all solution. It is important
to understand the nature of the change and apply the most relevant solution
or combinations of solutions to your problem. The remainder of this section
outlines a few solutions and how to apply them to SCD types.

Note: It is important to address slowly changing dimensions in the context of
ideal schema for BigQuery. Often you must sacrifice efficient SCD
handling in exchange for optimized query performance, or vice versa.

Technique 1: view switching

This technique is based on two views of the data: "main" vs. "shadow". The trick
is to hide the actual table and expose the "main" view to the users. On update
cycles, the "shadow" view is created/updated and goes through data correctness
tests while the users work against the "main" view. At switchover time, the
"main" view is swapped with "shadow." The old "main" and now "shadow" could be
torn down until the next update cycle or kept around for some workflows
depending on the rules and processes defined by the organization.

The two views could be based on a common table and differentiated by a column,
for example, "view_type," or based on distinct tables. The former method is not
recommended, because DML operations against the "shadow" view of the table could
slow down user queries against the "main" view without offering any real
benefits.
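
If the views are based on distinct tables, the switchover can be a single view
update that re-points the "main" view at the newly validated table. This is a
sketch with hypothetical project, dataset, table, and view names:

bq update --use_legacy_sql=false \
  --view='SELECT * FROM `myproject.sales.orders_20170801`' \
  sales.orders_main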

While view switching offers zero analysis downtime, it has a higher cost because
during the update cycle, two copies of the data exist. More importantly, if
update cycles happen at a higher rate than 90 days, this approach could prevent
your organization from taking advantage of
long-term storage pricing.
Ninety days is based on the pricing policy at the time of this writing. Be sure
to check the latest policy.

Sometimes different segments of data change at their own pace. For instance,
sales data in North America is updated on a daily basis, while data for Asia
Pacific is updated on a biweekly basis. In such situations, it is best to
partition the table based on the driving factor for the change, Country in this
example. View switching is then applied to the impacted partitions and not the
entire data warehouse. At the time of this writing, you can only partition based
on a custom data attribute, such as Country, by explicitly splitting the data
into multiple tables.

Technique 2: in-place partition loading

When the change in data can be isolated by a partition and brief analysis
downtime is tolerated, view switching might be overkill. Instead, data for the
affected partitions can be staged in other BigQuery tables or
exported to files in Cloud Storage, where they can be replaced during
analysis downtime.

To replace data in a target partition with data from a query of another table:
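
One way to do this, sketched here with hypothetical dataset, table, and
partition names, is to run a query whose destination is the target partition
(specified with a partition decorator) and overwrite that partition's contents:

bq query --use_legacy_sql=false \
  --destination_table='mydataset.sales$20170801' \
  --replace \
  'SELECT * FROM `mydataset.sales_staging` WHERE sale_date = "2017-08-01"'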

Technique 3: update data masking

A small and frequently changing dimension is a prime candidate for
normalization. In this technique, updates to such a dimension are staged in an
isolated table or view that is conditionally joined with the rest of the data:
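
The following sketch shows the masking join for the product dimension used
later in this section; the dataset and table names are hypothetical, and the
updates table holds only the changed rows:

SELECT
  dim.PRD_SK,
  dim.PRD_ID,
  dim.PRD_DESC,
  COALESCE(upd.PRD_CATEGORY, dim.PRD_CATEGORY) AS PRD_CATEGORY
FROM
  `mydataset.product_dim` AS dim
LEFT JOIN
  `mydataset.product_dim_updates` AS upd
ON
  dim.PRD_ID = upd.PRD_ID;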

SCD Type 1: overwrite attribute value

Type 1 SCD overwrites the value of an attribute with new data without
maintaining the history. For example, if the product "awesome moisturizer cream"
was part of the "health and beauty" category and is now categorized as
"cosmetics", the change looks like this:

Before:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty

After:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY
123    | ABC    | awesome moisturizer cream - 100 oz | cosmetics

If the attribute is in a normalized dimension table, the change is very
isolated. You simply update the impacted row in the dimension table. For
smaller dimension tables with frequent type 1 updates, use
Technique 3: update data masking.

SCD Type 2: change attribute value and maintain history

This method tracks unlimited historical data by creating multiple records
for a given
natural key
with separate
surrogate keys.
For example, the same change that is illustrated in SCD type 1 would be handled
as below:

Before:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | START_DATE  | END_DATE
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | NULL

After:

PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | START_DATE  | END_DATE
123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 31-Jan-2009 | 18-Jul-2017
124    | ABC    | awesome moisturizer cream - 100 oz | cosmetics         | 19-Jul-2017 | NULL

If the attribute is in a normalized dimension table, the change is isolated.
You simply update the previous row and add a new one in the dimension table.
For smaller dimension tables with frequent type 1 updates, use
Technique 3: update data masking.

If the attribute is embedded in the fact table in a denormalized fashion,
the situation can be more favorable, as long as you don't maintain
explicit start and end dates for the value and instead rely on the transaction
dates. Because the previous value remains true for the date and time the
previous transactions occurred, you don't need to change previous fact
table rows. The fact table would look like this:

TRANSACTION_DATE | PRD_SK | PRD_ID | PRD_DESC                           | PRD_CATEGORY      | UNITS | AMOUNT
18-Jul-2017      | 123    | ABC    | awesome moisturizer cream - 100 oz | health and beauty | 2     | 25.16
19-Jul-2017      | 124    | ABC    | awesome moisturizer cream - 100 oz | cosmetics         | 1     | 13.50

Querying data

BigQuery supports standard SQL queries and is compatible with
ANSI SQL 2011. BigQuery's
SQL reference
provides a comprehensive description of all functions, operators, and regex
capabilities that are supported.

Note: Prior to supporting standard SQL, BigQuery supported an
alternate SQL version that is now referred to as legacy SQL. We recommend
using standard SQL in your queries. For more information, see
Enabling standard SQL.

Because BigQuery supports nested and repeated fields as part of
the data model, its SQL support has been extended to specifically support these
field types.
For example, using the
GitHub public dataset,
you could use the
UNNEST
operator, which lets you iterate over a repeated field:

SELECT
  name, COUNT(1) AS num_repos
FROM
  `bigquery-public-data.github_repos.languages`, UNNEST(language)
GROUP BY name
ORDER BY num_repos DESC
LIMIT 10

Interactive queries

The BigQuery web UI
allows interactive querying of datasets and provides a consolidated view of
datasets across projects that you have access to. The console also provides
several useful features such as saving and sharing ad-hoc queries, tuning and
editing historical queries, exploring tables and schemas, and gathering table
metadata. Refer to the
BigQuery web UI
for more details.

Automated queries

It is a common practice to automate execution of queries based on a
schedule/event and cache the results for later consumption.

If you are using Airflow to orchestrate other automated activities and are
already familiar with the tool, use the
Apache Airflow API for BigQuery
for this purpose.
This blog post
walks you through the process of installing Airflow and creating a workflow
against BigQuery.
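
As an illustration only, a single Airflow task that runs a BigQuery
query and writes the result to a summary table might look like the following
sketch; the operator's parameter names can vary between Airflow versions, and
the SQL, dataset, and table names are hypothetical:

from airflow.contrib.operators.bigquery_operator import BigQueryOperator

aggregate_sales = BigQueryOperator(
    task_id='aggregate_daily_sales',
    sql='SELECT sale_date, SUM(amount) AS total_amount '
        'FROM `mydataset.sales` GROUP BY sale_date',
    destination_dataset_table='mydataset.daily_sales_summary',
    write_disposition='WRITE_TRUNCATE',
    use_legacy_sql=False,
    dag=dag)  # Assumes a DAG object named `dag` is defined elsewhere.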

For simpler orchestrations, you can rely on cron jobs.
This blog post
shows you how to encapsulate a query as an App Engine app and run it
as a scheduled cron job.

Query optimization

Each time BigQuery executes a query, it performs a full-column
scan. BigQuery doesn't use or support indexes. Because
BigQuery performance and query costs are based on the amount of
data scanned during a query, design your queries so that they reference only the
columns that are relevant to the query. When using date-partitioned tables,
ensure that only the relevant partitions are scanned. You can achieve this by
using partition
filters based on _PARTITIONTIME or _PARTITIONDATE.
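
For example, the following sketch restricts the scan to a single day's partition
of a hypothetical date-partitioned table:

SELECT order_id, amount
FROM `mydataset.sales`
WHERE _PARTITIONTIME = TIMESTAMP('2017-08-01');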

To understand the performance characteristics after a query executes, take a
look at the detailed
query plan explanation.
The explanation breaks down the stages that the query went through,
the number of input/output rows handled at each stage, and the timing
profile within each stage. Using the results from the explanation can help you
understand and optimize your queries.

External sources

You can run queries on data that exists outside of BigQuery by
using
federated data sources,
but this approach has performance implications. Use federated data sources
only if the data must be maintained externally. You can also use query
federation to perform ETL from an external source to BigQuery. This approach
allows you to define ETL using familiar SQL syntax.

User-defined functions

BigQuery also supports
user-defined functions
(UDFs) for queries that exceed the complexity of SQL. UDFs allow you to extend
the built-in SQL functions; they take a list of values, which can be arrays
or structs, and return a single value, which can also be an array or struct.
UDFs are written in JavaScript and can include external resources, such as
encryption or other libraries.
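
For example, the following sketch defines a temporary JavaScript UDF and applies
it in a query; the function name, logic, and table are hypothetical:

CREATE TEMPORARY FUNCTION normalizeSku(sku STRING)
RETURNS STRING
LANGUAGE js AS """
  // Trim whitespace and uppercase the SKU.
  return sku === null ? null : sku.trim().toUpperCase();
""";

SELECT normalizeSku(PRD_ID) AS prd_id, COUNT(1) AS num_rows
FROM `mydataset.product_dim`
GROUP BY prd_id;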

Query sharing

BigQuery allows collaborators to save and share queries between
team members. This feature can be especially useful in data exploration
exercises or as a means of coming up to speed on a new dataset or query pattern.
For more information, see
Saving and sharing queries.

Analyzing data

This section presents various ways that you can connect to
BigQuery and analyze the data. To take full advantage of
BigQuery as an analytical engine, you should store the data in
BigQuery storage. However, your specific use case might
benefit from analyzing external sources either by themselves or JOINed with
data in BigQuery storage.

Off-the-shelf tools

Google Data Studio,
available in beta at the time of this writing, as well as many
partner tools
that are already integrated with BigQuery, can be used to draw
analytics from BigQuery and build sophisticated interactive data
visualizations.

Custom development

To build custom applications and platforms on top of BigQuery,
you can use
client libraries,
which are available for most common programming languages, or you can use
BigQuery's REST API
directly.
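
For example, the following sketch runs a query by using the Python client
library against a public dataset and prints the results; it assumes that
authentication and a default project are already configured:

from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 10
"""
for row in client.query(query):  # Runs the query and iterates over the result rows.
    print('{}: {}'.format(row.name, row.total))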

For a concrete example, refer to
this tutorial,
which uses Python libraries to connect to BigQuery and generate
custom interactive dashboards.

Note: All the methods for connecting to BigQuery essentially
provide a wrapper around BigQuery's REST API. All connections to
the BigQuery API are encrypted by using HTTPS, and permissions are
enforced by using
IAM policies.

Third-party connectors

To connect to BigQuery from an application that isn't natively
integrated with BigQuery at the API level, you can use the
BigQuery JDBC and ODBC drivers.
The drivers provide a bridge to interact with BigQuery for legacy
applications or applications that cannot be easily modified, such as
Microsoft Excel.
Although ODBC and JDBC support interacting with BigQuery using
SQL, the drivers aren't as expressive as dealing with the API directly.

Costs

There are three primary cost dimensions for BigQuery: loading,
storage, and query costs. This section discusses each dimension in detail.

Storing data

Storage pricing is prorated per MB, per second.

If a table hasn't been edited for 90 consecutive days, it is categorized as
long-term storage and the price of storage for that table automatically drops by
50 percent to $0.01 per GB per month. There is no degradation of performance,
durability, availability, or any other functionality when a table is considered
long-term storage. When the data in a table is modified, BigQuery
resets the timer on the table, and any data in the table returns to the normal
storage price. Actions that don't directly manipulate the data, such as querying
and creating views, don't reset the timer.

Loading data

You can load data into BigQuery by using a conventional load job,
at no charge. After data is loaded, you pay for the storage as discussed above.

Streaming inserts are charged based on the amount of data that is being
streamed. For details, see costs of streaming inserts listed under
BigQuery storage pricing.

Querying data

For queries, BigQuery offers two pricing models: on-demand and
flat-rate.

Note: In a multi-project situation where data is hosted in one project and made
available for queries to users of other projects, the cost of storage and
streaming is incurred in the hosting project, but the cost of queries is
incurred in the project where the query is issued from.

On-demand pricing

In the on-demand model, BigQuery charges for the amount of data
accessed during query execution. Because BigQuery uses a
columnar storage format, only the columns relevant to your query are accessed.
If you only run reports on a weekly or monthly basis, and you've performed queries
on less than 1 TB of your data, you might find the cost of queries on your bill
is very low. For more details on how queries are charged, see
BigQuery query pricing.

To determine how much data a query will scan before you run it, you
can use the query validator in the web UI. In the case of
custom development, you can set the dryRun flag in the API request so that
BigQuery doesn't run the job and instead returns statistics about
the job, such as how many bytes would be processed. Refer to the
query API
for more details.
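
For example, the following sketch performs a dry run by using the Python client
library; the query and table names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig()
job_config.dry_run = True           # Validate and estimate, but don't run the query.
job_config.use_query_cache = False
query_job = client.query(
    'SELECT order_id, amount FROM `mydataset.sales`', job_config=job_config)
print('This query would process {} bytes.'.format(query_job.total_bytes_processed))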

Flat-rate pricing

Customers who prefer more consistency of monthly expenses can choose to
enable flat-rate pricing. To learn more, see
BigQuery flat-rate pricing.