Introduction to designing tables in Azure SQL Data Warehouse

01/18/2018

Learn key concepts for designing tables in Azure SQL Data Warehouse.

Determining table category

A star schema organizes data into fact and dimension tables. Some tables are used for integration or staging data before it moves to a fact or dimension table. As you design a table, decide whether the table data belongs in a fact, dimension, or integration table. This decision informs the appropriate table structure and distribution.

Fact tables contain quantitative data that are commonly generated in a transactional system, and then loaded into the data warehouse. For example, a retail business generates sales transactions every day, and then loads the data into a data warehouse fact table for analysis.

Dimension tables contain attribute data that might change but usually changes infrequently. For example, a customer's name and address are stored in a dimension table and updated only when the customer's profile changes. To minimize the size of a large fact table, the customer's name and address do not need to be in every row of a fact table. Instead, the fact table and the dimension table can share a customer ID. A query can join the two tables to associate a customer's profile and transactions.
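As a minimal sketch of such a join, assume two hypothetical tables, wwi.FactSale and wwi.DimCustomer, that share a CustomerKey column. The table and column names are illustrative only and are not taken from the sample database.

```sql
-- Hypothetical tables: wwi.FactSale holds transactions,
-- wwi.DimCustomer holds one row per customer profile.
SELECT c.CustomerName,
       c.CustomerAddress,
       SUM(f.SaleAmount) AS TotalSales
FROM wwi.FactSale AS f
JOIN wwi.DimCustomer AS c
    ON f.CustomerKey = c.CustomerKey   -- the shared customer ID
GROUP BY c.CustomerName,
         c.CustomerAddress;
```

The name and address live only in the dimension table, so the fact table rows stay narrow.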

Integration tables provide a place for integrating or staging data. You can create an integration table as a regular table, an external table, or a temporary table. For example, you can load data to a staging table, perform transformations on the data in staging, and then insert the data into a production table.

Schema and table names

In SQL Data Warehouse, a data warehouse is a type of database. All of the tables in the data warehouse are contained within the same database. You cannot join tables across multiple data warehouses. This behavior is different from SQL Server, which supports cross-database joins.

In a SQL Server database, you might use fact, dim, or integrate for the schema names. If you are migrating a SQL Server database to SQL Data Warehouse, it works best to migrate all of the fact, dimension, and integration tables to one schema in SQL Data Warehouse. For example, you could store all the tables in the WideWorldImportersDW sample data warehouse within one schema called wwi. The following code creates a user-defined schema called wwi.

CREATE SCHEMA wwi;

To show the organization of the tables in SQL Data Warehouse, you could use fact, dim, and int as prefixes to the table names. The following table shows some of the schema and table names for WideWorldImportersDW. It compares the names in SQL Server with names in SQL Data Warehouse.

Table persistence

Tables store data either permanently in Azure Storage, temporarily in Azure Storage, or in a data store external to the data warehouse.

Regular table

A regular table stores data in Azure Storage as part of the data warehouse. The table and the data persist regardless of whether a session is open. This example creates a regular table with two columns.

CREATE TABLE MyTable (col1 int, col2 int );

Temporary table

A temporary table exists only for the duration of the session. You can use a temporary table to prevent other users from seeing temporary results and also to reduce the need for cleanup. Since temporary tables also utilize local storage, they can offer faster performance for some operations. For more information, see Temporary tables.
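For illustration, the following sketch creates a temporary table with two columns. The table name is hypothetical, and the distribution and index options are one possible choice rather than a recommendation.

```sql
-- The # prefix marks the table as temporary.
-- It is visible only to the current session and is dropped when the session ends.
CREATE TABLE #MyTempTable
(
    col1 int,
    col2 int
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);
```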

External table

An external table points to data located in Azure Storage blob or Azure Data Lake Store. When used in conjunction with the CREATE TABLE AS SELECT statement, selecting from an external table imports data into SQL Data Warehouse. External tables are therefore useful for loading data. For a loading tutorial, see Use PolyBase to load data from Azure blob storage.
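The load pattern described above might look like the following sketch. It assumes that an external data source and an external file format already exist. The names MyAzureStorage, MyTextFormat, wwi.ExtSale, and wwi.IntSale, the columns, and the folder path are all hypothetical.

```sql
-- Assumes an external data source (MyAzureStorage) and file format (MyTextFormat)
-- were created beforehand. Both names are hypothetical.
CREATE EXTERNAL TABLE wwi.ExtSale
(
    SaleKey     bigint        NOT NULL,
    CustomerKey int           NOT NULL,
    SaleAmount  decimal(18,2) NOT NULL
)
WITH
(
    LOCATION    = '/sale/',        -- hypothetical folder within the data source
    DATA_SOURCE = MyAzureStorage,
    FILE_FORMAT = MyTextFormat
);

-- Selecting from the external table with CTAS imports the data
-- into a table stored in the data warehouse.
CREATE TABLE wwi.IntSale
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
)
AS
SELECT SaleKey,
       CustomerKey,
       SaleAmount
FROM wwi.ExtSale;
```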

Data types

SQL Data Warehouse supports the most commonly used data types. For a list of the supported data types, see the data types section of the CREATE TABLE statement reference. Minimizing the size of data types helps to improve query performance. For guidance on using data types, see Data types.

Distributed tables

A fundamental feature of SQL Data Warehouse is the way it can store and operate on tables across 60 distributions. The tables are distributed using a round-robin, hash, or replication method.

Hash-distributed tables

The hash distribution distributes rows based on the value in the distribution column. The hash-distributed table is designed to achieve high performance for query joins on large tables. There are several factors that affect the choice of the distribution column.
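A hash-distributed table could be declared as in the following sketch. The table, its columns, and the choice of CustomerKey as the distribution column are hypothetical. The right distribution column depends on the factors mentioned above.

```sql
-- Rows are assigned to distributions by hashing the value of CustomerKey.
CREATE TABLE wwi.FactSale
(
    SaleKey      bigint        NOT NULL,
    CustomerKey  int           NOT NULL,
    OrderDateKey int           NOT NULL,
    SaleAmount   decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
);
```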

Replicated tables

A replicated table has a full copy of the table available on every Compute node. Queries run fast on replicated tables since joins on replicated tables do not require data movement. Replication requires extra storage, though, and is not practical for large tables.
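As a brief sketch, a small dimension table could be declared as replicated like this. The table and column names are hypothetical, and HEAP is just one possible storage option.

```sql
-- A full copy of this table is made available on every Compute node.
CREATE TABLE wwi.DimCustomer
(
    CustomerKey     int           NOT NULL,
    CustomerName    nvarchar(100) NOT NULL,
    CustomerAddress nvarchar(200) NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,
    HEAP
);
```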

Round-robin tables

A round-robin table distributes table rows evenly across all distributions. The rows are distributed randomly. Loading data into a round-robin table is fast. However, queries can require more data movement than the other distribution methods.

Common distribution methods for tables

The table category often determines which option to choose for distributing the table.

| Table category | Recommended distribution option |
|---|---|
| Fact | Use hash-distribution with a clustered columnstore index. Performance improves when two hash tables are joined on the same distribution column. |
| Dimension | Use replicated for smaller tables. If tables are too large to store on each Compute node, use hash-distributed. |
| Staging | Use round-robin for the staging table. The load with CTAS is fast. Once the data is in the staging table, use INSERT...SELECT to move the data to production tables. |

Table partitions

A partitioned table stores and performs operations on the table rows according to data ranges. For example, a table could be partitioned by day, month, or year. You can improve query performance through partition elimination, which limits a query scan to data within a partition. You can also maintain the data through partition switching. Since the data in SQL Data Warehouse is already distributed, too many partitions can slow query performance. For more information, see Partitioning guidance.
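The following sketch shows one way to partition a table by year on an integer date key. The table name, columns, and boundary values are hypothetical.

```sql
-- Three partitions: before 2017, 2017, and 2018 onward.
-- RANGE RIGHT places each boundary value in the partition to its right.
CREATE TABLE wwi.FactSalePartitioned
(
    SaleKey      bigint        NOT NULL,
    CustomerKey  int           NOT NULL,
    OrderDateKey int           NOT NULL,
    SaleAmount   decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20170101, 20180101))
);

-- A filter on the partitioning column lets the query skip partitions
-- that cannot contain matching rows (partition elimination).
SELECT SUM(SaleAmount) AS TotalSales2018
FROM wwi.FactSalePartitioned
WHERE OrderDateKey >= 20180101;
```

Keeping the number of boundary values small is consistent with the caution above about too many partitions.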

Columnstore indexes

By default, SQL Data Warehouse stores a table as a clustered columnstore index. This form of data storage achieves high data compression and query performance on large tables. The clustered columnstore index is usually the best choice, but in some cases a clustered index or a heap is the appropriate storage structure.
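To show how the alternatives are declared, this sketch creates one table as a heap and one with a clustered index. The table and column names are hypothetical, and the sketch does not suggest which structure suits a given workload.

```sql
-- Heap: rows are stored without an index.
CREATE TABLE wwi.IntSaleHeap
(
    SaleKey    bigint        NOT NULL,
    SaleAmount decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    HEAP
);

-- Clustered index: rows are stored in order of the index column.
CREATE TABLE wwi.DimCityLookup
(
    CityKey  int          NOT NULL,
    CityName nvarchar(50) NOT NULL
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED INDEX (CityKey)
);
```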

Statistics

The query optimizer uses column-level statistics when it creates the plan for executing a query. To improve query performance, it's important to create statistics on individual columns, especially columns used in query joins. Creating and updating statistics does not happen automatically. Create statistics after creating a table. Update statistics after a significant number of rows are added or changed. For example, update statistics after a load. For more information, see Statistics guidance.
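A minimal sketch of both steps follows. It assumes a hypothetical table wwi.FactSale with a CustomerKey column that is used in joins. The statistics name is also illustrative.

```sql
-- Create single-column statistics on a join column after creating the table.
CREATE STATISTICS stats_FactSale_CustomerKey
ON wwi.FactSale (CustomerKey);

-- After a load, refresh all statistics on the table.
UPDATE STATISTICS wwi.FactSale;
```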

Commands for creating tables

You can create a table as a new empty table. You can also create and populate a table with the results of a select statement. The following are the T-SQL commands for creating a table.

CREATE TABLE AS SELECT: Populates a new table with the results of a select statement. The table columns and data types are based on the select statement results. To import data, this statement can select from an external table.

CREATE EXTERNAL TABLE AS SELECT: Creates a new external table by exporting the results of a select statement to an external location. The location is either Azure Blob storage or Azure Data Lake Store.
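As a sketch of the export case, the following CREATE EXTERNAL TABLE AS SELECT statement writes query results to an external location. It assumes that an external data source and file format already exist. MyAzureStorage, MyTextFormat, the table names, and the folder path are hypothetical.

```sql
-- Exports the query results to files under the given location
-- and creates an external table that points to them.
CREATE EXTERNAL TABLE wwi.ExtSaleExport
WITH
(
    LOCATION    = '/export/sale/',  -- hypothetical target folder
    DATA_SOURCE = MyAzureStorage,   -- assumed to exist
    FILE_FORMAT = MyTextFormat      -- assumed to exist
)
AS
SELECT SaleKey,
       CustomerKey,
       SaleAmount
FROM wwi.FactSale;
```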

Aligning source data with the data warehouse

Data warehouse tables are populated by loading data from another data source. To perform a successful load, the number and data types of the columns in the source data must align with the table definition in the data warehouse. Getting the data to align might be the hardest part of designing your tables.

If data is coming from multiple data stores, you can bring the data into the data warehouse and store it in an integration table. Once data is in the integration table, you can use the power of SQL Data Warehouse to perform transformation operations. Once the data is prepared, you can insert it into production tables.
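The final step, moving prepared data from an integration table into a production table, might look like this sketch. The tables wwi.IntSale and wwi.FactSale, their columns, and the specific transformations are hypothetical.

```sql
-- Transform rows from the integration table while inserting them
-- into the production fact table.
INSERT INTO wwi.FactSale (SaleKey, CustomerKey, OrderDateKey, SaleAmount)
SELECT s.SaleKey,
       s.CustomerKey,
       -- Style 112 formats the date as yyyymmdd, which is then cast to an integer key.
       CAST(CONVERT(char(8), s.OrderDate, 112) AS int) AS OrderDateKey,
       ROUND(s.SaleAmount, 2) AS SaleAmount
FROM wwi.IntSale AS s
WHERE s.SaleAmount IS NOT NULL;   -- skip rows that failed to parse at the source
```

Casting and filtering in the SELECT list is one way to make the column count and data types line up with the target table definition.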

Unsupported table features

SQL Data Warehouse supports many, but not all, of the table features offered by other databases. The following list shows some of the table features that are not supported in SQL Data Warehouse.

Table space summary

This query returns the rows and space by table. It allows you to see which tables are your largest tables and whether they are round-robin, replicated, or hash-distributed. For hash-distributed tables, the query shows the distribution column.

Distribution space summary

SELECT
distribution_id
, SUM(row_count) as total_node_distribution_row_count
, SUM(reserved_space_MB) as total_node_distribution_reserved_space_MB
, SUM(data_space_MB) as total_node_distribution_data_space_MB
, SUM(index_space_MB) as total_node_distribution_index_space_MB
, SUM(unused_space_MB) as total_node_distribution_unused_space_MB
FROM dbo.vTableSizes
GROUP BY distribution_id
ORDER BY distribution_id
;