Use the following tips to decide whether to partition and/or to configure bucketing,
and to select columns in your CTAS queries by which to do so:

Partitioning CTAS query results works well when the number of
partitions you plan to have is limited. In Athena, this number is up to 100
partitions. To choose the columns by which to partition your CTAS query results use
the following tips:

When you run a CTAS query, Athena writes the results to a specified location
in Amazon S3. If you specify partitions, it creates them and stores each partition
in a separate partition folder in the same location. The maximum number of
partitions you can configure with CTAS query results is 100.

Partition data by those columns that have similar characteristics, such as
records from the same department, and that can have a limited number of
possible values, such as a limited number of distinct departments in an
organization. This characteristic is known as data cardinality. For example, if
you partition by the column department, and this column has a
limited number of distinct values, partitioning by department
works well and decreases query latency.

Having partitions in Amazon S3 helps with Athena query performance, because
this helps you run targeted queries for only specific partitions. Athena then
scans only those partitions, saving you query costs and query time. For
information about partitioning syntax, search for partition_by
in CREATE TABLE AS.

Bucketing CTAS query results works well when you bucket
data by the column that has high cardinality and evenly distributed values.

For example, columns storing timestamp data could potentially
have a very large number of distinct values, and their data is evenly
distributed across the data set. This means that a column storing
timestamp type data will most likely have values and won't have
nulls. This also means that data from such a column can be put in many buckets,
where each bucket will have roughly the same amount of data stored in Amazon S3.

You can specify any number of buckets for your CTAS query results, using one
or more columns as bucket names. To choose the column by which to bucket the
CTAS query results, use the column that has a high number of values (high
cardinality) and whose data can be split for storage into many buckets that
will have roughly the same amount of data. For example, columns that are
sparsely populated with values are not good candidates for bucketing. This is
because you will end up with buckets that have less data and other buckets that
have a lot of data. By comparison, columns that you predict will almost always
have values, such as timestamp type values, are good candidates
for bucketing. This is because their data has high cardinality and can be
stored in roughly equal chunks. For more information about bucketing syntax,
search for bucketed_by in CREATE
TABLE AS.

To conclude, you can partition and use bucketing for storing results of the same CTAS
query. These techniques for writing data do not exclude each other. Typically, the
columns
you use for bucketing differ from those you use for partitioning.

For example, if your dataset has columns department,
sales_quarter, and ts (for storing timestamp type
data), you can partition your CTAS query results by department and
sales_quarter, because these columns have relatively low cardinality of
values; for example, a limited number of departments and sales quarters. Also, for
partitions, it does not matter if some records in your dataset have null or no values
assigned for these columns. What matters is that data with the same characteristics,
such
as data from the same department, will be in one partition that you can query in Athena.

At the same time, because all of your data has timestamp type values stored
in a ts column, you can configure bucketing for the same query results by the
column ts. This column has high cardinality and the data from it can be
distributed and stored in many buckets in Amazon S3. Consider an opposite scenario:
if you don't
create buckets for timestamp type data and run a query for particular date or time
values,
then you would have to scan a very large amount of data stored in a single location
in
Amazon S3. Instead, if you configure buckets for storing your date- and time-related
results,
you can only scan and query buckets that have your value and avoid long-running queries
that scan a large amount of data.

Javascript is disabled or is unavailable in your browser.

To use the AWS Documentation, Javascript must be enabled. Please refer to your browser's
Help pages for instructions.