CLUSTERING FACTOR DEMYSTIFIED : PART – III

How to resolve the performance issues due to high clustering factor?

In my earlier post, Clustering Factor Demystified : Part – I, I discussed that to improve the clustering factor, the table must be rebuilt (and reordered). Data retrieval can be sped up considerably by physically sequencing the rows in the same order as the key column. If we can group together the rows for a key value, we can get all of those rows with a single block read because the rows are together. Various methods may be used to achieve this goal. In the post Clustering Factor Demystified : Part – II, I demonstrated manual row re-sequencing (CTAS with ORDER BY), which pre-orders data to avoid expensive disk sorts after retrieval. In this post, I will demonstrate the use of single table hash clusters and single table index clusters, which cluster related rows together onto the same data block.

Overview:

Create a table organized which contains two columns: id (number) and txt (char). Populate the table with 34 records for each value of id, where id ranges from 1 to 100. Because the records are added sequentially, the records for a key value are stored together.

Create another table, unorganized, which is a replica of the ‘organized’ table but with records inserted in random order, so that records for a key value may be scattered across different blocks.

Create a single table index cluster table from the ‘unorganized’ table using CTAS.

Create a single table hash cluster table from the ‘unorganized’ table using CTAS.

Trace a query using an exact match on the three tables and verify that the hash cluster table gives the best performance.

Trace a query using a range scan on the three tables and verify that the index cluster table gives the best performance.

- Create a table organized which contains two columns: id (number) and txt (char).
– Populate the table with 34 records for each value of id, where id ranges from 1 to 100.
– Because the records are added sequentially, the records for a key value are stored together.
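A minimal sketch of this step. The column sizes and the index name organized_idx are assumptions, not taken from the original listing; CHAR(1000) is chosen only because 34 such rows would occupy roughly 5 blocks of 8K, which matches the figures quoted later in this post.

```sql
-- Assumption: txt is CHAR(1000), so 34 rows for one id need about 5 blocks of 8K.
CREATE TABLE organized (id NUMBER(3), txt CHAR(1000));

-- 100 ids x 34 rows = 3400 rows, inserted in id order.
INSERT INTO organized (id, txt)
SELECT i.id, 'x'
FROM   (SELECT LEVEL AS id FROM dual CONNECT BY LEVEL <= 100) i
       CROSS JOIN
       (SELECT LEVEL AS n FROM dual CONNECT BY LEVEL <= 34) r
ORDER  BY i.id;

COMMIT;

-- Hypothetical index name, used to look at the clustering factor later.
CREATE INDEX organized_idx ON organized (id);
```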

- Create another table, unorganized, which is a replica of the ‘organized’ table but with records inserted in random order, so that records for a key value may be scattered across different blocks (order by dbms_random.random).
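The step above, together with the two clustered tables that the overview builds from unorganized, might look like the sketch below. The cluster names index_cluster and hash_cluster, the SIZE and HASHKEYS values, and the NUMBER(3) key type are assumptions; only the table and index names come from this post. The cluster key type must match the type of the id column produced by the CTAS query.

```sql
-- Same rows as organized, written in random order.
CREATE TABLE unorganized AS
SELECT * FROM organized ORDER BY dbms_random.random;

CREATE INDEX unorganized_idx ON unorganized (id);

-- Index cluster holding a single table: rows with the same id share blocks.
-- SIZE 8192 (one 8K block per key) is an assumed starting value.
CREATE CLUSTER index_cluster (id NUMBER(3)) SIZE 8192;
CREATE INDEX index_cluster_idx ON CLUSTER index_cluster;

CREATE TABLE index_cluster_tab CLUSTER index_cluster (id)
AS SELECT * FROM unorganized;

-- Single table hash cluster: the target block is found by hashing id.
CREATE CLUSTER hash_cluster (id NUMBER(3))
  SIZE 8192 SINGLE TABLE HASHKEYS 100;

CREATE TABLE hash_cluster_tab CLUSTER hash_cluster (id)
AS SELECT * FROM unorganized;

-- Ordinary index on the hash clustered table, used only to compare clustering factors.
CREATE INDEX hash_cluster_idx ON hash_cluster_tab (id);
```

The cluster index must exist before rows can be inserted into the index cluster, which is why it is created ahead of the CTAS.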

- Find out the number of blocks across which the records for a key value are spread in each table.
– Note that in the ‘unorganized’ table, records for an id are scattered across more than 30 blocks, whereas in the index_cluster_tab and hash_cluster_tab tables, records for each id are clustered, i.e. records for each key value are spread across only 5 blocks.
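One way to count the blocks per key value is to decode each rowid with the DBMS_ROWID package. This is a sketch rather than the original query; run it once per table by changing the table name.

```sql
-- Number of distinct (file, block) pairs that hold rows for each id.
SELECT id,
       COUNT(DISTINCT dbms_rowid.rowid_relative_fno(rowid) || '.' ||
                      dbms_rowid.rowid_block_number(rowid)) AS blocks
FROM   unorganized          -- repeat for index_cluster_tab and hash_cluster_tab
GROUP  BY id
ORDER  BY id;
```

The file number is included because block numbers are only unique within a datafile.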

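-- assumes the indexes belong to the current schema (user_indexes)
select index_name, clustering_factor
from user_indexes
where index_name in ('UNORGANIZED_IDX', 'INDEX_CLUSTER_IDX', 'HASH_CLUSTER_IDX');

INDEX_NAME                     CLUSTERING_FACTOR
------------------------------ -----------------
HASH_CLUSTER_IDX                             500
INDEX_CLUSTER_IDX                            100
UNORGANIZED_IDX                             3311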

Note that:

- The clustering factor of the index on the unorganized table (3311) approaches the number of rows in the table (3400).

- The clustering factor of the index on hash_cluster_tab is 500. As the entries for each id are spread across 5 blocks, 500 blocks need to be accessed to get all the rows, and the index is aware of this.

- The clustering factor of the index on index_cluster_tab is 100, as there are 100 entries (one for each id) in the index. Here too, 500 table blocks need to be accessed to get all the rows, but the index holds information about only the first (or maybe the last) data block for an id. The remaining 4 blocks containing records for that id are chained to it, and the index has no information about them. Since the clustering factor of an index is computed from the information available in the index, in this case it equals the number of index entries.

SUMMARY:

Clustered tables cannot be truncated individually; TRUNCATE TABLE is rejected for a table that is part of a cluster.
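As an illustration of this restriction, using the table names from this post and assumed cluster names (index_cluster, hash_cluster):

```sql
-- Rejected with ORA-03292, because the table is part of a cluster.
TRUNCATE TABLE index_cluster_tab;

-- An index cluster can be truncated as a whole instead.
TRUNCATE CLUSTER index_cluster;

-- A hash cluster cannot be truncated; delete the rows
-- (or drop and re-create the table) instead.
DELETE FROM hash_cluster_tab;
COMMIT;
```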

Choosing the key: The correct cluster key depends on the most common types of queries issued against the clustered tables. The cluster key should be the column against which queries are most commonly issued.

HASH CLUSTERS

A hash cluster stores related rows together in the same data blocks. Rows in a hash cluster are stored together based on their hash value.

Hash clusters are a great way to reduce I/O on some tables, but they have their downside. When creating a hash cluster, it is important to choose the cluster key correctly and set the HASH IS, SIZE, and HASHKEYS parameters so that performance and space use are optimal.

* If too little space is reserved for each key (small SIZE value), or if the cluster is created with too few hash keys (small HASHKEYS), then each key will be split across multiple blocks, negating the benefits of the cluster.

* If too much space is reserved for each key (large SIZE value), or if the cluster is created with too many hash keys (large HASHKEYS), then the cluster may contain thousands of empty blocks that slow down full table scans. A SIZE value much larger than needed results in wasted space.
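To make the sizing trade-off concrete, here is a sketch for a hypothetical lookup table with about 100 distinct ids and one row of roughly 100 bytes per id. All names (lookup_hash_cluster, lookup_tab, descr) and numbers are assumptions chosen for illustration.

```sql
-- SIZE     = bytes reserved per key (rows per key x average row length, plus headroom)
-- HASHKEYS = expected number of distinct key values (Oracle rounds it up to a prime)
-- HASH IS  = use the integer key itself as the hash value
CREATE CLUSTER lookup_hash_cluster (id NUMBER(3))
  SIZE 128
  SINGLE TABLE
  HASHKEYS 100
  HASH IS id;

CREATE TABLE lookup_tab (id NUMBER(3), descr VARCHAR2(100))
  CLUSTER lookup_hash_cluster (id);

-- Check what was actually recorded for the cluster.
SELECT cluster_name, cluster_type, key_size, hashkeys, single_table
FROM   user_clusters
WHERE  cluster_name = 'LOOKUP_HASH_CLUSTER';
```

By contrast, with 34 rows per id, a single block per key is too small if each row is large, which is consistent with the 5 blocks per id reported for hash_cluster_tab.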

Hash clusters reduce contention and I/O since no index is accessed. When you use an index range scan followed by table access by index rowid, the root index block becomes a “hot block”, causing contention for the cache buffers chains (CBC) latch and hence an increase in CPU usage.

A properly sized hash cluster for a lookup table gives close to a single I/O for a keyed lookup.

Hash clusters should only really be used for tables which are static in size so that you can determine the number of rows and amount of space required for the tables in the cluster. If tables in a hash cluster require more space than the initial allocation for the cluster, performance degradation can be substantial because overflow blocks are required.

Hash clusters should only really be used for tables which have mostly read-only data. Inserting into a hash cluster takes marginally longer, since each row now has a “place” to go and maintaining this structure takes longer than maintaining a heap table. Updates do not add much overhead unless the hash key is being updated.

Hash clusters should not be used in applications where most queries on the table retrieve rows over a range of cluster key values. For such queries, a hash function cannot be used to determine the location of specific hash keys; instead, the equivalent of a full table scan must be done to fetch the rows.
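The difference can be seen by comparing execution plans for an exact match and a range predicate. This is a sketch using hash_cluster_tab from this post; the literal values are arbitrary.

```sql
-- Exact match: the optimizer can hash the key and go straight to the block
-- (typically shown as TABLE ACCESS HASH in the plan).
EXPLAIN PLAN FOR
SELECT * FROM hash_cluster_tab WHERE id = 50;

SELECT * FROM TABLE(dbms_xplan.display);

-- Range predicate: the hash function cannot locate a range of keys.
EXPLAIN PLAN FOR
SELECT * FROM hash_cluster_tab WHERE id BETWEEN 40 AND 50;

SELECT * FROM TABLE(dbms_xplan.display);
```

If an ordinary index such as hash_cluster_idx exists on id, the optimizer may use it for the range query, so drop or ignore that index when comparing the two plans.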

Hash clusters should not be used in applications where the hash key is updated. The hash values cannot be recalculated, and serious overflow can result.

Hash clusters should not be used for tables which are not static but are continually growing. If a table grows without limit, the space required over the life of the table (its cluster) cannot be predetermined.

Hash clusters should not be used when you cannot afford to pre-allocate the space that the hash cluster will eventually need.

Hash clusters allocate all the storage for all the hash buckets when the cluster is created, so they may waste space.

Full scans on single table hash clusters will cost as much as they would in a heap table.

INDEX CLUSTERS

In an indexed cluster, Oracle stores together rows having the same cluster key value. Each distinct cluster key value is stored only once in each data block, regardless of the number of tables and rows in which it occurs. This saves disk space and improves performance for many operations.

Index clusters should be used for applications where most queries on the table retrieve rows over a range of cluster key values, for example in full table scans or queries such as the following:

SELECT . . . WHERE cluster_key < . . . ;

With an index, key values are ordered in the index, so cluster key values that satisfy the WHERE clause of a query can be found with relatively few I/Os.

Index clusters should be used for tables which are not static but are continually growing, so that the space required over the life of the table (its cluster) cannot be predetermined.

Index clusters should be used for applications which frequently perform full-table scans on the table and the table is sparsely populated. A full-table scan in this situation takes longer under hashing.

A cluster index has one entry per cluster key, not one per row. Therefore, the index is smaller and less costly to access for finding multiple rows.
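To compare the cluster index with an ordinary index on the same data, the dictionary statistics can be queried. This is a sketch using the index names from this post; the cluster name index_cluster is an assumption, and the ANALYZE step is there because the cluster index is created before the cluster is populated.

```sql
-- Collect statistics for the cluster, its table and its cluster index.
ANALYZE CLUSTER index_cluster COMPUTE STATISTICS;

-- Compare entry counts: one entry per key versus one entry per row.
SELECT index_name, index_type, num_rows, distinct_keys, clustering_factor
FROM   user_indexes
WHERE  index_name IN ('INDEX_CLUSTER_IDX', 'UNORGANIZED_IDX');
```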