Pages

Sunday, October 6, 2013

My colleague Ronald Rood recently had a nice SQL challenge for me. He had to perform an export of all the tables in a schema the old fashioned way and wanted to manually parallellize the operation. For that to work, all tables need to be assigned to a group.

For the parallellization to work, the groups need to be balanced. If, say, all large tables are in one group, the parallellization won't help much, because the total time wouldn't be reduced much. So the question is how to distribute the tables evenly where all groups approximately have the same total size in number of bytes. Another requirement, which came later, was to also take into account the total number of tables, and distribute the number of tables evenly as well. The schema contained 90,000 tables from a standard software application, of which most tables were empty.

Ronald supplied a create table script for me to test with, simulating the content of his dba_tables dictionary view, which is the source for his query. The test table contains 14,999 records. Here is the table and a select statement that gives you an idea what's in the table.

A few explanatory remarks. I needed some extra
measures to help with the calculation of the average size of the groups:

the running sum of bytes

The analytic function "sum(bytes) over (order by bytes desc)" calculates this running sum. Divides by the :NUMBER_OF_GROUPS, this gives us the average size of the groups, needed in the algorithm.

the current size of all groups

The bytes "array" contains 14,999 elements, but the first NUMBER_OF_GROUPS (10) elements are used for the intermediate results of the group size.

the total number of elements

This one is calculated by a simple "count(*) over ()" and is needed in the UNTIL clause to stop the iteration when there are no more elements.

The whole statement is a tad ugly because the case expression has 10 repeating WHEN..THEN constructions, but it does the job and it allows you to play with the NUMBER_OF_GROUPS variable, which needs to be between 1 and 10. But, it is another nice use case for the SQL model clause, and it might come in handy for other DBA's or developers who need a single query to distribute their data evenly.

UPDATE:
Of course it has turned out the problem isn't new.Here is a blogpost from Brendan Furey about the same subject.
And here is the AskTom thread he refers to with several model clause queries and one using recursive subquery factoring.