Splitting a number of rows into equal groups

At the client site yesterday, one of the developers asked this question: “I want to take the large set of the data that I have in my gigantic table and split it up into balanced non-overlapping sets. I want the starting value in the set, the ending value and the number of records in that group. I need this to do my data processing logic for starting parallel concurrent processing of those large data sets. What is the best way to achieve this?”

This client site uses both Oracle and SQL Server so the solution had to work with both the RDBMS. Luckily enough, there is an analytic function in both which does this very easily for us. The function that we are talking about is NTILE(). Let’s take this up with an example:

SQL Server:

Let’s say that I want to split all the objects in sys.objects into say 10 different non-overlapping sets. And then as per the requirement, I need to list out the starting value in the set, the ending value and the total count of the records in the set. Here is a simple code sample that will do that:

select
min(object_id) as Starting_Value
, max(object_id) as Ending_Value
, count(*) as Total_Records
, grp_nbr as Group_Nbr
from
(
select object_id,
ntile(10) over (order by object_id) grp_nbr
from SYS.OBJECTS
) AS IV
group by grp_nbr;

Now, you will notice that not all sets have equal records but they are pretty close. That is by definition of the NTILE() function. Here is what BOL states:

If the number of rows in a partition is not divisible by expression, this will cause groups of two sizes that differ by one member. Larger groups come before smaller groups in the order specified by the OVER clause. For example if the total number of rows is 53 and the number of buckets is five, the first three buckets will have 11 rows and the two remaining buckets will have 10 rows each. If on the other hand the total number of rows is divisible by the number of buckets, the rows will be distributed evenly among the buckets. For example, if the total number of rows is 50, and there are five buckets, each bucket will contain 10 rows.

Oracle:

And the same function is available in Oracle as well. And the same code shown above will work in Oracle as well. Just replace SYS.OBJECTS with ALL_OBJECTS.

Pretty simple way of splitting a large data set into non-overlapping sets, isn’t it!