Main menu

Post navigation

SQL for the Balanced Number Partitioning Problem

I noticed a post on AskTom recently that referred to an SQL solution to a version of the so-called Bin Fitting problem, where even distribution is the aim. The solution, How do I solve a Bin Fitting problem in an SQL statement?, uses Oracle’s Model clause, and, as the poster of the link observed, has the drawback that the number of bins is embedded in the query structure. I thought it might be interesting to find solutions without that drawback, so that the number of bins could be passed to the query as a bind variable. I came up with three solutions using different techniques, starting here.

An interesting article in American Scientist, The Easiest Hard Problem, notes that the problem is NP-complete, or certifiably hard, but that simple greedy heuristics often produce a good solution, including one used by schoolboys to pick football teams. The article uses the more descriptive term for the problem of balanced number partitioning, and notes some practical applications. The Model clause solution implements a multiple-bin version of the main Greedy Algorithm, while my non-Model SQL solutions implement variants of it that allow other techniques to be used, one of which is very simple and fast: this implements the team picking heuristic for multiple teams.

Another poster, Stew Ashton, suggested a simple change to my Model solution that improved performance, and I use this modified version here. He also suggested that using PL/SQL might be faster, and I have added my own simple PL/SQL implementation of the Greedy Algorithm, as well as a second version of the recursive subquery factoring solution that performs better than the first.

This article explains the solutions, considers two simple examples to illustrate them, and reports on performance testing across dimensions of number of items and number of bins. These show that the solutions exhibit either linear or quadratic variation in execution time with number of items, and some methods are sensitive to the number of bins while others are not.

After I had posted my solutions on the AskTom thread, I came across a thread on OTN, need help to resolve this issue, that requested a solution to a form of bin fitting problem where the bins have fixed capacity and the number of bins required must be determined. I realised that my solutions could easily be extended to add that feature, and posted extended versions of two of the solutions there. I have added a section here for this.

Updated, 5 June 2013: added Model and RSF diagrams

Update, 18 November 2017: I have now put scripts for setting up data and running the queries in a new schema onto my GitHub project: Brendan’s repo for interesting SQL. Note that I have not included the minor changes needed for the extended problem where finding the number of bins is part of the problem.

Greedy Algorithm Variants

Say there are N bins and M items.

Greedy Algorithm (GDY)
Set bin sizes zero
Loop over items in descending order of size

Add item to current smallest bin

Calculate new bin size

End Loop

Greedy Algorithm with Batched Rebalancing (GBR)
Set bin sizes zero
Loop over items in descending order of size in batches of N items

Assign batch to N bins, with bins in ascending order of size

Calculate new bin sizes

End Loop

Greedy Algorithm with No Rebalancing – or, Team Picking Algorithm (TPA)
Assign items to bins cyclically by bin sequence in descending order of item size

Two Examples

Example: Four Items
Here we see that the Greedy Algorithm finds the perfect solution, with no difference in bin size, but the two variants have a difference of two.

Example: Six Items
Here we see that none of the algorithms finds the perfect solution. Both the standard Greedy Algorithm and its batched variant give a difference of two, while the variant without rebalancing gives a difference of four.

SQL Solutions

Original Model for GDY
See the link above for the SQL for the problem with three bins only.

The author has two measures for each bin and implements the GDY algorithm using CASE expressions and aggregation within the rules. The idea is to iterate over the items in descending order of size, setting the item bin to the bin with current smallest value. I use the word ‘bin’ for his ‘bucket’. Some notes:

Dimension by row number, ordered by item value

Add measures for the iteration, it, and number of iterations required, counter

My Model solution works for any number of bins, passing the number of bins as a bind variable. The key idea here is to use values in the first N rows of a generic bin value measure to store all the running bin values, rather than as individual measures. I have included two modifications suggested by Stew in the AskTom thread.

Dimension by row number, ordered by item value

Initialise a bin measure to the row number (the first N items will remain fixed)

The idea here is that in the initial RSF query a subquery factor of items was joined on a calculated field, so the whole record set had to be read, and performance could be improved by putting that initial record set into an indexed temporary table ahead of the main query. We'll see in the performance testing section that this changes quadratic variation with problem size into linear variation.

The idea here is that procedural algorithms can often be implemented more efficiently in PL/SQL than in SQL.

The first parameter to the function is a strongly-typed reference cursor

The SQL call passes in a SELECT statement wrapped in the CURSOR keyword, so the function can be used for any set of records that returns name and numeric value pairs

The item records are fetched in batches of 100 using the LIMIT clause to improves efficiency

Performance Testing
I tested performance of the various queries using my own benchmarking framework across grids of data points, with two data sets to split the queries into two sets based on performance.

Little variation of CPU time with number of bins for POSS and RSFT, but roughly linear variation for PLFN

These linear methods are much faster than the earlier quadratic ones for larger numbers of items

Its approximate proportionality of time to number of bins means that, while PLFN is faster than RSFT for small number of bins, it becomes slower from around 50 bins for our problem

The proportionality to number of bins for PLFN presumably arises from the step to find the bin of minimum value

The lack of proportionality to number of bins for RSFT may appear surprising since it performs a sort of the bins iteratively: However, while the work for this sort is likely to be proportional to the number of bins, the number of iterations is inversely proportional and thus cancels out the variation

Solution Quality

The methods reported above implement three underlying algorithms, none of which guarantees an optimal solution. In order to get an idea of how the quality compares, I created new versions of the second set of queries using analytic functions to output the difference between minimum and maximum bin values, with percentage of the maximum also output. I ran these on the same grid, and report below the results for the four corners.

The results indicate that GDY (Greedy Algorithm) and GBR (Greedy Algorithm with Batched Rebalancing) generally give very similar quality results, while TPA (Team Picking Algorithm) tends to be quite a lot worse.

Extended Problem: Finding the Number of Bins Required

An important extension to the problem is when the bins have fixed capacity, and it is desired to find the minimum number of bins, then spread the items evenly between them. As mentioned at the start, I posted extensions to two of my solutions on an OTN thread, and I reproduce them here. It turns out to be quite easy to make the extension. The remainder of this section is just lifted from my OTN post and refers to the table of the original poster.

Start OTN Extract
So how do we determine the number of bins? The total quantity divided by bin capacity, rounded up, gives a lower bound on the number of bins needed. The actual number required may be larger, but mostly it will be within a very small range from the lower bound, I believe (I suspect it will nearly always be the lower bound). A good practical solution, therefore, would be to compute the solutions for a base number, plus one or more increments, and this can be done with negligible extra work (although Model might be an exception, I haven't tried it). Then the bin totals can be computed, and the first solution that meets the constraints can be used. I took two bin sets here.

3 thoughts on “SQL for the Balanced Number Partitioning Problem”

Thanks a lot for sharing “Brendan’s Generic Model for GDY” – it’s been really useful.
I’m doing the bin fitting in chunks of rows and not all at once so I altered it to my needs to get the current bin sizes from another table for the initialization.

Hi, Thanks for offering so many solutions to choose from 🙂
I’m after something similar, I would be interested to use “Brendan’s Generic Model for GDY”, for varied number of bins. I tried this solution on my toad client, and :N_Bins bind variable is not visible (not sure if it’s my version ??) when I tried to run the query.

Say, if I have a table (test) which tells me how many bins (bin_count) to split on, how can I use this count value in the above algorithm.