Bin packing part 2: Packing it tighter

In my previous post, I explained the bin packing problem, explained an example scenario, and established a baseline for both speed and efficiency of bin packing algorithms by writing a rather crude cursor-based procedure. In this part, I will look at some small modifications that can be made to this code to make it better at packing bins as full as possible, so that less bins will be needed.

Reordering the cursor

The Baseline procedure didn’t process the registrations in any particular order. Granted, there is an ORDER BY Year, Quarter in the cursor declaration, but that is only needed to enable me to generate consecutive session numbers within each quarter without having to do a SELECT MAX(SessionNo) query. Had I elected to use an IDENTITY column for the Sessions table, I could even have omitted the ORDER BY completely.

Since there is no order specified for the registrations within the quarter, we can assume that they will be processed in some unspecified order. But maybe we can get a more efficient distribution of registrations if we order them before processing? I am, of course, not referring to ordering by subject code, as the actual subjects are irrelevant for the algorithm – I am referring to order by the number of candidates in a registration.

This is very easy to test, of course. All it takes is adding one extra column to the ORDER BY clause of the cursor definition. So I created the procedures dbo.OrderAsc and dbo.OrderDesc to test the effects of ordering by ascending or descending number of candidates (see OrderAsc.sql and OrderDesc.sql in the attached ZIP file).

Ordering by ascending number of candidates turned out to be a pretty lousy idea. Well, to be fair, I didn’t expect otherwise – after all, if you save all the biggest registrations for the last, you’ll have no smaller registrations left to fill up the gaps. In fact, all registrations for 51 or more candidates will get a session of their own, and will not be combined with any other registration. So it’s not surprising at all to see that this method results in a huge amount of extra sessions as compared to the baseline version – an increase of no less than 19.9%!

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1643578.11686014081.800000

2341981.7414446242.600000

3779677.62416617444.200000

4568673.60200415009.900000

ALL2333677.38322713194.625000

Equally unsurprising is the fact that changing the ORDER BY clause to sort by descending number of candidates results in more successful packing of the registrations in less sessions. This version saves 2.7% as compared to the baseline, as shown in this breakdown:

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1508598.855850581.800000

2279999.84780242.600000

3677189.3749817194.200000

4426898.055529829.900000

ALL1892395.4296352162.125000

Even though this part mainly focuses on efficiency achieved, the time taken still remains an important factor. So I ran these two procedures five times in a row, recorded execution times, and calculated the average.For dbo.OrderAsc, the average execution time was 78,531 ms, whereas dbo.OrderDesc clocked in at 77,934 ms. As you see, both versions are slightly slower than the baseline version. For the OrderDesc version, this is caused by the rapid growth of the number of sessions as the first, biggest registrations are processed first. This means that searching a session with enough empty space for a registration soon becomes a more time-consuming task. For OrderAsc, the reverse is true – since the smallest registrations are processed first, there will at first be only a few sessions. This means that this algorithm will be a lot faster at first – but once the bigger registrations are processed and the total number of sessions rapidly increases to be way more that that in the baseline version, this advantage is lost, and the time required to search for sessions with enough empty space soon gets so high that the advantage this algorithm had as first then turns into a disadvantage.

Sorting the registrations by ascending number of candidates within a quarter before processing them hurts both speed and packing efficiency of the algorithm; we can henceforth forget about this option. On the other hand, sorting by descending number of candidates increases the packing efficiency by 2.7%, though this comes at the price of a 5.7% increase in execution time. If I had to choose between these two, my pick would depend on the needs of the organization I’m working for – if the cost per session is high and plenty of time is available for the computation, I’d go with the ordered version, but if speed is moreimportant than saving those few extra sessions, I’d use the unsorted one.

Less obvious orderings

But why choose between these two only? I have thus far only considered the obvious sort orders, ascending and descending. Why not try a few more variations?

Thinking about the even distribution of data generated for the first quarter of each year in my test set, I considered that, were I tasked with manually combining the registrations as efficient as possible, I’d probably start by making sessions by combining two registrations for 50 candidates, than combining a registration for 51 candidates with one for 49, and so on. All of these sessions would total 100 candidates, and because of the even distribution of data, I expect roughly the same number of registrations for each size so I’d have only a few spare sessions left in the end.

That technique can’t be exactly mimicked by changing the sort order of the cursor. There are other ways to mimic it, though – but I’ll leave those for a future post J. But we can simulate this effect by ordering the registrations so that those with 50 candidates come first, then those with 49 and 51 candidates, and so on. This is done by changing the ORDER BY clause in the cursor definition to order by the “distance” between the number of candidates and the magic number 50, being half the maximum session size:

ORDERBYYear,Quarter,

ABS(NumCandidates -(@MaxCandidatesPerSession / 2.0))ASC;

I didn’t hardcode the number 50, because I wanted my stored procedures to be fit for any maximum number of candidates. I divide by 2.0 instead of just 2 so that for an odd maximum session size (e.g. 25), the fraction is retained and registrations for 12 and 13 candidates are kept together because they are the same distance from half the maximum size (12.5).

It is of course also possible to use DESC instead of ASC to start with registrations for 100 candidates, then those for 99 or 1, and so on, saving the 50-candidate registration for the last. Both these versions are included in the attached ZIP file, in the files Order50First.sql and Order50Last.sql.

These versions both took slightly more time than the baseline version when I tested them on my laptop: 76,807 ms for dbo.Order50First, and 77,322 ms for dbo.Order50Last. The packing efficiency of dbo.Order50First is better than the baseline, but not as good as that of dbo.OrderDesc:

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1509698.642464691.800000

2280599.634224102.600000

3678089.2563427284.200000

4426998.032560839.900000

ALL1895095.2936672229.625000

For dbo.Order50Last, the resulting number of sessions is even more than we had in the baseline!

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1544492.3368844171.800000

2303292.1748022372.600000

3692887.3495958764.200000

4461790.6434914319.900000

ALL2002190.1960444907.125000

The reason for the disappointing efficiency of the dbo.Order50First procedure is that there is no control over the order of the registrations that have the same distance to 50. So it is quite possible, for instance, to start with a bunch of registrations for 49 candidates that will promptly be combined to sessions for 98 candidates each – so that, when the 51-sized registrations start coming in, they have to get sessions of their own. In an attempt to fix that, I tweaked the sort order some more, making sure that the for registrations with the same distance from 50, the “bigger” registrations come before the “smaller” ones.

ORDERBYYear,Quarter,

ABS(NumCandidates -(@MaxCandidatesPerSession / 2.0))ASC,

NumCandidates DESC;

With this ORDER BY clause, I can be certain that all 51-candidate registrations are processed first, each getting its own session. After that, the 49-candidate registrations will exactly fill out all those sessions. This version (enclosed in Order50FirstB.sql) had a slightly better packing ration than dbo.Order50First – but still not as good as dbo.OrderDesc. Here are the results, which took 75,547 ms (on average) to achieve:

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1508998.778148621.800000

2280499.66975792.600000

3677189.3749817194.200000

4426898.055529829.900000

ALL1893295.3842702184.625000

After these tests, there was still one thing left I wanted to try. Starting with registrations for 50 places was based on an idea for evenly distributed data. For other distributions, this might turn out to be a much worse idea (though the results don’t show as much). But what if, instead of starting at half the maximum session size, we start at the average registration size? For evenly distributed data, this should work out approximately the same. But maybe this order achieves a better packing ratio for other distributions? Let’s find out.

Ordering by distance from the average session size for a quarter can be accomplished by using a correlated subquery in the ORDER BY clause (compatible with all versions of SQL Server), or by using an AVG function with the OVER clause (only SQL Server 2005 and up):

ORDERBY a.Year, a.Quarter,

ABS(a.NumCandidates -(SELECTAVG(b.NumCandidates * 1.0)

FROMdbo.Registrations AS b

WHEREb.Year= a.Year

ANDb.Quarter = a.Quarter))ASC;

or

ORDERBY a.Year, a.Quarter,

ABS(a.NumCandidates -AVG(a.NumCandidates * 1.0)

OVER(PARTITIONBY a.Year, a.Quarter))ASC;

Surprisingly, when dry-testing the query by itself, the correlated subquery turned out to be faster than the one using the OVER clause, so I didn’t have to sacrifice speed for backward compatibility. I used the correlated subquery, both with the ASC and the DESC sort option (see OrderHalfFirst.sql and OrderHalfLast.sql in the attachment), to test the two possible variations of this option. Both versions turned out to be quite inefficient packers, since they both took more sessions than the baseline. Here are the results of dbo.OrderHalfFirst, acquired in 75,056 ms:

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1512098.180078931.800000

2326485.6231614692.600000

3677189.3749817194.200000

4504482.9700638589.900000

ALL2019989.4012075352.125000

And OrderHalfLast, after running 79,294 ms, produced these results:

Quarter NumSessions AvgSessionSizeAvgEmptySeats

------- ----------- ---------------- ----------------

1603683.28064910091.800000

2294195.0268611462.600000

3779677.62416617444.200000

4495184.5285807659.900000

ALL2172483.1253459164.625000

Conclusion

I’ve investigated many options to increase packing efficiency. It turns out that, at least with the test data I used, just starting with the biggest registration and working down to the smallest yields the best results. This is not the fastest option, though. The baseline version discussed in the previous episode of this series is still fastest. So the choice would appear to depend on the requirements of your application – if you have plenty of time and computer capacity but need to use as little sessions as possible, go for the dbo.OrderDesc version. If execution time is of utmost importance and a few extra sessions are no big deal, then stick to the baseline (for now).

If you are in search of a solution that offers both speed and efficient packing, then the dbo.Order50FirstB version seems to be the right choice. It is only 0.05% less efficient than the best packer (dbo.OrderDesc), but over 3% faster. In the next episode I’ll be looking at ways to make the algorithm go faster. I’ll be making huge performance improvements – but packing efficiency will suffer. How much? As soon as I have completed all my tests and written the accompanying text, you’ll read it. Right here on sqlblog.com.

Leave a Comment

About Hugo Kornelis

Hugo is co-founder and R&D lead of perFact BV, a Dutch company that strives to improve analysis methods and to develop computer-aided tools that will generate completely functional applications from the analysis deliverable. The chosen platform for this development is SQL Server. In his spare time, Hugo likes to visit the SQL Server newsgroups, in order to share and enhance his knowledge of SQL Server.