amaguk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have some questions about a practical problem. I have a directory with a lot of files. I want to burn these files to CDs, but how can I maximise the content of each CD and minimise the total number of CDs?

Is there an existing script? (I found nothing during my search, but...)

Is there a good algorithm?

My first thoughts are:
- number of CDs = round up (total size in Mo / 700 Mo)
- on CD1, I put the largest file if the total size of CD1 stays < 700 Mo, otherwise it goes on the following CD;
- on CD2, I put the second largest file if the total size of CD2 stays < 700 Mo, otherwise it goes on the following CD;
- on CD3, I put the third largest file if the total size of CD3 stays < 700 Mo, otherwise it goes on the following CD;
- on CDn, I put the nth largest file if the total size of CDn stays < 700 Mo, otherwise it goes on the following CD;
- and then I loop back to the first CD!
- If there are still files left and all my n CDs are full, I create a new CD and put these files on it.
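The first bullet can be sketched in a few lines of Python. This is only an illustration: the function name and the file sizes are made up, and the 700 Mo capacity is the one assumed in the post.

```python
import math

CD_CAPACITY_MO = 700  # capacity assumed in the post


def min_cd_count(sizes_mo, capacity=CD_CAPACITY_MO):
    """Lower bound on the CD count: total size / capacity, rounded up."""
    return math.ceil(sum(sizes_mo) / capacity)


sizes = [650, 400, 300, 120, 80, 50]  # made-up file sizes in Mo
print(min_cd_count(sizes))  # 1600 / 700 rounded up -> 3
```

Note that this is only a lower bound. Ten files of 351 Mo give a bound of 6, yet no two of them fit on the same CD, so 10 CDs are really needed.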

amaguk,
This is like the knapsack problem, but not quite. The difference is that you don't need to hit an exact target; you just need to not waste any more CDs than an exact target would.

For instance, say you have 2GB worth of files. A perfect solution would have that fitting on 3 CDs with room to spare. As long as your solution doesn't require 4 CDs, you have sufficiently solved the problem. I got into a heated debate on this exact same problem in IRC some time ago and could have sworn that I posted about it here, but I can't find it. There is a recent similar thread (Burning ISOs to maximize DVD space), which mentions Algorithm::Bucketizer, which I haven't tried myself. I would attempt the following:

$buckets = int($total_size / 700) + 1

Order files by size in descending order

Round robin files (1 per bucket)

When you encounter the first file that will not fit, stay with that bucket but continue down the list until you find one that fits

On the next bucket, start back at the top of the file list

Wash, rinse, repeat
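For what it's worth, here is one possible reading of the steps above as a minimal Python sketch. It is an interpretation, not a tested version of the original proposal: the function name, the made-up file sizes, and the rule of opening a new disc when nothing fits anywhere are all assumptions.

```python
def pack_round_robin(sizes, capacity=700):
    """Deal files to discs in turns; each turn takes the largest file that fits."""
    if any(size > capacity for size in sizes):
        raise ValueError("a file is larger than a whole disc")
    remaining = sorted(sizes, reverse=True)    # descending order
    n_discs = int(sum(sizes) // capacity) + 1  # $buckets from the first step
    discs = [[] for _ in range(n_discs)]
    while remaining:
        placed = False
        for disc in discs:                     # round robin, one file per disc
            free = capacity - sum(disc)
            # Start back at the top of the list and walk down until a file fits.
            for i, size in enumerate(remaining):
                if size <= free:
                    disc.append(remaining.pop(i))
                    placed = True
                    break
        if not placed:
            discs.append([])                   # assumption: open an extra disc
    return discs


print(pack_round_robin([650, 400, 300, 120, 80, 50]))
# [[650, 50], [400, 120], [300, 80]]
```

With these sizes (1600 in total) the initial estimate of 3 discs is enough and nothing spills over.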

I am pretty sure the method will work. I was going to test it but the person complaining in IRC wouldn't provide a list of file sizes for me to try it out on and I wasn't motivated enough to make some up.

It won't always work, though I can't prove how far off it'll be. Mainly, it should work pretty well when you have lots of extra space, but an approximate solution won't be "close enough" if you end up too close to exactly filling all the discs. Consider files of size 350, 349, 233, 232, and 231. You can fit them on two 700 Mb discs (350+349, 233+232+231), but your algorithm will use 3 discs. (If you tried to use only 2, you'd end up with 350+233 and 349+232, and the 231 wouldn't fit on either.)
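The arithmetic of this counterexample is easy to check. A small Python sketch (the variable names are mine, and the 700 capacity is the one used in the thread):

```python
CAPACITY = 700

# The packing that fits on two discs.
optimal = [[350, 349], [233, 232, 231]]
# What the two discs hold after the four largest files are dealt out in turns.
round_robin = [[350, 233], [349, 232]]

print([sum(disc) for disc in optimal])        # [699, 696]
free = [CAPACITY - sum(disc) for disc in round_robin]
print(free)                                   # [117, 119]
print(any(231 <= space for space in free))    # False: 231 needs a third disc
```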

What I can't prove without a lot more thought is whether you're ever going to be off by more than a single disc, and what can't be proven at all is whether that's close enough for real-world purposes. (What's "acceptable" in the real world has to do with how long you're willing to wait for an answer vs. how much you care about that extra disc, and other factors.) But just know that the greedy approach isn't only suboptimal in theory: it will sometimes cost you an actual extra disc.

What I didn't say explicitly, but was implied by my bullet points, was that 1 disc is being added to account for perfect (or even near-perfect) fits. The knapsack problem is hard, but we aren't trying to break encryption; we are trying to save a few pennies on CDs. I don't think (though I could be wrong) that it will ever waste more than 1 disc. To make matters more difficult, we aren't talking about a handful of files but more likely hundreds if not thousands. Let's say that the total size is an exact multiple of 1 CD. That means every single CD needs to be an exact match (which may not even be possible). Proving whether it can or cannot be done might take a while (extreme sarcasm). Why not just go with a "good enough" solution?

Update 2008-11-26: It turns out that this heuristic approach can use as many as 11/9 OPT + 1 bins (according to bin packing). While my experience has been that 1 extra is all you will ever need, it is possible to need more.
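A bound of the 11/9 OPT form is the kind of guarantee usually stated for first-fit decreasing (FFD) in the bin packing literature. FFD is not quite the round-robin method described earlier, so the following Python sketch shows a swapped-in technique for comparison only; the function name is an assumption, and it says nothing about whether the round-robin variant shares the same bound.

```python
def first_fit_decreasing(sizes, capacity=700):
    """Place each file, largest first, on the first disc with enough room."""
    bins = []
    for size in sorted(sizes, reverse=True):
        if size > capacity:
            raise ValueError("a file is larger than a whole disc")
        for contents in bins:
            if sum(contents) + size <= capacity:
                contents.append(size)
                break
        else:
            bins.append([size])  # no existing disc had room: open a new one
    return bins


print(first_fit_decreasing([350, 349, 233, 232, 231]))
# [[350, 349], [233, 232, 231]]
```

On the five-file example from earlier in the thread, FFD happens to find the two-disc packing. That is one data point, not a guarantee.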

MidLifeXis,
If each file is ($bucketsize / 2) + 1, only 1 file can fit per CD with either method, so mine still only wastes the 1 extra CD. I fail to see how your worst-case scenario would make my solution use more than 1 extra CD.
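To put numbers on that scenario, here is a small Python sketch. It assumes the case is n files that each take just over half a disc; the function name and the choice of 10 files are made up for illustration.

```python
def half_plus_one_case(n_files, capacity=700):
    """Return (initial bucket estimate, discs any packing needs) for this case."""
    size = capacity // 2 + 1               # 351 for a 700 disc
    assert 2 * size > capacity             # two such files never share a disc
    estimate = (n_files * size) // capacity + 1
    required = n_files                     # hence exactly one file per disc
    return estimate, required


print(half_plus_one_case(10))  # (6, 10)
```

The initial estimate falls well short of the number of discs needed, but since even a perfect packing needs one disc per file here, nothing is wasted relative to the optimum.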

This question has tended to come up quite often lately. As has already been pointed out to you, it's basically the knapsack problem, which is known to be a "hard" problem in general. However, a practical answer may depend on the actual average file size: if you only have files of about, say, 1Mb or less, or at least have plenty of such files alongside potentially larger ones, then you may be content with a suboptimal solution that fills up the leftover space with as many of those small files as possible.

As a side note, outside of France (as far as I know) Mo is spelled Mb...

You could always use an archiving tool to compress all of the files and span them to the desired media size. I dusted off a version of pkzipc (on Windows) and had a play. The following command compresses the data and creates a number of 700Mb files suitable for dropping onto CD.
pkzipc -add -span=1.44 c:save.zip *.doc

I've already thought about archive tools, but the problem comes when I want to read a specific file from one disc: I must rebuild the archive and extract it. That's too much effort for one file.

And it's not homework ;) Just a practical problem (I have a lot of PDF files from articles, book excerpts, etc., and I want to store all these files on CDs), plus my curiosity about doing this efficiently.