I have a directory tree that I would like to back up to optical disks. Unfortunately, it exceeds the size of any one disk (it's about 60GB). I am looking for a script that would split this tree into appropriately sized chunks with hard links or whatnot (leaving the original untouched). I could then feed these bite-size trees into the backup process (add PAR2 redundancy, etc.).

It doesn't need to be a fancy script, and it seems like something that might already have been done. Suggestions?

(Spanning and writing in one step is a no-go because I want to do more stuff before the files get burned.)
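The core of what's asked here — hard-linking files into size-bounded chunk trees, leaving the originals untouched — fits in a few lines of portable shell. This is only a sketch of the idea; the function name `split_tree` and the greedy, listing-order packing policy are my own assumptions, not an existing tool:

```shell
#!/bin/sh
# split_tree SRC DEST MAX
# Walk SRC and hard-link each file into DEST/chunk1, DEST/chunk2, ...
# starting a new chunk whenever the running total would exceed MAX bytes.
# Greedy, in `find` listing order; originals are left untouched.
split_tree() {
    src=$1 dest=$2 max=$3
    n=1 used=0
    mkdir -p "$dest/chunk$n"
    find "$src" -type f | while IFS= read -r f; do
        size=$(wc -c < "$f")
        # Roll over to the next chunk if this file won't fit
        # (a file larger than MAX still gets a chunk of its own).
        if [ $((used + size)) -gt "$max" ] && [ "$used" -gt 0 ]; then
            n=$((n + 1)) used=0
            mkdir -p "$dest/chunk$n"
        fi
        rel=${f#"$src"/}
        mkdir -p "$dest/chunk$n/$(dirname "$rel")"
        ln "$f" "$dest/chunk$n/$rel"   # hard link: no data copied
        used=$((used + size))
    done
}
```

Hard links require DEST to be on the same filesystem as SRC; swap `ln` for `ln -s` if that's not the case (symlinked trees then need e.g. `mkisofs -f` to dereference when burning).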

I once made an ugly script for a similar purpose. It is just a kludge, but when I wrote it I didn't care about execution time or prettiness. I'm sure there are more "productified" versions of the same concept around, but if you want some ideas or something to start hacking on, here goes (I wrote it in 2008, so use at your own risk!) :-)

@Gilles: I've done plenty of reading since 2008 ;-) Changes to make the script more generic are good. (I dislike the introduction of [ as opposed to test, though)...
–
MattBianco Mar 29 '11 at 14:54

You should lowercase most of those variables. By convention, environment variables (PAGER, EDITOR, SHELL, ...) and internal shell variables are capitalized; all other variable names should contain at least one lowercase letter. This convention avoids accidentally overriding environment and internal shell variables.
–
Chris Down Sep 18 '11 at 21:56

distribute -- Distribute a collection of packages on multiple CDs (especially good for future use with APT)

Description:
The `distribute' program makes it easier to do the tasks related to creating a CD set for distributing a collection of packages. The tasks include: laying out the CDs' filesystem (splitting the large number of packages into several discs etc.), preparing the collection for use by APT (indexing), creating ISO images and recording the discs.

Periodic updates to the initially distributed collection can be issued with the help of `distribute'.

It does the whole process in several stages: at one stage, it creates the future disk "layouts" by using symlinks to the original files, so you can intervene and change the future disk trees.

The details about its usage can be read in the help message printed by the script (or by looking into the source code).

It was written with a trickier use case in mind (issuing updates as a "diff", i.e., the set of newly added files, to the originally recorded collection of files), so it includes one extra initial stage: "fixing" the current state of the collection. For simplicity, it does this by replicating the original collection by means of symlinks, in a special working place for saving the states of the collection; then, some time in the future, it can compute a diff between the then-current state of the collection and this saved state. So, although you might not need this feature, you can't skip this initial stage, AFAIR.

Also, I'm not sure now (I wrote it quite a few years ago) whether it handles complex trees well, or whether it is only supposed to split plain (one-level) directories of files. (Please look into the help message or the source code to be sure; I'll look this up, too, a bit later, when I have some time.)

The APT-related stuff is optional; if you don't need it, just ignore the fact that it can prepare package collections for use by APT.

If you get interested, of course, feel free to rewrite it to your needs or suggest improvements.

(Note that the package includes additional useful patches that are not applied in the code listing at the Git repo linked above!)

We shouldn't forget that the essence of the task is indeed quite simple; as put in a tutorial on Haskell (which is built around working through a solution to this very task, incrementally refined):

Now let's think for a moment about how
our program will operate and express
it in pseudocode:

main = Read list of directories and their sizes.
Decide how to fit them on CD-Rs.
Print solution.

Sounds reasonable? I thought so.

Let's simplify our life a little and
assume for now that we will compute
directory sizes somewhere outside our
program (for example, with "du -sb *")
and read this information from stdin.
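That pseudocode translates almost directly into shell, the language of the scripts discussed in this thread. A minimal sketch — the function name `plan_disks` is my own, and the greedy in-order policy is an assumption matching the simplest approach:

```shell
#!/bin/sh
# plan_disks MAX
# Read "size name" pairs from stdin (e.g. the output of `du -sb *`)
# and print which disk each entry lands on. Greedy: keep adding entries
# to the current disk, in input order, until MAX bytes would be exceeded.
plan_disks() {
    max=$1 disk=1 used=0
    while read -r size name; do
        if [ $((used + size)) -gt "$max" ] && [ "$used" -gt 0 ]; then
            disk=$((disk + 1)) used=0
        fi
        used=$((used + size))
        printf 'disk %d: %s\n' "$disk" "$name"
    done
}
```

Usage would be something like `du -sb * | plan_disks 650000000` for a 650MB CD-R; the printed plan can then be edited by hand before anything is burned.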

(Additionally, in your question, you'd like to be able to tweak (edit) the resulting disk layouts, and then use a tool to burn them.)

You could re-use (adapt and re-use) a simple variant of the program from that Haskell tutorial to split your file collection.

Unfortunately, in the distribute tool that I've mentioned here in another answer, the simplicity of the essential splitting task is buried under the complexity and bloatedness of distribute's user interface (it was written to combine several tasks which, although performed in stages, are combined not in the cleanest way I could devise now).

To help you make some use of its code, here's an excerpt from the Bash code of distribute (at line 380) that performs this "essential" task of splitting a collection of files:

Note that the eatFiles function prepares the layouts of the future disks as trees whose leaves are symlinks to the real files. So it meets your requirement of being able to edit the layouts before burning. The mkisofs utility has an option to follow symlinks, which is indeed used in the code of my mkiso function.

The presented script (which you can take and rewrite to your needs, of course!) follows the simplest idea: sum the sizes of the files (or, more precisely, packages in the case of distribute) in the order they were listed, without any rearrangement.

The "Hitchhikers guide to Haskell" takes the optimization problem more seriously and suggests program variants that try to re-arrange the files smartly so that they fit better on the disks (and require fewer disks):

Enough preliminaries already. Let's go pack some CDs.

As you might already have recognized,
our problem is a classical one. It is
called a "knapsack problem"
(google it up, if you don't know
already what it is. There are more
than 100000 links).
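One classic heuristic for this knapsack-style packing is first-fit decreasing: sort items by size, largest first, then place each one on the first disk where it fits. A rough POSIX-shell sketch of that idea (the function name `ffd_pack` is mine; this is not what distribute or the tutorial actually implement):

```shell
#!/bin/sh
# ffd_pack MAX
# Read "size name" pairs from stdin, sort them by size descending,
# and place each item on the first already-open disk it fits on,
# opening a new disk only when none has room (first-fit decreasing).
ffd_pack() {
    max=$1 ndisks=0
    sort -rn | while read -r size name; do
        placed=0 i=1
        while [ "$i" -le "$ndisks" ]; do
            eval "used=\$used_$i"        # per-disk totals in used_1, used_2, ...
            if [ $((used + size)) -le "$max" ]; then
                eval "used_$i=$((used + size))"
                printf 'disk %d: %s\n' "$i" "$name"
                placed=1
                break
            fi
            i=$((i + 1))
        done
        if [ "$placed" -eq 0 ]; then
            ndisks=$((ndisks + 1))
            eval "used_$ndisks=$size"
            printf 'disk %d: %s\n' "$ndisks" "$name"
        fi
    done
}
```

First-fit decreasing is only an approximation (bin packing is NP-hard), but it is guaranteed to use at most about 11/9 of the optimal number of disks, which is usually good enough for backups.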

Other smart tools

I've been told that Debian uses a tool to make its distro CDs that is smarter than my distribute with respect to collections of packages: its results are nicer because it cares about inter-package dependencies and tries to make the collection of packages on the first disk closed under dependencies, i.e., no package on the 1st disk should require a package from another disk (or at least, I'd say, the number of such dependencies should be minimized).

"bite-size trees" doesn't quite sound like rar, unless you unpack each "part" again into its own directory, which of course won't work, since the parts are not designed like that and are not split on file boundaries.
–
MattBianco Mar 28 '11 at 11:16


If we're talking about tools that give tar+split-like results, there's also dar; here's the note about its relevant feature: "(SLICES) it was designed to be able to split an archive over several removable media whatever their number is and whatever their size is". Compared to tar+split, I assume it allows easier access to the archived files. (BTW, it also has a feature resembling distribute: "DIFFERENTIAL BACKUP" & "DIRECTORY TREE SNAPSHOT", but one may not like that the result is in a special format, not an ISO with a directory tree.)
–
imz -- Ivan Zakharyaschev Mar 29 '11 at 23:43