"Geometry" - an idea.

Introduction

The handling of disklabels, slices and whatnot has until now been
a rather ad-hoc proposition. Some lump of code somewhere did it,
partly through magic, and that was that.

This is my proposal to make a structural framework for this area of
a UNIX kernel.

The same "basic" kind of setup under "geometry" would look like this:

Now the SCSI and WD drivers just provide access to the hardware; they
don't know anything about layout. Separate "methods" handle the
layout, slicing and partitioning.

The red circles mark "geometry" devices. These are the points
in the graph which can be accessed from /dev/something or other.

A geometry device has the following public properties:

A name like "sd0", "sd0s1", "mirror1" or "foo". This is the name
which will appear in /dev for this device.

A sector size. I/O must be done in transactions of an integral number
of sectors of this size.

A size in number of sectors.

The name of the method providing this device.
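The four properties above can be sketched as a small record. This is only an illustration in Python, not kernel code: the class name GeomDevice, its field names and the check_io helper are all hypothetical, and byte offsets are an assumption made to show the "integral number of sectors" rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeomDevice:
    """Public properties of a geometry device (hypothetical field names)."""
    name: str          # name that appears in /dev, e.g. "sd0s1"
    sector_size: int   # bytes per sector
    size: int          # device size, in sectors
    method: str        # name of the method providing this device

    def check_io(self, offset: int, length: int) -> None:
        """Reject I/O that is not a whole number of sectors.

        offset and length are in bytes (an assumption of this sketch).
        """
        if offset < 0 or length <= 0:
            raise ValueError("offset must be >= 0 and length > 0")
        if offset % self.sector_size or length % self.sector_size:
            raise ValueError("I/O must be a whole number of sectors")
        if offset + length > self.size * self.sector_size:
            raise ValueError("I/O runs past the end of the device")


sd0s1 = GeomDevice(name="sd0s1", sector_size=512, size=2048, method="MBR")
sd0s1.check_io(offset=1024, length=4096)   # accepted: sectors 2 through 9
```

A transfer that starts or ends in the middle of a sector raises ValueError instead of being silently rounded.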

Now, let's look at a more advanced setup:

This is basically a machine with two mirrored disks. I will
use this to illustrate an important concept of "geometry": on-the-fly insertion.

When the machine boots, let's say from sd0, we need to find a suitable
root filesystem. Since we want to be backwards compatible, the MBR and
BSD methods will be self-identifying, i.e. they will examine the available
devices and instantiate themselves on those devices on which they find
their respective magic sectors.
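A rough sketch of such self-identification follows. The function names and the "taste" terminology are hypothetical. The MBR check uses the 0x55 0xAA signature in the last two bytes of sector 0. The BSD check assumes the disklabel magic number 0x82564557 is stored little-endian at the start of sector 1, as on i386; other platforms place the label elsewhere. The sketch only examines raw devices; in the full scheme, devices created by one method would presumably be offered to the other methods as well.

```python
import struct

SECTOR = 512
MBR_SIGNATURE = b"\x55\xaa"    # last two bytes of sector 0
BSD_DISKMAGIC = 0x82564557     # magic number of a BSD disklabel


def mbr_taste(read_sector):
    """True if sector 0 ends with the MBR signature."""
    return read_sector(0)[510:512] == MBR_SIGNATURE


def bsd_taste(read_sector):
    """True if sector 1 starts with the disklabel magic (i386 layout assumed)."""
    (magic,) = struct.unpack_from("<I", read_sector(1), 0)
    return magic == BSD_DISKMAGIC


METHODS = {"MBR": mbr_taste, "BSD": bsd_taste}


def self_identify(devices):
    """Return (method, device name) pairs for every magic sector found."""
    found = []
    for name, read_sector in devices.items():
        for method, taste in METHODS.items():
            if taste(read_sector):
                found.append((method, name))
    return found


def make_disk(sectors):
    """Build a read_sector callable over an in-memory image (demo helper)."""
    image = b"".join(sectors)
    return lambda n: image[n * SECTOR:(n + 1) * SECTOR]


blank = bytes(SECTOR)
mbr_sector = bytes(510) + MBR_SIGNATURE
label_sector = struct.pack("<I", BSD_DISKMAGIC) + bytes(SECTOR - 4)

devices = {
    "sd0": make_disk([mbr_sector, label_sector]),
    "sd1": make_disk([blank, blank]),
}
claims = self_identify(devices)   # both methods claim sd0, nothing claims sd1
```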

So at the time when /sbin/init gets executed the picture looks like
this:

So before we mount anything read/write, we want to activate the mirroring:

Dismantle the BSD method on sd1 (the top right box)

Dismantle the MBR method on sd1 (the one right below)

Dismantle the MBR method on sd0

Now, how and why can we do that? Well, in this case we use the "dangerously
dedicated" mode. The MBR represents a 1:1 mapping in that case,
and since it is transparent we can remove it without affecting the mounted
filesystem.
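The transparency argument can be sketched like this. The Node class, its fields and the dismantle function are hypothetical, and the sketch assumes that "1:1 mapping" means the method starts at sector 0 of the device below it and covers that device completely.

```python
class Node:
    """One geometry device in the graph; `below` is the device it sits on."""

    def __init__(self, name, size, below=None, offset=0):
        self.name = name
        self.size = size        # in sectors
        self.below = below      # lower device, or None for a hardware driver
        self.offset = offset    # first sector used on the lower device

    def is_transparent(self):
        """True if every sector maps to the same sector number below."""
        return (self.below is not None
                and self.offset == 0
                and self.size == self.below.size)


def dismantle(node, consumers):
    """Remove `node`, re-pointing everything above it at the device below."""
    if not node.is_transparent():
        raise ValueError(f"{node.name} is not a 1:1 mapping")
    for consumer in consumers:
        if consumer.below is node:
            consumer.below = node.below


sd0 = Node("sd0", size=4096)
mbr = Node("sd0s1", size=4096, below=sd0)    # dangerously dedicated: whole disk
bsd = Node("sd0s1a", size=1024, below=mbr)   # a partition inside the slice

dismantle(mbr, [bsd])   # bsd now sits directly on sd0, same sector numbers
```

A method that shifts or shrinks the address space, such as the BSD partition here, fails the check and cannot be pulled out from under a mounted filesystem this way.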

Now it looks like this:

Next, using the same set of conditions we enable the mirror:

Insert mirror between SCSI/sd0 and BSD

Attach SCSI/sd1 to mirror

The reason why we can insert a mirror just like that is that the mirror
is also a 1:1 mapping when it has only one child.
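The two steps can be sketched with in-memory stand-ins. Every class here is hypothetical. Reading from the first child and bringing a newly attached child up to date with a full copy are assumptions of the sketch; the text above does not say how synchronisation would be done.

```python
class Disk:
    """In-memory stand-in for a hardware driver such as SCSI/sd0."""

    def __init__(self, size, sector_size=512):
        self.size = size
        self.data = [bytes(sector_size) for _ in range(size)]

    def read(self, n):
        return self.data[n]

    def write(self, n, buf):
        self.data[n] = bytes(buf)


class Mirror:
    """Writes go to every child; with one child this is a 1:1 mapping."""

    def __init__(self, first):
        self.children = [first]
        self.size = first.size

    def read(self, n):
        return self.children[0].read(n)

    def write(self, n, buf):
        for child in self.children:
            child.write(n, buf)

    def attach(self, child):
        # Assumption: a new child is synchronised by copying every sector.
        for n in range(self.size):
            child.write(n, self.read(n))
        self.children.append(child)


class Partition:
    """Stand-in for the BSD method: a window starting at `offset` on `below`."""

    def __init__(self, below, offset):
        self.below = below
        self.offset = offset

    def read(self, n):
        return self.below.read(self.offset + n)

    def write(self, n, buf):
        self.below.write(self.offset + n, buf)


sd0, sd1 = Disk(8), Disk(8)
bsd = Partition(sd0, offset=2)
bsd.write(0, b"a" * 512)    # written before the mirror exists

mirror = Mirror(sd0)        # one child: still a 1:1 mapping onto sd0
bsd.below = mirror          # insert mirror between SCSI/sd0 and BSD
mirror.attach(sd1)          # attach SCSI/sd1 to mirror
bsd.write(1, b"b" * 512)    # now lands on both disks
```

The BSD layer never notices the change: it keeps reading and writing the same sector numbers before and after the insertion.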

Now we're back to the setup we started with:

There are really no limits to what a method can do. Here is a
bestiarium of some of the ones I can imagine:

BSD

Understands BSD style disklabels

MBR

Understands DOS/MBR/FDISK style disklabels

MIRROR

Mirrors data over multiple lower devices.

CONCAT

Concatenates a number of lower devices into one larger device

STRIPE

Like CONCAT, but with interleaved layout.

RAID-5

RAID-5 method over a number of lower devices.

INTERLEAVE

This is the opposite of STRIPE. It interleaves a number of upper
devices onto one lower device. For two interleaved devices, all the
even numbered sectors on the lower device will belong to the first upper
device and the odd numbered ones to the other.

COW

"Copy On Write". Imagine the case where you had a nasty crash and fsck barfs
badly over one of your filesystems. The temptation to just run "fsck
-y" is there, but what will happen? Well, you put a "COW" on
your device, and tell the "COW" to use your swap partition for temporary
storage. Then you run fsck -y on your COWed device. The COW module
will look just like a normal device, but all the writes fsck does will
be stored in the temporary storage until you tell COW to "commit".
So if fsck -y looked OK, you mount the device, peek around, find nothing
important missing and tell "COW" to commit. COW will then copy all
the blocks written by fsck from temporary storage to the "real" device
and we're all happy. If, on the other hand, fsck -y removed pretty
much everything on the filesystem, you will probably tell "COW" to
"abandon" and take the long road home to recovery. Call it the "What
if?" method if you like, but it is my favourite method.

APPLE, SUN, MVS, XENIX

These are various methods to read the disklabels as they look on various
other machines and OSs.

"YOUR IDEA GOES HERE"

A method should hopefully be something very simple to write, so if you
have a good idea...
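To give a feel for how small a method might be, here is a sketch of two entries from the list: the sector arithmetic of INTERLEAVE and the commit/abandon behaviour of COW. The function and class names are hypothetical, the "real" device is modelled as a plain list of sectors, and a dictionary stands in for the swap partition used as temporary storage.

```python
def interleave_map(upper_index, sector, n_upper):
    """INTERLEAVE: map a sector of one upper device to a lower-device sector.

    With two upper devices, device 0 gets the even lower sectors and
    device 1 the odd ones.
    """
    return sector * n_upper + upper_index


class Cow:
    """Copy-on-write: writes are held aside until commit() or abandon()."""

    def __init__(self, real):
        self.real = real     # list of sectors standing in for the real device
        self.scratch = {}    # sector number -> data; stand-in for swap space

    def read(self, n):
        # Pending writes shadow the real device.
        if n in self.scratch:
            return self.scratch[n]
        return self.real[n]

    def write(self, n, buf):
        self.scratch[n] = bytes(buf)

    def commit(self):
        """Copy every pending block to the real device."""
        for n, buf in self.scratch.items():
            self.real[n] = buf
        self.scratch.clear()

    def abandon(self):
        """Throw the pending blocks away; the real device is untouched."""
        self.scratch.clear()


real = [b"old"] * 4
cow = Cow(real)
cow.write(1, b"new")     # what "fsck -y" would do
looks_ok = cow.read(1)   # the COWed device already shows the new data
cow.abandon()            # changed our mind: real[1] is still b"old"
```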

Summary

I hope the above gives an idea of what I'm talking about; otherwise,
yell.

The basic idea is LEGO-inspired: you have a number
of bricks and you put them together as you like. All the various
commercial systems I have tried impose a hierarchy on the methods
they provide (pdisk, disk, subdisk, plex, volume for instance), and I don't
like that straitjacket. If I feel like mirroring before I partition,
I should be allowed to do that. I probably have a reason for
wanting to.

It's about providing tools, not policies really...

I actually had a prototype of this running, but it suffered badly
from "second-system syndrome", so a fresh start should be made. I am
unlikely to have the time for it unless I find a sponsor.