Filesystems

This is part of a series of articles that covers the
booting of an OSR5 machine. See Booting
OSR5 for other related articles.

Just about anything you are going to do with a computer involves
files. Unix in general treats just about everything as a file:
devices, directories, named pipes: they are all just files. Other
OS's don't necessarily do that; for example, there's no visible /dev
directory on a Windows system, though if you are programming in C
you can open "/dev/lp" and print to it just as you
would in Unix! Besides that, though, Windows doesn't usually let you
have direct access to its devices: a tape backup program, for
example, knows how to write to your tape drive, but you'd have no
way to directly access it yourself as you do in Unix. The concept
of "everything is a file" is one of the distinguishing
characteristics of Unix systems. Files are created on filesystems,
filesystems are created on divisions (SCO's terminology) and
divisions are created within partitions.

Partitions

Partitions are created with "fdisk". That's true whether you are
running SCO, NT or Linux. You might use multiple partitions so that
you can "dual boot": run more than one OS on the same computer. It
doesn't matter which OS goes in which partition (though it's
usually advisable to install Windows first), but one of the
partitions must be marked "active": that's the partition that will
boot by default. DOS allows four partitions, one of which can be an
"extended partition". DOS, Windows, and Linux make use of extended
partitions; SCO does not.

There are two reasons for installing Windows before Unix or
Linux. The first is that older versions of Windows were a little
stupid about where their partition actually ended, and would
sometimes scribble a little extra data beyond where they should
have. We hope they've fixed that by now. The second is the disk
geometry problem, which still remains; see Disk Geometry below.

One thing NOT to forget: a lot of advice suggests leaving root fairly small. That's OK in and of itself, but keep in mind that on OSR5, /tmp is NOT a separate filesystem by default, and many things make heavy use of /tmp for (duh!) temporary files. I've had numerous "incidents" of mysterious failures on Linux systems from exactly this, and I've seen it a handful of times on SCO systems that I didn't install. Given the current cost of disk space, I routinely give 2-4GB for root. It's massive overkill, but I then have one less thing to worry about. Even my home Linux box has a 2GB root and is typically only about 40% used, which is fine by me; cheap insurance.

Disk Geometry

A hard drive is composed, of course, of cylinders, heads, and
sectors, but nowadays that is more fiction than reality. In other
words, the actual physical design of the disk is unimportant and
may not even be known, and it can be accessed in numerous different
combinations of cylinders, heads and sectors per track that will
calculate out to the proper size. The problem is that DOS/Windows
have certain specific ideas about what sort of numbers to use, and
that may not be what the Unix/Linux fdisk would choose in the
absence of other information. However, if DOS/Windows is
installed first, then the Unix fdisk knows what geometry was used
and can match it.
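
To make the "numerous different combinations" point concrete, here is a small Python sketch. The three geometries are made-up values, picked only because they multiply out to the same size, and the 512-byte sector is an assumption (it is the usual size, but nothing above states it).

```python
SECTOR_BYTES = 512  # assumed sector size


def capacity_bytes(cylinders, heads, sectors_per_track):
    """Size of a drive described as cylinders x heads x sectors per track."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


# Hypothetical geometries that all describe the same amount of disk.
GEOMETRIES = [(1024, 64, 32), (2048, 32, 32), (4096, 16, 32)]

for chs in GEOMETRIES:
    print(chs, capacity_bytes(*chs))
```

Every line prints the same byte count, which is why two fdisk programs can disagree about the geometry and still both be "right" about the size.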

Lost Stuff

Should you find yourself with a drive that has lost its concept
of geometry, you can force it with the "dparam" command: see "man
dparam". Boot programs can be
restored with "instbb". If you can't boot, of course, you'll need a
boot floppy or something else: see Emergency Boot.

Multiple Unix Partitions

You can create more than one Unix partition within fdisk, but
you generally wouldn't need to unless you will need more
filesystems than will fit in the first partition. Each Unix
partition can be divided into up to 7 file systems, but since the
first partition usually has swap, boot and recover as well as root,
that only leaves 3 additional filesystems. If you will need more,
you'll need multiple Unix partitions (or, of course, just more
physical drives, each of which in turn can have 4 fdisk partitions
and 7 filesystems within each partition).

If you have multiple partitions, you'll need to run divvy on
them AFTER the installation. The command would be "divvy /dev/hd02"
for the second partition on your primary disk, "divvy /dev/hd12"
for the second drive, etc. See "man HW hd" for more examples. You'd
then create and name filesystems, and then run "mkdev fs" to finish
up. See Adding another hard
drive for more details.
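
The two device names above suggest a pattern: the first digit is the drive and the second is the fdisk partition. The Python sketch below just spells that out. The helper name is hypothetical, the pattern is inferred from the two examples in the text, and the single-digit drive limit is my assumption; check "man HW hd" before relying on it.

```python
def hd_partition_device(drive, partition):
    """Build a name like /dev/hd02 from a drive number and a partition number.

    Pattern inferred from the examples in the text: hd02 is partition 2 on
    drive 0, hd12 is partition 2 on drive 1.
    """
    if not 0 <= drive <= 9:          # assumed: single-digit drive numbers
        raise ValueError("drive must be 0-9")
    if not 1 <= partition <= 4:      # four fdisk partitions per drive
        raise ValueError("partition must be 1-4")
    return "/dev/hd%d%d" % (drive, partition)


print("divvy", hd_partition_device(0, 2))
print("divvy", hd_partition_device(1, 2))
```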

Remember, if you are using divvy on a disk with existing data,
DO NOT use the (c) option to create filesystems. Simply name the
divisions you want to access. You have to use a name that doesn't
already exist; you can't use "root", but "oldroot" is fine. Also,
"mkdev fs" is not destructive in any way: it will create and
populate /lost+found (see below) if necessary, but it will not harm
anything; it will just add the filesystem to /etc/default/filesys
so that it can be mounted. You can do this manually if you
wish.

Bela Lubkin summed up divvy in a newsgroup posting:

From - Sat Sep 29 07:10:25 2001
Newsgroups: comp.unix.sco.misc
From: Bela Lubkin
Subject: Re: divvy

Lodo Nicolino wrote:
> I would like to know where divvy write the information. Ex: block,
> filesystem name etc

Please see these Technical Articles:
http://aplawrence.com/cgi-bin/ta.pl?arg=106296
http://aplawrence.com/cgi-bin/ta.pl?arg=104384
http://aplawrence.com/cgi-bin/ta.pl?arg=107180

I'll add one thing, which is probably covered in one of those articles
but worth mentioning again: what you see when you run `divvy` is
actually a compendium of information compiled from three different
sources.

First, there's the divvy table on the partition. This tells us the
start and end block numbers of each division.

Second, `divvy` searches the /dev directory on the active root
filesystem, looking for device nodes whose major/minor numbers
correspond to those of the various divisions being looked at. For
instance, on an OSR5 root partition, /dev/root is usually device 1/42.
When you run divvy it does not find the string "root" in the division
table. It computes that the device number of division #2 on this
partition would be 1/42. Then it looks in /dev, notices /dev/root is
1/42, and displays "root" in the on-screen table. This is significant
because if you boot off a recovery floppy, it will only know the device
names of your divisions if their device nodes have been copied to the
floppy.

Third, it actually _reads_ the first few K bytes of each of the
divisions in order to comment on what _type_ of data is present. In the
case of 1/42, it opens /dev/root, reads a bit, and (in most cases)
determines that it's an HTFS filesystem. So it displays "HTFS". It
reads /dev/boot and learns that it's "EAFS"; it reads /dev/swap and
doesn't recognize it as any particular filesystem type, so displays "NON
FS".

When you _change_ division start/end points, divvy writes the new
information to the division table at the beginning of the partition.

When you change division _names_, divvy deletes /dev/oldname and
/dev/roldname and creates /dev/newname, /dev/rnewname with the right
device numbers.

When you "change the type" of a division in divvy, it has no effect.
Only when you also tell it to "create a new filesystem" does it do
anything. Then, when you tell it to act on your wishes (i.e. when you
q[uit], i[nstall]), it runs `mkfs` to create a filesystem of the
requested type. Assuming that succeeds, next time you enter divvy it
will show the new type. (This of course destroys the previous contents
of the filesystem; as would changing a division's start/end points. Be
careful while experimenting with divvy!)

>Bela<

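The second source Bela describes, finding names by matching major/minor numbers against the nodes in /dev, is easy to imitate. This Python sketch is only an illustration of that lookup, not divvy's actual code; the function name is hypothetical, and it reports block and character nodes alike (which is how a /dev/root and /dev/rroot pair would both turn up).

```python
import os
import stat


def find_device_nodes(major, minor, dev_dir="/dev"):
    """Return the paths in dev_dir whose device numbers are major/minor."""
    matches = []
    for name in sorted(os.listdir(dev_dir)):
        path = os.path.join(dev_dir, name)
        try:
            st = os.lstat(path)
        except OSError:
            continue  # node vanished or is unreadable; skip it
        # Only block and character special files carry device numbers.
        if stat.S_ISBLK(st.st_mode) or stat.S_ISCHR(st.st_mode):
            if (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor):
                matches.append(path)
    return matches


# On an OSR5 root disk, 1/42 would be expected to list /dev/root.
print(find_device_nodes(1, 42))
```

If the list comes back empty, you are in the recovery floppy situation Bela mentions: the division exists, but nothing in /dev gives it a name.
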
File Systems

With a 5.0.x or Unixware install, you are going to have at least
two "real" filesystems: /stand and / (root). You'll also have
swap and recover, and possibly a
"scratch" filesystem. Both recover and scratch are related to
"fsck" which is discussed further below.

An HTFS filesystem can be as large as one terabyte. A Unixware 7
file system can also be that large. However, on OSR5 an individual
file is limited to 2GB, while Unixware allows a file to be 1 TB for
at least some purposes (not all Unixware commands can handle files
over 2GB). Current Linux systems also limit file size to 2GB.
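
Where does 2GB come from? The usual explanation (my assumption here; the text above doesn't say) is that file offsets are held in a signed 32-bit integer, so the largest offset is 2^31 - 1. The Python sketch below uses the struct module purely to show where that boundary falls.

```python
import struct

MAX_SIGNED_32 = 2**31 - 1  # 2,147,483,647 bytes, i.e. just under 2GB


def fits_in_signed_32(offset):
    """True if offset can be stored in a signed 32-bit integer."""
    try:
        struct.pack("<i", offset)
        return True
    except struct.error:
        return False


print(fits_in_signed_32(MAX_SIGNED_32))      # last offset that fits
print(fits_in_signed_32(MAX_SIGNED_32 + 1))  # one byte further does not
```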

Reformatting drives to refresh the format

I don't think I've heard any mention of "low level format" in many a year now, but there was a time when that phrase was tossed around loosely and inaccurately.

This old post spelled out the reality of disk design and is interesting in a historical context.

This reminded me of another utility from that era: Gibson Research's SpinRite. Hard drives have become so reliable that I was really surprised to see that Gibson Research still sells that product! Their FAQ page has this note:

No software of any sort can truly low-level format today's modern drives. The ability to low-level format hard drives was lost back in the early 1990's when disc surfaces began incorporating factory written "embedded servo data". If you have a very old drive that can truly be low-level reformatted, SpinRite v5.0 will do that for you (which all v6.0 owners are welcome to download and run anytime). But this is only possible on very old non-servo based MFM and RLL drives with capacities up to a few hundred megabytes.

I hope no one still has those drives!

Newsgroups: comp.unix.sco.misc
From: Bill Vermillion
Subject: Re: How do I low-level-format an IDE Drive?
Date: Fri, 9 Apr 1999 15:38:56 GMT

Tony Earnshaw wrote:
>Frank Overstreet wrote:
>> ... Now I want to low level format the drive and am wondering
>> if the Western Digital utility wddiag.exe is what I need. When
>> the readme describes writing all 00's is that the same as a
>> low-level-format. If not please help.
>If you even attempt to low-level format a modern IDE disk, you'll ruin
>it. This has long been the case. Writing all 00s is not the same as a
>low-level format, it's what it says.
>Low level formatting was possible, after manufacture, with (now) almost
>prehistoric drives to remagnetize surfaces (thereby removing corruption
>and thus sometimes repairing some bad spots on the surfaces). These
>drives did not carry translation tables, as modern drives do.

The old wives tale of 'remagnetizing' the drives is just a myth.
Magnetic media is quite stable - it's the environment that does
them in re-acting with binders in coated media. Media for the most
part is plated/sputtered today, so the only problem is the decaying
of the particles. It just doesn't happen - at least in a computers
lifetime. This excludes catastophic events of course, and high
heat levels - above 150F you are going to have problems.

One of the ways the myth seemed to get started was on the old MFM
drives of the ST-506 heritage. These were all 'stepper' drives.
eg - a motor turned x degrees and ratchets the head across the
drive surface. (In the floppy arena it was typically to have
to re-aline a 5.25" disk every 6 months when used in heavy duty
service. I did that but was pushing them 24x7x365. The first
drives would last about a year, and when the 1/2 heights came out
you could expect 4 years approximately - MTBF was about 20,000
hours for those).

The mechanism would wear over time and when the drive was issued
commands to pulse/step the drive to the track, after a time it
the head would not be positioned exactly in the center of the track
set by the original format, and a reformat would then bring back
the performance as first seen as the platter to stepper were now in
sync with the worn portions.

To try to improve performance embedded servos were being used.
This was a servo burst in between sectors. Doing a real low-level
format meant the drive had to go back to the factor for a new
format and servo. It was expensive. Typically the servo looked
like a 'wedge' if you viewed it magnetically as the outer tracks
had the bits spread further apart.

Then came the dedicated servo drive - with the bottom platter
being used only for servo. This is why you'd see drives
with and even number of platters, but one less than the total
for data.

These are the drives that perform the thermal recal because as the
enivornment changes the metals contract and expand and the bottom
head is controlling the position of all other heads on the stalk.

Current technolgy is embeded servo again - but there's no way a
user can screw these up - as the old drives were controlled by
cards external to the drive, and the new ones are integral to the
units.

This eleminated thermal recallibration, ZBR (zone bit recording)
gives a different number of sectors available on different track
groups.

Low-level reformating really needs never to be done. Worst case -
to get rid of some pesky droppings by some ill-behaved program, or
programming concept, would be the destructive verify in the
controller.

But 'reformatting to refresh the format' is something left over
from DOS circa 1985.
--
Bill Vermillion bv @ wjv.com

The 1024 cylinder limit

Typically, the BIOS can only access the first 1024 cylinders of
the drive. This is not a problem once Unix is up and running, but
because the built-in BIOS is used for the first stages of the boot
process, it's usually important to be sure that the /stand
filesystem will always fall under that limit. Remember, this is a
limitation of the BIOS, not of SCO Unix.
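
How much disk 1024 cylinders covers depends on the geometry in use. The Python sketch below works that out for two geometries that were common on PCs (16 heads x 63 sectors and 255 heads x 63 sectors); those figures and the 512-byte sector are my assumptions, so substitute whatever your own fdisk reports. The second function is a rough check of whether a given absolute sector, such as the last sector of /stand, falls under the limit.

```python
SECTOR_BYTES = 512          # assumed sector size
BIOS_CYLINDER_LIMIT = 1024  # the limit described above


def bios_reachable_bytes(heads, sectors_per_track):
    """Bytes that lie within the first 1024 cylinders for this geometry."""
    return BIOS_CYLINDER_LIMIT * heads * sectors_per_track * SECTOR_BYTES


def ends_below_limit(end_sector, heads, sectors_per_track):
    """True if 0-based absolute sector end_sector is in a cylinder below 1024."""
    cylinder = end_sector // (heads * sectors_per_track)
    return cylinder < BIOS_CYLINDER_LIMIT


print(bios_reachable_bytes(16, 63))   # small, untranslated geometry
print(bios_reachable_bytes(255, 63))  # typical translated geometry
```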

In most cases nowadays, that's where I leave it: a small /stand
and the rest of the disk or partition as one big file system.

File System Sizing

There are, however, reasons to have separate filesystems:

Space

There isn't space on the hard drive to fit
everything you need. This, of course, was very common in the days
of small hard drives, but seldom is an issue nowadays. It certainly
isn't likely to be an issue during the installation- nothing you'd
install initially requires anything but a fraction of even a 4 gig
disk, and you are probably using something even larger.

After the install is a different story, of
course. You may want to install all kinds of things that take up
all kinds of space, and you may need or want another drive to hold
it all. For example, you might want to install SKUNKWARE. Normally
this installs to /var/opt/K/SKUNK98 (or SKUNK99, etc). You could
force that to go elsewhere by making /var/opt/K/SKUNK99 a symbolic
link pointing to another filesystem. See Secondary drives and Disk space for more ideas about
that.
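
The symbolic link trick can be tried safely in a scratch directory first. This Python sketch does the equivalent of moving the directory and running "ln -s" by hand; the function name is hypothetical, and the paths are built under a temporary directory so nothing real is touched.

```python
import os
import shutil
import tempfile


def relocate_with_symlink(original, new_home):
    """Move a directory elsewhere and leave a symlink at the old path."""
    shutil.move(original, new_home)   # new_home must not exist yet
    os.symlink(new_home, original)    # old path now points at the new one


# Stand-ins for /var/opt/K/SKUNK99 and a directory on another filesystem.
base = tempfile.mkdtemp()
original = os.path.join(base, "var", "opt", "K", "SKUNK99")
new_home = os.path.join(base, "u", "SKUNK99")
os.makedirs(original)
os.makedirs(os.path.dirname(new_home))
with open(os.path.join(original, "README"), "w") as f:
    f.write("payload\n")

relocate_with_symlink(original, new_home)
print(os.path.islink(original), os.listdir(original))
```

Anything that opens files through the old path keeps working, while the disk space is consumed wherever the link points.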

Related to that is the situation where you WANT
the data on a separate drive for performance reasons, or because
the other drive is going to be physically removable, etc.

You want to control how much data gets put on a
drive. For example, in some environments, I'll make
/var/spool/lp/temp a small filesystem of its own. This causes it to
fill up if there are too many unfulfilled print jobs, which calls
attention to the problem before it really gets out of hand and
fills up something more important. The idea here is that it's
better not to print than not to work at all.

A similar concept might apply to temporary
directories, but keep in mind that the booting system is apt to
need those too, particularly /tmp and possibly even /usr/tmp, so
those need to be available during boot and in single user mode.

Backups

You can't fit an entire drive on your backup
media, and want to keep volatile data on separate filesystems to
make it easier to use archaic backup programs like "dump". This is
unlikely to be a problem nowadays except for the very largest
systems. Unless it is simply impossible to do it because of time or
medium constraints, you should ALWAYS be backing up everything
every day. If you can't do that, you still probably do not want to
be using programs like dump that depend on organization into
separate filesystems. Modern backup utilities let you easily
specify what to include and what to leave out. While on this
subject, I'll mention in passing that while incremental backups
(backing up data that has been modified today) are tempting, they
are not fun when it comes time to restore. Aside from the physical
annoyance of having to restore multiple tapes, you are also likely
to end up restoring data that was actually deleted and should not
be restored. See /Reviews/supertars.html
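
The "restoring data that was actually deleted" problem is easy to see in a toy model. In this Python sketch a backup is just a dictionary of file names and contents; the file names are invented, and the model assumes incrementals record only files modified that day and carry no record of deletions (some backup tools do track deletions, so check yours).

```python
def restore(full_backup, incrementals):
    """Restore a full backup, then lay each incremental over it in order."""
    restored = dict(full_backup)
    for incremental in incrementals:
        restored.update(incremental)
    return restored


# Monday: full backup of two files.
full_backup = {"orders.db": "v1", "old_report.txt": "v1"}

# Tuesday: orders.db is modified and old_report.txt is deleted.
disk_on_tuesday = {"orders.db": "v2"}
tuesday_incremental = {"orders.db": "v2"}  # only what changed that day

restored = restore(full_backup, [tuesday_incremental])
print(sorted(restored))          # old_report.txt is back
print(sorted(disk_on_tuesday))   # what was really on disk
```

The restore is correct for every file that changed, yet the result is not the disk as it stood on Tuesday.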

Upgrades

You want to be able to do upgrades or reinstalls
and leave some filesystems untouched. This remains a valid reason
for separating certain areas from others, but with the speed and
capacity of modern backup systems, it is hardly as compelling as it
used to be.

Fsck

You want to be able to clean the filesystem in
the event of a crash. Older large filesystems took a long time to
clean and fsck needed more memory than was likely to be present, so
it would need scratch files, which slowed it down further. It was
not at all unusual for these filesystems to get confused for no
particular reason; not from a crash, just because, so it was
obviously better to clean one or two small filesystems now and then
as opposed to having to clean one big filesystem every time this
happened. Linux filesystems still have that mentality, btw, and
will automatically run fsck after x number of boots and/or x number
of days. Older Sun filesystems would run it on EVERY boot. Modern
filesystems very seldom need to run fsck anyway, so this is not an
issue. I imagine it won't be an issue for Linux, either, once it
catches up in this area.

Partial drive failure

You want to contain the damage. On older systems,
it was often observed that if you had physical or electronic
damage, it was sometimes unrecoverable by fsck, but that it was
very apt to be confined to one filesystem. Therefore, spreading the
filesystems out made it more likely that more of your data survived
a crash. Again, this is unlikely to be an issue with modern
filesystems, and as we tend to back up more data more often from
and to more reliable media, it's even less important.

Treat data differently

John Dubois pointed out something I hadn't
thought of:

Using separate filesystems for certain data also means that you can use mount
options appropriate for that data - tmp, nolog, ronly, etc.


None of these issues may apply to you, and if they don't there's
nothing at all wrong with one large root filesystem.

But you can't..

Different OS's have different ideas about what needs to be
available as the system boots. OSR5, for example, is very fussy: it
must have /usr and /tmp, which means these CANNOT be separate
filesystems. People accustomed to other Unix versions sometimes get
themselves in trouble here. With OSR5, don't relocate system
directories elsewhere. If you want to make /u, /home, or /users (not
/usr), that's fine, but leave the system stuff on the root.

On Unixware 7, /tmp is usually a ram disk. Linux doesn't mind
you putting /usr elsewhere because it keeps its necessary programs
in /sbin. In short, you have to know what's important and why, and
very often the install manuals don't help you very much.

Crashes and Other Problems

Modern file systems are highly resistant to damage. Although you
should shut down properly (man shutdown) before powering off, and
should have your machine attached to a UPS so it doesn't go off
unexpectedly, it is very likely that absolutely nothing bad will
happen if shutdown isn't run.

FSCK

When there is a crash, or just a power off, the filesystem will
probably be marked "dirty", meaning that there was data in memory
that had not yet been flushed out to the disk when the system went
down. On older systems (and on Linux through at least December
1999) that situation required running "fsck" to check and repair
the damage.

Repairing is what "fsck" does, and it does it incredibly well.
It needs to be run on an unmounted filesystem, or if that's
impossible (root is always mounted), in single user mode. NEVER RUN
FSCK on root in multi-user mode. You'll very likely cause
irrecoverable damage. Think about what's happening here: fsck is
reading through all the data and directories on the disk, trying to
make sense of everything, and at the same time, some daemon running
in multi-user mode is changing things! You'd be lucky not to really
screw things up, and how often are you that lucky?

If you have a large filesystem and a small amount of memory,
fsck may need a "scratch" file. That scratch file may have already
been created for you during the original install, but fsck doesn't
necessarily know to use that automatically. There's a flag you can
add to /etc/default/filesys that will tell fsck to use that, but
you would have had to add that yourself (see "man fsck"). Most
likely, though, you've been ambushed- you are booting a crashed
system, fsck says it wants a scratch file, and you don't know if
you have one. What do you do? If it's the root file system that
fsck is working on, throw a floppy in (it doesn't have to be
formatted for Unix, but it must not have bad blocks) and
tell it to use /dev/fd0. If it's a secondary filesystem, you can
just tell it the name of a file on the root filesystem; /tmp/scratch
is fine.

The "recover" filesystem mentioned above is used by fsck to save
its output in the case of an autoboot after a crash where it runs
automatically (assuming that you've said that's OK in
/etc/default/boot). You can see this data get picked up in
/etc/rc.d/9/reserved.

The lost+found directory (each filesystem needs its own
/lost+found) is used by fsck as a place to put files that are valid
(have valid disk blocks allocated and a non-zero reference count)
but somehow aren't listed in any directory. Ordinarily, these would
be temporary files caused by a common programming trick of opening
a temporary file, and then immediately removing it. The file
remains "open" for the program to read and write to, but the data
blocks will be immediately removed when the program exits- unless,
of course, the system crashes. So, normally, things you find in
lost+found won't be anything you need or care about, but in the
rare case of more serious problems, they could be. If you can
identify what the files are (they'll only be identified by their
original inode number) you can move them back where they belong. If
an entire directory ends up in lost+found, then the files within it
will have names (because, of course, the names are stored in that
directory) and that may be helpful in determining where the
directory belongs.
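
The "open it, then immediately remove it" trick mentioned above looks like this in Python. It is only a sketch of the general technique (the function name is mine): the file has no name in any directory after the unlink, yet the program can still write to it and read it back. If the system went down while such a file was open, that nameless inode is exactly what fsck would drop into lost+found.

```python
import os
import tempfile


def unlinked_scratch_demo():
    """Create a scratch file, remove its name, and keep using it."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "w+") as scratch:
        os.unlink(path)                          # the name is gone now
        name_still_exists = os.path.exists(path)
        scratch.write("scratch data\n")          # but the file still works
        scratch.seek(0)
        data = scratch.read()
        # Number of directory entries pointing at the inode (typically 0).
        link_count = os.fstat(scratch.fileno()).st_nlink
    # Leaving the "with" block closes the file; only then are blocks freed.
    return name_still_exists, data, link_count


print(unlinked_scratch_demo())
```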

It is fairly rare to see fsck needed on modern HTFS filesystems.
This is because the driver keeps logs of work it needs to do with
regard to data blocks that need to be written, and it can quickly
read those logs and determine exactly what, if anything, needs to
be done. In fact, fsck doesn't usually do any more than examine
those logs- if you need to force it to really do its work, you'll
need to run it as "fsck -ofull".

In the rare event that there is damage that isn't fixed
automatically, you may need to run "fsck -ofull" manually. In such
situations, consider adding the "-y" flag to give default answers
to any and all questions that fsck may ask. It is unlikely that you
have better knowledge or ability to fix any problem and in fact
answering "n" to any question is just going to leave you with
unresolved issues that you probably need exotic knowledge and lots
of experience to fix manually (with fsdb or something like it). For
most of us mortals, the best choice is to let fsck make its own
decisions.

FSDB

The File System Debugger is "fsdb". As alluded to above, someone
with deep expertise could repair a damaged filesystem by hand using
it. Real use of this requires intimate knowledge of filesystem
design, but even those of us who'd rather not know that much about
the internals will find it useful now and then. For example,
suppose you suspect that a certain large file is excessively
fragmented. To find out, you just need its inode number (get that
from "ls -li") and what filesystem it is on. Let's say it is inode
number 54618 and it is on the root filesystem:

echo "54618i" | fsdb /dev/root

That dumps the entire inode data, including the locations of the
data blocks. To follow these blocks all the way, you need to
understand their structure; see the article bfind.c: Finds which file contains a block
for an introduction to that. Steve Pate's Unix Internals tells a more complete
story of OSR5 filesystems.
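
If you'd rather not copy inode numbers by hand, the lookup half of this can be scripted. The Python sketch below gets the inode the same way "ls -li" does and only builds the command line shown above as a string; it does not run fsdb. The function name is hypothetical, and you must supply the device yourself, since the sketch makes no attempt to verify that the file really lives on it.

```python
import os


def fsdb_inode_command(path, device):
    """Return the fsdb command line that would dump the inode of path."""
    inode = os.lstat(path).st_ino   # same number "ls -li" prints
    return 'echo "%di" | fsdb %s' % (inode, device)


# Example: the current directory, assumed here to be on the root filesystem.
print(fsdb_inode_command(".", "/dev/root"))
```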

If you want to play with fsdb, do so on a disposable filesystem.
You could use a floppy, or spare division or partition if you have
one.

Speaking of fragmentation, it seldom makes sense to "defrag" a
modern filesystem. In the first place, it's much less likely to be
fragmented because of its design. But more importantly, if this is
a typical multi-user system, consider that disk block requests are
coming from all the different users, and are likely to be scattered
all about the drive anyway. The OS and sometimes the underlying
hardware will do the best they can to arrange those requests for
the best access, but since the users will have different requests,
it isn't likely to help much to have files in sequential blocks. If
you still feel you need to defragment your drive, use one of the
Supertars to do it: just wipe
everything out and restore it, and everything will be sequential.

Mounting

Other than low level tools like fsdb, you generally have to
mount a filesystem to have any useful access to it. That can be
done manually, but it's usually easier to have it done
automatically from /etc/default/filesys. You can add entries
manually, or you can let "mkdev fs" do it for you. One disadvantage
of "mkdev fs" is that it doesn't allow you to add optional flags
you may want (such as to specify a scratch file- see "man filesys"
and "man mount" for other options). Nothing says you can't use
"mkdev fs" and then go in by hand to add what you need, though.

When a filesystem is listed in /etc/default/filesys, you can use
a shorthand form of "mount". For example, if /etc/default/filesys
has an entry for a filesystem whose mount point is /jazdrive, I can
just say "mount /jazdrive" and "mount" will figure out what sort of
filesystem I have and mount it.

A damaged file system won't mount until it has been cleaned with
fsck, but if fsck won't do it, you might be able to mount it
read-only and at least recover your data with a quick backup. It's
certainly worth a try.

CDROM's

SCO Unix had a long history of CD confusion. Originally they only supported SCSI cd's, but even those had issues.

IDE support became available, but it was confusing to configure as the system did not auto-detect; you had to show it where the devices were connected. The IDE driver itself went through several iterations. See the "wd.delay" section of Booting OSR5 - Definitions for some of the EIDE issues.

Strange problems likely had something to do with a timing issue in the wd driver.

Even as late as Windows 2000 there were incompatibilities and the early versions of OS X didn't create Windows readable CD's by default.

If you want portability for older systems (and don't need file system specific metadata), just use flat ISO-9660.

.slog0000

On non-root HTFS filesystems, there's a very interesting and
unusual file that is usually invisible, or almost invisible, and
that's one of the strange things about it. You can't see it unless
you are completely specific:

ls -l .slog0000

(remember, you only have this on non-root HTFS filesystems). If
you try wild cards, you won't see it. You can't read it, you can't
remove it- it's metadata- that is, it is supposed to be there for
the use of the file system drivers only (I have no idea why it
isn't on the root filesystem). Its stated purpose is to speed up
synchronous writes.

It is possible (under really strange conditions) for this file
to become corrupt and/or visible (visible under ordinary listings).
This can cause backup programs (which ordinarily wouldn't even
notice it) to complain that they cannot open it, or (depending on
the design of the program) even to fail entirely. The OSS497C and
OSS497D patches supposedly prevent this from happening on
OSR5.0.4/5.

If this does happen, the file system can be unmounted and then
remounted using an undocumented "-o noslog" argument. When mounted
in this way, the .slog0000 file can be removed, and when the
filesystem is remounted normally, it should be recreated.

Bad Blocks

Most modern drives are extremely reliable, and the better drives
will even automatically and transparently map out bad blocks so you
never even know anything happened. When they see that a disk block
is starting to fail, they copy data to a spare block, re-jigger
their internal tables so that any reference to the old block now
goes to the new one, and lock out the old bad block forever. That can
all be transparent to you. For drives without that feature, you use
"badtrk", and like fsck, you use it in single-user mode. It lets
you scan your drive non-destructively and will try to recover data
from bad blocks.

I'd say with the current state of the art, it's OK to have a bad
block or two. If you are seeing more than that, though, you have a
defective drive. A quick way to check for unreadable blocks if you
can't unmount right now is to use dd. For example, to check my jaz
media, I might do:

dd if=/dev/jaz of=/dev/null bs=1024k

If there are any unreadable blocks, I'll get error messages. But
"badtrk" is a better way to do this.

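The same read-everything idea can be sketched in Python if you want the offsets of the failures collected for you. This is an illustration only, not a replacement for "badtrk"; the function name and the error cap are my own choices. Unlike plain dd, which stops at the first read error, this skips ahead one block and keeps going.

```python
import os


def scan_readable(path, block_size=1024 * 1024, max_errors=100):
    """Read path from start to end; return (bytes_read, bad_offsets)."""
    bad_offsets = []
    total = 0
    offset = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while len(bad_offsets) < max_errors:   # give up on hopeless media
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                bad_offsets.append(offset)     # note the failure, skip a block
                offset += block_size
                continue
            if not chunk:
                break                          # end of file or device
            total += len(chunk)
            offset += len(chunk)
    finally:
        os.close(fd)
    return total, bad_offsets


# Usage (hypothetical device name taken from the dd example above):
# print(scan_readable("/dev/jaz"))
```

An empty list of offsets means every block could be read; it says nothing about blocks the drive has already remapped on its own.
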