Metadata, The Mac, and You

Given the previous discussion of common
file metadata storage implementations, and retaining the last shreds of
your "fundamental concepts" thinking cap, how would you expect file type
metadata to be stored?

We've already seen how many different types of metadata are stored.
We've seen how essential metadata, both immutable (size) and
independent (name and location), is necessarily woven into the fabric
of file system implementations. We've also seen how independent,
non-essential metadata like file permissions and file dates is stored
in dedicated metadata structures in the file system.

Now we've got a piece of immutable metadata that's
technically non-essential,
but that is of great importance to users' interaction with files.
Before considering storage implementations for file type, let's examine
why this piece of metadata is so important.

File type is classified as non-essential metadata because the file's
data can be stored and retrieved without reference to the file type.
The data itself is what we're really interested in, after all. The
operating system may make decisions based on other available metadata
(checking permissions for access rights, checking dates when running
backups, etc.), but manipulating a file's contents requires only the
data itself, located with the combination of the file location and name,
and read based on the file's extent and data traversal path.

File type enters the picture when a user decides to manipulate a
particular file directly. In today's dominant computing paradigm, an
application is required if the user wants to view or edit a file. The
application itself may only need the file's data, but choosing
which application to use depends on the file's type, format,
content type, or whatever you want to call the actual nature of the
bunch of bits that compose the file's data. Is it an image? If so, an
image editor application may be a logical choice. If it's an audio
file, a different application may be more appropriate, and so on.

The user may choose the application himself (e.g. opening the file
from within an application), in which case the file type must be
available to the user if he is to know which file to open in which
application. In the GUI paradigm, the process of choosing which
application to use to manipulate a particular file (often called
"application binding") can also be handled by the operating system. The
user simply indicates his desire to open a file (by double-clicking the
file, traditionally) and the operating system looks at the file's type
and chooses an appropriate application.
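
To make "application binding" concrete, here is a minimal sketch of
the lookup an operating system might perform on a double-click. The
table contents, the application names, and the function names are all
hypothetical, chosen only for illustration; no real operating system's
binding mechanism is being described.

```python
# Hypothetical binding table: specific file type -> application name.
BINDINGS = {
    "JPEG": "Image Editor",
    "WAV": "Audio Player",
    "TEXT": "Text Editor",
}


def choose_application(file_type, bindings=BINDINGS):
    """Return the application bound to file_type, or None if unbound."""
    return bindings.get(file_type)


def open_file(name, file_type):
    """Simulate a double-click: only the type metadata is consulted,
    never the file's data."""
    app = choose_application(file_type)
    if app is None:
        return f"No application is bound to type {file_type!r}; ask the user."
    return f"Opening {name!r} with {app}"


print(open_file("vacation photo", "JPEG"))
print(open_file("mystery", "XYZ"))
```

Note that the lookup never touches the file's contents. Everything
hinges on the type metadata being present and correct.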

It's useful to examine exactly what can be stored regarding file
type. Broad file types like "image" or "audio" are useful for
organizational purposes, but when it comes down to an application
reading a file's data and correctly interpreting it, more specific file
type metadata such as "JPEG" or "WAV" becomes necessary. In some cases,
even more detail may be required. Just identifying a file as a
"Microsoft Word document", for example, might not be enough to determine
if a particular version of Word can open this file.

So before we even consider storage implementation, we must decide
what, exactly, we're going to store. In the case of file dates, the
only real choice is that of resolution: days, seconds, or milliseconds.
File size is a similar situation: blocks, bytes, or bits. File names
and locations will only vary by length and possibly encoding (ASCII,
Unicode, MacRoman, etc.). Permissions and ownership metadata is
determined by the security model of the OS: user/group id numbers,
permission bit masks, access control lists, etc. But there is a
tremendous range of possible storage formats for file type
metadata.

In practice, the data stored is usually somewhere between the more
general "image" and the very specific "Photoshop 3.0 document." Given
this level of specificity, reasonably intelligent decisions can be made
about which applications can read and understand a particular
file.

Now that we've decided what file type data to store (in the broadest
sense, anyway), we can finally consider where to store this data.
Again, to refresh your memory, file type is immutable, non-essential
metadata that plays a particularly important role in the user interface.
We've seen some immutable metadata (size) woven into the fabric of the
file system, and another piece (modification date) stored in the
dedicated metadata structures of the file system. Independent,
non-essential metadata (file permissions, creation date, etc.) has also
been stored in the dedicated metadata area. Where should we store file
type?

In the earliest implementations of file systems that stored file type
metadata, it was stored, like all other metadata, in a distinct, but
usually very small (only a handful of bytes, if that) file system
structure. The size constraints were a function of the cost of memory and
disk space in those days, and that necessarily affected the clarity of
the file type metadata in its raw form.

This is not necessarily a problem, however, since other pieces of
metadata are similarly constrained. File ownership metadata may be
stored as an inscrutable user id, for example, but that does not mean
that the user ever has to read user id "157" and remember that it
corresponds to the user "sally." Anywhere the user is likely to see a
representation of file ownership metadata, the operating system looks up
and displays the text "sally" in place of the user id that is actually
stored in the file system.

But early operating systems usually did, in fact, display file type
metadata exactly as it was stored: most often as a handful of characters
like "TXT" or "COM". Humans are reasonably good at using mnemonic
devices to map from obscure or truncated pieces of information to more
verbose representations. Remembering that "TXT" means "text file" is
much easier than remembering that "157" is "sally." Moreover,
displaying file type metadata "as stored" saved memory, CPU, and
programmer effort that would have been necessary to do a more verbose
expansion. So while file type metadata storage remained distinct, the
information was displayed in its raw form.
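
The contrast between the two display strategies can be sketched in a
few lines. This is an illustration only: the record layout, the field
names, and the user table are assumptions, not a description of any
particular early file system. The owner is stored as a number and
expanded on display, while the type code is shown exactly as stored.

```python
from dataclasses import dataclass

# Hypothetical user table; the id 157 and the name "sally" come from
# the example in the text.
USER_NAMES = {157: "sally"}


@dataclass
class Entry:
    name: str
    type_code: str  # stored as a few characters, e.g. "TXT"
    owner_id: int   # stored as an inscrutable number


def listing_line(entry, user_names=USER_NAMES):
    """Format one directory listing line.

    The owner id is looked up and expanded to a name; the type code is
    printed raw, with no expansion to something like "text file".
    """
    owner = user_names.get(entry.owner_id, str(entry.owner_id))
    return f"{entry.name:<8} {entry.type_code:<3} {owner}"


print(listing_line(Entry("LETTER", "TXT", 157)))
```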

Subsequent operating systems incorporated file type metadata as a
third component of the file identifier (along with location and
name). In order to specify a file completely, it was necessary to
provide the file's location, its name, and its type. This also
meant that several files could share the same name and location,
provided they had different types. The solution to this potentially
confusing situation was to simply combine file type metadata and file
name metadata during both editing and display by joining them with a
delimiter of some kind (usually "."), effectively nullifying the tacit
storage separation of file type metadata. And so, file name extensions
were born.
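
A small sketch may help here. It assumes a toy in-memory "file system"
keyed by the three-part identifier (location, name, type); the paths,
names, and helper functions are hypothetical. It shows two files
sharing a name and location, the combined display form, and what
happens when the delimiter is drawn from the same characters that are
allowed in the name.

```python
# Toy "file system": the full identifier is (location, name, type).
files = {
    ("/docs", "REPORT", "TXT"): b"plain text",
    ("/docs", "REPORT", "DOC"): b"formatted document",
}


def display_name(name, file_type, delimiter="."):
    """Join name and type for display and editing."""
    return f"{name}{delimiter}{file_type}" if file_type else name


def split_display_name(shown, delimiter="."):
    """Recover (name, type) by splitting at the last delimiter."""
    name, sep, file_type = shown.rpartition(delimiter)
    if not sep:
        # No delimiter at all: treat the whole string as the name.
        return shown, ""
    return name, file_type


# Same name and location, different types: two distinct files.
for (location, name, file_type) in files:
    print(location, display_name(name, file_type))

# The round trip works when the name contains no delimiter...
print(split_display_name(display_name("REPORT", "TXT")))

# ...but a typeless file whose name contains "." is misread, because
# the delimiter is "in-band" with the name's own characters.
print(split_display_name(display_name("VER1.2", "")))
```

Once the two fields are merged into one string, the split is only a
convention. Nothing in the combined form records where the name ends
and the type begins.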

Think back to the introduction of this section when you were asked
how you'd store file type metadata. Was your first thought to store
file type metadata by encoding it in the file's name, delimited by a
character chosen from the same character set as the file name itself?
If so, do you think your decision was influenced by thoughts of existing
implementations?

None of the decisions made during the process that led to the
creation of file name extensions seem unreasonable. The abbreviated
nature of the raw metadata was dictated by the storage constraints of
the day. The choice to expose file type metadata in its raw form was
made based on people's ability to deal with the chosen data format
(mnemonic abbreviations). The eventual combination of the display and
editing of file type metadata and file name metadata was a decision
that seemed to flow naturally from the constant side-by-side display of
file names and types in directory listings.

And yet look at the end result: a piece of immutable metadata is
combined with a piece of independent metadata (effectively, and
eventually literally) in a single storage location, delimited by
"in-band" data. File name extensions have been described as a "hack",
meaning an expedient (and often clever) solution to a problem that
cannot be solved as well or as quickly by other means. I disagree with
this description.

File name extensions did not solve an existing implementation
problem. File type metadata already had a dedicated storage
location in the file system. There was no implementation constraint
that necessitated the incorporation of file type metadata into the file
identifier, and there was no implementation constraint that necessitated
the encoding of file type metadata in the file name. Doing so did not
solve any existing problems, and actually caused many new ones of its
own (as we'll see shortly). The creation of file name extensions was
not a hack. It was a mistake.