17.1. Understanding Files

As far as a Java program knows, a file is a sequential set of bytes stored on a random access medium such as a hard disk or CD. There is a first byte in the file, a second byte, and so on, until the end of the file. In this way, a file is similar to a stream. However, a program can jump around in a file, reading first one part of a file and then another. This isn't possible with a stream.
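The difference can be seen in a minimal sketch built on java.io.RandomAccessFile, which can seek to any offset before reading. It assumes Java 7 or later for try-with-resources. The class name RandomAccessDemo and the helper byteAt are illustrative names, not part of any library.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

class RandomAccessDemo {

    // Returns the unsigned byte stored at the given offset, or -1 if the
    // offset is at or past the end of the file.
    static int byteAt(File file, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(offset);   // jump directly to the requested position
            return raf.read();  // read() returns -1 at end of file
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RandomAccessDemo <file>");
            return;
        }
        File file = new File(args[0]);
        long length = file.length();
        if (length == 0) {
            System.out.println("empty file");
            return;
        }
        // Read the last byte first and the first byte second: an order
        // that a forward-only stream could not provide.
        System.out.println("last byte:  " + byteAt(file, length - 1));
        System.out.println("first byte: " + byteAt(file, 0));
    }
}
```

Opening the file anew for every read is wasteful; a real program would keep one RandomAccessFile open and call seek() repeatedly.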

17.1.1. Filenames

Every file has a name. The format of the filename is determined by the operating system. For example, in DOS and Windows 3.1, filenames are at most 8 ASCII characters long, with an optional extension of up to 3 characters. README.TXT is a valid DOS filename, but Read me before you run this program or your hard drive will get trashed is not. All ASCII characters from 32 up (that is, noncontrol characters), except for the 15 punctuation characters (+=/\][":;,?*<>|) and the space character, may be used in filenames. Filenames are case-insensitive (though generally rendered as all capitals). README.TXT and readme.txt are the same filename. A period may be used only as a separator between the 8-character name and the 3-letter extension. Furthermore, the complete path to the file, including the disk drive and all directories, may not exceed 80 characters in length.

On the other hand, Read me before you run this program or your hard drive will get trashed is a valid Win32 (Windows 95 and later) filename. On those systems filenames may contain up to 255 characters, though room also has to be left for the path to the file. The full pathname may not exceed 255 characters. Furthermore, Win32 systems allow any Unicode character with value 32 or above in filenames, except \/*<>:?" and |. In particular, the +,;=][ characters, forbidden in DOS and Windows 3.1, are legal in Win32 filenames.

Win32 also makes short versions of the filename that conform to the DOS 8.3 format available to non-32-bit applications that don't understand the long filenames. Java understands the long filenames and uses them in preference to the short form.

Read me before you run this program or your hard drive will get trashed is not a valid Mac OS 9 filename because on Mac OS 9 file and directory names cannot be longer than 31 bytes. Volume names cannot be longer than 27 bytes. However, there's no fixed length to a full path name. The exact number of characters allowed in a name depends on the number of bytes per character used by the local encoding. Read me or your HD is toast contains only 27 bytes in most encodings and is thus a valid Macintosh file, directory, and volume name. Mac OS 9 filenames can contain slashes and backslashes (unlike Windows filenames) but may not contain colons. Otherwise, any ASCII characters, as well as 8-bit MacRoman characters like ® and π, can be used in a Mac filename.

Of course today most Mac users are running Mac OS X, which is a version of Unix. Just as Windows converts names to 8.3 filenames as necessary to support older applications, so too does Mac OS X convert really long filenames to shorter ones for older apps. Java programs running on Mac OS X only see the longer Unix style names.

Pretty much all modern Unix systems, including Linux and Mac OS X, allow at least 255 characters in a filename, and none of those 255 characters needs to be left for a path. Just about any ASCII character except the forward slash (/) and the null character (ASCII 0) is valid in a Unix filename. However, because Unix makes heavy use of a command line, filenames containing spaces, single quotation marks, double quotes, hyphens, or other characters interpreted by the Unix shell are often inconvenient. Underscores (which aren't interpreted by the Unix shell) are safe and often used in place of problematic characters (for example, Read_me_or_your_HD_will_be_trashed).

Character sets are an issue for filenames too. Some Unixes use ISO 8859-1, some use ASCII only, and some use Unicode. Worse yet, the names of the files can change from one user to the next depending on how they've configured their locale. American Mac OS 9 filenames are given in the 8-bit MacRoman character set, but internationalized versions of the Mac OS use different character sets. Mac OS X uses Unicode throughout. However, some bugs in Apple's Java implementation prevent it from reading or writing files whose names contain characters from outside the Basic Multilingual Plane. Windows 95 and later, fortunately, use Unicode exclusively, and it pretty much works. However, the reliable lowest common denominator character set for filenames is still ASCII.

Case sensitivity is a problem too. Readme.txt and README.TXT are the same file on Mac OS 9 and Windows but represent two different files on Unix. Mac OS X is basically Unix, but in this respect it's actually more similar to Windows and the classic Mac OS: Mac OS X filenames are case-insensitive. (Case-sensitive filenames are an option when a disk is formatted, but the default that almost everyone uses is case-insensitive.)

Handling different filename conventions is one of the difficulties of doing real cross-platform work. For best results:

Use only printable ASCII characters, periods, and underscores in filenames.

Avoid punctuation characters in filenames where possible, especially forward and back slashes.

Never begin a filename with a period, a hyphen, or an @.

Avoid extended character sets and accented characters like ü, ç, and é.

Use mixed-case filenames (since they're easier to read), but do not assume case alone will distinguish between filenames.

Try to keep your filenames to 31 characters or fewer, the Mac OS 9 limit.

If a filename can be stored in a DOS-compatible 8.3 format without excessive effort, you might as well do so. However, Java itself assumes a system on which files have long names with four- and five-character extensions, so don't go out of your way to do this.
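These guidelines can be approximated in code. The sketch below is one deliberately conservative reading of them, not a definitive rule set: it accepts only ASCII letters, digits, periods, and underscores, rejects names that start with a period, hyphen, or @, and caps the length at 31 so that names also fit the Mac OS 9 limit described earlier. The class name PortableFilenames, the method isPortable, and the exact character whitelist are assumptions made for illustration.

```java
class PortableFilenames {

    // Conservative length cap (assumption): Mac OS 9 allows at most 31 bytes.
    static final int MAX_LENGTH = 31;

    // Returns true if the name uses only characters that are safe on DOS-era
    // Windows, classic Mac OS, and Unix shells alike.
    static boolean isPortable(String name) {
        if (name.isEmpty() || name.length() > MAX_LENGTH) {
            return false;
        }
        char first = name.charAt(0);
        if (first == '.' || first == '-' || first == '@') {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            boolean letterOrDigit = (c >= 'A' && c <= 'Z')
                    || (c >= 'a' && c <= 'z')
                    || (c >= '0' && c <= '9');
            if (!letterOrDigit && c != '.' && c != '_') {
                return false;  // spaces, slashes, accents, and so on
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(name + ": " + isPortable(name));
        }
    }
}
```

A check like this is stricter than any single platform requires. It is meant for names a program generates itself, not for validating names a user has already chosen.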

17.1.2. File Attributes

Most operating systems also store a series of attributes describing each file. The exact attributes a file possesses are platform-dependent. For example, on Unix a file has an owner ID, a group ID, a modification time, and a series of read, write, and execute flags that determine who is allowed to do what with the file. If an operating system supports multiple types of filesystems (and most modern desktop and server operating systems do), the attributes of a file may vary depending on what kind of filesystem it resides on.

Many Mac files also have a type code and a creator code as well as a potentially unlimited number of attributes that determine whether a file is a bundle or not, is an alias or not, has a custom icon or not, and various other characteristics mostly unique to the Mac platform.

DOS filesystems store a file's last modification date, the actual size of the file, the number of allocation blocks the file occupies, and essentially boolean information about whether or not a file is hidden, read-only, a system file, or whether the file has been modified since it was last backed up.

Modern versions of Windows support multiple kinds of filesystems including FAT (the basic DOS-compatible filesystem) and NTFS (NT File System). Each of these filesystems supports a slightly different set of attributes. They all support a superset of the basic DOS file attributes, including creation time, modification time, access time, allocation size, file size, and whether the file is read-only, system, hidden, archive, or control.

Any cross-platform library like the java.io package is going to have trouble supporting all these attributes. Java can read a fairly broad cross-section of these possible attributes for which most platforms have some reasonable equivalent. It does not allow you easy access to platform-specific attributes, like Mac file types and creator codes, Windows' archive attributes, or Unix group IDs.
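As a rough illustration of that cross-section, the following sketch prints the attributes that java.io.File exposes on every platform. The File methods used are standard; the class name FileAttributes and the helper describe are illustrative names only.

```java
import java.io.File;
import java.util.Date;

class FileAttributes {

    // Collects the platform-neutral attributes java.io.File can report.
    static String describe(File f) {
        StringBuilder sb = new StringBuilder();
        sb.append("name: ").append(f.getName()).append('\n');
        sb.append("exists: ").append(f.exists()).append('\n');
        sb.append("directory: ").append(f.isDirectory()).append('\n');
        sb.append("hidden: ").append(f.isHidden()).append('\n');
        sb.append("readable: ").append(f.canRead()).append('\n');
        sb.append("writable: ").append(f.canWrite()).append('\n');
        // length() and lastModified() both return 0 for a nonexistent file.
        sb.append("length: ").append(f.length()).append('\n');
        sb.append("modified: ").append(new Date(f.lastModified())).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        for (String path : args) {
            System.out.print(describe(new File(path)));
        }
    }
}
```

Nothing here reports an owner, a group, a creator code, or an archive bit; those are exactly the platform-specific attributes that fall outside this common subset.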

17.1.3. Filename Extensions and File Types

Filename extensions often indicate the type of a file. For example, a file that ends with the four-letter extension .java is presumed to be a text file containing Java source code; a file ending in the five-letter extension .class is assumed to contain compiled Java byte code; a file ending in the three-letter extension .gif is assumed to contain a GIF image.

What does your computer do when you double-click on the file panther.gif? If your computer is a Macintosh, it opens the file in the program that created the file. That's because the Mac stores a four-letter creator code for every file on the disk. Assuming the application associated with that creator code can be found (it can't always, though), the file panther.gif is opened in the creating program. On the other hand, if your computer is a Windows PC or a Unix workstation, the creating program is not necessarily opened. Instead, whichever program is registered as the viewer of .gif files is launched and used to view the file. In command-line environments, like the Unix shell, this isn't really an issue because you begin by specifying the program to run (that is, you type xv panther.gif, not simply panther.gif) but in GUI environments, the program that's opened may not be the program you want to use.

File extensions have the further disadvantage that they guarantee nothing about a file's actual content, which makes them an unreliable means of determining its type. Users can easily change them. For example, the simple DOS command copy HelloWorld.java HelloWorld.gif causes a text file to be misinterpreted as a GIF image. Filename extensions are only as reliable as the user who assigned them. What's more, it's hard to distinguish between files that belong to different applications that have the same type. For instance, many users are surprised to discover that after installing Firefox, all their HTML files appear to belong to Firefox instead of Internet Explorer.

The Macintosh solved this problem over two decades ago. Almost every Mac file has a four-letter type code like "TEXT" and a four-letter creator code like "R*ch". Since each file has both a type code and a creator code, a Mac can distinguish between files that belong to different applications but have the same type. Installing Firefox doesn't mean that Firefox suddenly thinks it owns all your Internet Explorer documents. Software vendors register codes with Apple so that companies don't accidentally step on each other's toes. Since codes are almost never seen by end users, there's not a huge rush to snap up all the good ones like "TEXT" and "HTML". Overall, this is a pretty good system that's worked incredibly well for more than twenty years. Apple actually tried to get rid of it in favor of Unix/DOS style file extensions when they moved to Mac OS X, but backed down after massive outcries from developers and users alike. Neither Windows nor Unix has anything nearly as simple and trouble-free. However, because Windows and Unix have not adopted Mac-style type and creator codes, Java does not have any standard means for accessing them.

The com.apple.eio.FileManager class included with Apple's port of the JDK 1.4 and 1.5 provides access to Mac-specific type and creator codes and other file attributes. Steve Roy's open source MRJAdapter library (https://mrjadapter.dev.java.net/) provides this for almost every version of Java Apple has ever shipped.

None of these solutions is perfect. On a Mac, you're likely to want to use Photoshop to create GIF files but Preview or Firefox to view them. Furthermore, it's relatively hard to say that you want all text files opened in BBEdit. On the other hand, the Windows solution is prone to user error; filename extensions are too exposed. For example, novice HTML coders often can't understand why their HTML files painstakingly crafted in Notepad open as plaintext in Internet Explorer. Notepad surreptitiously appends a .txt extension to all the files it saves unless the filename is enclosed in double quote marks. For instance, a file saved as HelloWorld.html actually becomes HelloWorld.html.txt while a file saved as "HelloWorld.html" is saved with the expected name. Furthermore, filename extensions make it easy for a user to lie about the contents of a file, potentially confusing and crashing applications. (You can lie about a file type on a Mac too, but it takes a lot more work.) Finally, Windows provides absolutely no support for saying that you want one group of GIF images opened in Photoshop and another group opened in Paint.

Some algorithms can attempt to determine a file's type from its contents, though these are also error-prone. Many file formats begin with a particular magic number that uniquely identifies the format. For instance, all compiled Java class files begin with the number 0xCAFEBABE (in hexadecimal). If the first four bytes of a file aren't 0xCAFEBABE, it's definitely not a Java class file. Furthermore, barring deliberate fraud, there's only about a one in four billion chance that a random, non-Java file will begin with those four bytes. Unfortunately, only a few file formats require magic numbers. Text files, for instance, can begin with any four ASCII characters. You can apply some heuristics to identify such files. For example, a file of pure ASCII should not contain any bytes with values between 128 and 255 and should have a limited number of control characters with values less than 32. But such algorithms are complicated to devise and imperfect. Even if you are able to identify a file as ASCII text, how would you determine whether it contains Java source code or a letter to your mother? Worse yet, how could you tell whether it contains Java source code or C source code? It's not impossible, barring deliberately perverse files like a concatenation of a C program with a Java program, but it's difficult and often not worth your time.
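Both checks described above, the magic number and the ASCII heuristic, fit in a few lines. This is a hedged sketch rather than a robust detector: the class name ContentSniffer, its method names, the decision to tolerate tabs, line feeds, and carriage returns, and the caller-supplied threshold for other control characters are all assumptions chosen for illustration. It assumes Java 7 or later.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class ContentSniffer {

    static final int CLASS_MAGIC = 0xCAFEBABE;

    // True if the first four bytes, read big-endian, equal 0xCAFEBABE.
    static boolean hasClassMagic(byte[] header) {
        if (header.length < 4) {
            return false;
        }
        int magic = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                  | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        return magic == CLASS_MAGIC;
    }

    // Heuristic: no bytes of 128 or above, and at most the given fraction
    // of control characters other than tab, line feed, and carriage return.
    // An empty array trivially passes.
    static boolean looksLikeAscii(byte[] data, double maxControlFraction) {
        int control = 0;
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 128) {
                return false;
            }
            if (v < 32 && v != '\t' && v != '\n' && v != '\r') {
                control++;
            }
        }
        return control <= data.length * maxControlFraction;
    }

    // Reads up to max bytes from the start of the file.
    static byte[] readPrefix(File file, int max) throws IOException {
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[max];
            int total = 0;
            while (total < max) {
                int n = in.read(buf, total, max - total);
                if (n < 0) {
                    break;  // end of file reached before max bytes
                }
                total += n;
            }
            return Arrays.copyOf(buf, total);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            byte[] prefix = readPrefix(new File(path), 4096);
            System.out.println(path
                    + " class file? " + hasClassMagic(prefix)
                    + ", ASCII text? " + looksLikeAscii(prefix, 0.05));
        }
    }
}
```

A positive result from looksLikeAscii says only that the sampled prefix is plausible ASCII; it cannot tell Java source from C source or from a letter to your mother.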