Graham Klyne wrote:
> I've done some implementation work recently that figures a URI from a
> filename and vice versa. For Unix-like systems, I think the correspondence
> is pretty clear, but for Windows I needed to engage in some guesswork about
> how to deal with device (drive) names. Part of my code looks like this:
> [[
> -- strip off leading '/' from Windows drive name
> source = fileuripath (path uri)
> fileuripath ('/':file@(d:':':more)) | driveLetter d = file
> fileuripath file = file
> driveLetter d = d `elem` ['A'..'Z']
> ]]
> That is, on windows systems, FILE://localhost/D:/dir/file is treated as a
> reference to file D:\dir\file on the current host system. But other
> software I have seen in the past uses '|' in place of the ':'. I'm not
> sure what is the current preferred approach.
This topic is worthy of a separate thread, so I'm spinning it off now.
If we could get more implementers to agree on best practices for converting
OS-specific filesystem paths to URIs *and* vice-versa, it would be a Good
Thing.

First, a couple of references:

RFC 1738:
  http://www.ietf.org/rfc/rfc1738
  Summarizes the format of several URL schemes, two of which have been
  made obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto). Also provides
  generic URL syntax and related rules that have been superseded by RFCs
  1808 and 2396. The 'file' scheme defined here is still in effect.

RFC 1738bis (current draft):
  http://www.ietf.org/internet-drafts/draft-hoffman-rfc1738bis-02.txt
  (I think the date in it is supposed to say April 19, 2004, not 2003,
  since the previous draft was dated October 2003.) An attempt to
  preserve and update the URL scheme summaries that have not been made
  obsolete by RFCs 2616 (HTTP/1.1) and 2368 (mailto).

RFC 2396bis (current draft):
  http://www.gbiv.com/protocols/uri/rev-2002/rfc2396bis.html
  I think we are all familiar with this one.

Second, the 'file' URI scheme as defined in RFC 1738 leaves much up to the
interpreter. It might be argued that it has no choice but to leave things
ambiguous, because IETF RFCs apply only to the Internet, while the 'file'
scheme is defined as a non-Internet protocol. It might very well be beyond the
scope of an RFC to mandate how to derive a URI from an OS-specific path and
vice-versa (I've no idea whether this is the case; I'm just saying...).

Third, things are not as straightforward as you suggest, even on Unix.
When converting from any filesystem path to a URI, questions to consider
include:
- For what kind of filesystem / OS is the path?
  - Windows, MS-DOS
  - Unix / POSIX (Linux, FreeBSD, Solaris, Mac OS X, Cygwin, etc.)
  - legacy Mac OS (OS 9 and prior)
  - (etc.)
- If the path's filesystem / OS is not given, what do you do?
  - assume the path is appropriate for the local OS?
  - reject the path?
- If the path's OS is unsupported, what do you do?
  - reject the path?
  - use a default algorithm, like just prepending 'file:' and
    percent-encoding as required?
- Is the path 'absolute'?
  - If it's a UNIX path, whether it starts with "/" is the only
    qualification, I believe.
  - If it's a Windows path, it could be absolute if it matches the
    regular expression ^(\\|[A-Za-z]:) - that is, it either starts
    with "\" or a drivespec (an ASCII-range letter followed by ":").
- If the path is not absolute, e.g. it looks like 'the/path',
  - reject it?
  - create a relative URI reference? ('the/path')
  - create an RFC 2396bis-compliant, but RFC 1738-offending,
    URI like 'file:the/path'?
  - attempt to make the path absolute by interpreting it to be
    relative to the local host's 'current working directory', if
    such a concept exists in the local OS?
    What if the path is for some other OS?
    And do you make it absolute according to the OS's conventions
    first, or do you do an RFC 2396bis conformant resolution of
    a relative URI reference ('the/path') against the base URI
    that is derived from the current working directory?
- Do you attempt to collapse dot segments (or equivalent) in the
  path or in the resulting URI? Does it depend on whether the path
  or URI is absolute? A reason to collapse dot segments in an
  absolute URI is so that the URI can be suitable for use as a base
  URI for RFC 2396bis conformant resolution.
- Is the mapping between segments in the filesystem path and
  segments in the path component of the URI well-defined?
  - On Unix, it should be sufficient to percent-encode all
    non-unreserved characters. A POSIX filename cannot itself contain
    '/' (or NUL), so '/' only ever separates segments; still, be
    sure to apply percent-encoding to each segment individually so
    that the separators are left as-is.
  - On Windows,
- If the path purports to be for a particular OS, but does not match
  that OS's syntax for a path, e.g. 'C:/autoexec.bat' on Windows,
  - reject the path?
  - be as lenient as possible, e.g. replace '/' with '\' for Windows?
    What about '9:\autoexec.bat' on Windows (bad drivespec)? Acceptable?
- If the path is provided as a sequence of Unicode characters,
  - form the URI by leaving unreserved characters as-is, and
    percent-encoding the rest, using UTF-8 as the basis?
    (RFC 2396bis default)
  - use some other encoding more appropriate to the path's OS?
- If the path is provided as a sequence of bytes, not Unicode characters,
  with no additional info about encoding,
  - reject it because it can't be decoded to Unicode?
  - assume a default encoding?
    - based on...? How confident can you be about, say, a filesystem
      default encoding? (probably not very)
  - attempt no decode; just form the URI by converting to unreserved
    characters only those bytes that, when decoded as ASCII, correspond
    to unreserved characters, and percent-encoding the rest of the bytes
    individually?
- For a Windows path, is it in the form of a local path or a UNC path?
  ("local" may not be the right term)
  - local, absolute, with drivespec: C:\autoexec.bat
  - local, absolute, no drivespec: \autoexec.bat
  - local, relative: the\path
  - UNC: \\host\share\autoexec.bat
    - Do you map the UNC host name to the authority component?
      Don't forget to percent-encode.
    - Do you leave the UNC share name as the first segment of the path
      component, or..? And don't forget to percent-encode.
  - Networked instances of Windows do weird things like refer to network
    printers like this: '\\http://192.168.0.1/printername', and refer to
    shared drives like this: '\\sharename\$d$\autoexec.bat'. When are these
    conventions used? I saw the former today, and the latter a few years
    back on NT4 systems. Are they documented anywhere, and do you want to
    attempt to deal with them?
- For a Windows path, do you do any case normalization, e.g. in the
  drivespec? ('c:' -> 'C:')
- Windows uses ":" in the drivespec (and nowhere else, currently).
  ":" is a reserved character in a URI, but does not need to be
  percent-encoded in a path segment. Therefore, 'file:///C:/autoexec.bat'
  is acceptable as a URI, and is equivalent to 'file:///C%3A/autoexec.bat'.

  There is a convention of using "|", e.g. 'file:///C|/autoexec.bat', I
  believe because of the ambiguities that arise when you have situations
  like 'C:/foo' as a relative URL being resolved against, say,
  'file:/autoexec.bat' or 'file:C:/autoexec.bat' and so on - things that
  appear in the wild and may(?) have been canon at one time, but don't play
  nicely with any relative resolution algorithms.

  I haven't much sympathy for "|" and feel it should be deprecated as much
  as possible. Resolvers should continue to accept it and treat it as
  synonymous with a drivespec ":". On that note, though, should they treat
  *all* "|" as ":", or just those that appear to be a drivespec?
  If ":" or "|" ever become legal characters in Windows paths... then what?
- Empty segments in the path: collapse them? Depends on OS?
  (gets tricky round-tripping on Windows with UNC paths.. I'd have to
  experiment again to give you some good examples though. I decided not to
  worry about it too much).
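To make a few of these choices concrete, here is a minimal Haskell sketch of
one possible path-to-URI conversion, for Windows paths only. Everything in it
is an assumption rather than a recommendation: the names (WinPath, classify,
toFileUri and friends) are made up for illustration, the absolute test follows
the regular expression above (so 'C:foo' counts as absolute), the drivespec is
upper-cased, the UNC host goes in the authority component, the share name
becomes the first path segment, UTF-8 is the basis for percent-encoding, and
the unreserved set is taken to be ALPHA / DIGIT / "-" / "." / "_" / "~".
Drive-less and relative paths are simply rejected.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (intToDigit, isAlpha, isAscii, isDigit, ord, toUpper)
import Data.List (intercalate)

-- | Rough classification of a Windows path (hypothetical type).
data WinPath
  = DriveAbs Char [String]  -- C:\autoexec.bat
  | RootAbs [String]        -- \autoexec.bat (no drivespec)
  | UNC String [String]     -- \\host\share\autoexec.bat
  | Relative [String]       -- the\path
  deriving (Eq, Show)

-- | Split on a separator, keeping empty segments.
splitOn :: Char -> String -> [String]
splitOn sep s = case break (== sep) s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOn sep rest

-- | Classify a path; the drivespec test mirrors ^(\\|[A-Za-z]:).
classify :: String -> WinPath
classify ('\\' : '\\' : rest) = case splitOn '\\' rest of
  host : segs -> UNC host segs
  []          -> UNC "" []  -- not reached: splitOn never returns []
classify (d : ':' : rest)
  | isAscii d && isAlpha d = DriveAbs d (splitOn '\\' (dropSep rest))
  where
    dropSep ('\\' : xs) = xs
    dropSep xs          = xs
classify ('\\' : rest) = RootAbs (splitOn '\\' rest)
classify p = Relative (splitOn '\\' p)

-- | Assumed unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
isUnreserved :: Char -> Bool
isUnreserved c = isAscii c && (isAlpha c || isDigit c || c `elem` "-._~")

-- | UTF-8 encode one character as a list of byte values.
utf8 :: Char -> [Int]
utf8 c
  | n < 0x80    = [n]
  | n < 0x800   = [0xC0 + (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 + (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 + (n `shiftR` 18), cont (n `shiftR` 12),
                   cont (n `shiftR` 6), cont n]
  where
    n = ord c
    cont x = 0x80 + (x .&. 0x3F)

-- | Render one byte as %XX with upper-case hex digits.
hexByte :: Int -> String
hexByte b = ['%', hexDigit (b `shiftR` 4), hexDigit (b .&. 0x0F)]
  where hexDigit = toUpper . intToDigit

-- | Percent-encode a single segment (never a whole path).
encodeSegment :: String -> String
encodeSegment = concatMap enc
  where
    enc c | isUnreserved c = [c]
          | otherwise      = concatMap hexByte (utf8 c)

joinSegments :: [String] -> String
joinSegments = intercalate "/" . map encodeSegment

-- | One arbitrary set of answers to the questions above.
toFileUri :: WinPath -> Maybe String
toFileUri (DriveAbs d segs) =
  Just ("file:///" ++ [toUpper d] ++ ":/" ++ joinSegments segs)
toFileUri (UNC host segs) =
  Just ("file://" ++ encodeSegment host ++ "/" ++ joinSegments segs)
toFileUri _ = Nothing  -- drive-less and relative paths are rejected here
```

With these definitions, toFileUri (classify "C:\\autoexec.bat") gives
Just "file:///C:/autoexec.bat", and a relative path such as "the\\path"
gives Nothing. Changing any one of the assumptions listed above changes the
output, which is rather the point.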
That's all for now, and that's only touching on the conversion *to* a URI,
for just two OSes. The conversion from a URI to an OS path is even worse...
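As one small piece of that reverse direction, here is a Haskell sketch of the
conservative answer to the "|" question: treat "|" as ":" only when it
directly follows a single ASCII letter that makes up the whole first segment
of the URI's path, and leave every other "|" alone. The function names are
hypothetical, and the sketch assumes the path component has already been
extracted from the URI and percent-decoded; it does nothing about UNC paths
or the authority component.

```haskell
import Data.Char (isAlpha, isAscii)

-- | An ASCII-range letter, as in the drivespec discussion above.
isDriveLetter :: Char -> Bool
isDriveLetter c = isAscii c && isAlpha c

-- | True when the remaining text starts a new segment or is empty.
atSegmentEnd :: String -> Bool
atSegmentEnd []        = True
atSegmentEnd ('/' : _) = True
atSegmentEnd _         = False

-- | Recognize "/X:" or "/X|" as the complete first segment and return
-- the drive letter together with the rest of the path.
splitDrive :: String -> Maybe (Char, String)
splitDrive ('/' : d : sep : rest)
  | isDriveLetter d && sep `elem` ":|" && atSegmentEnd rest = Just (d, rest)
splitDrive _ = Nothing

-- | Turn a decoded file URI path into a Windows-style path.
uriPathToWindows :: String -> String
uriPathToWindows p = case splitDrive p of
  Just (d, rest) -> d : ':' : map toBackslash rest
  Nothing        -> map toBackslash p
  where
    toBackslash '/' = '\\'
    toBackslash c   = c
```

Here uriPathToWindows "/C|/autoexec.bat" and uriPathToWindows
"/C:/autoexec.bat" both give "C:\\autoexec.bat", while a "|" anywhere else,
as in "/dir/a|b.txt", is passed through untouched.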
-Mike