NTFS and Unicode?

by Michael S. Kaplan, published on 2006/09/24 15:40 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/24/769540.aspx

A recent question I received via email from a colleague who preferred to remain anonymous on the blog:

Hope everything is going well with you first of all...

May I ask for your help on an NTFS technical question? I'm currently involved in a discussion of some CIFS/NTFS compatibility issues, and I'm wondering which Windows release was the first to support UTF-16 and characters beyond the BMP?

Would you please let me know if you have the info handy or point me to one of the public documents available at Microsoft web sites? (I was trying to do web search but I wasn't really able to find the info from www.microsoft.com...)

Thanks very much in advance for your help and hope this isn't a trade secret that I'm asking for...

Well, since as far as I know I don't know any trade secrets about NTFS, we are probably safe on that count, at least! Just to make sure, I'll stick to stuff that anyone can verify themselves if they want.... :-)

Of course there is the info I just put up in this blog post for starters, and I'll go a step further and make it clear that you could use high surrogate and low surrogate code units in NT even before they were actually defined (since none of the current or past incarnations of NT disallow unassigned code points).
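To make the "code units" point concrete, here is a minimal Python sketch (nothing NTFS-specific, and the function names are just made up for illustration) showing that a character beyond the BMP is simply two 16-bit code units in UTF-16. A store that accepts arbitrary 16-bit values never has to know that the two units form a pair:

```python
import struct


def surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def utf16_code_units(text):
    """Return the 16-bit code units of text, as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


high, low = surrogate_pair(0x10400)
print("U+10400 -> %04X %04X" % (high, low))
print(["%04X" % unit for unit in utf16_code_units("\U00010400")])
```

Both lines report the same two code units, D801 and DC00, which is all that a layer working purely in 16-bit units ever sees.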

The Wikipedia article is really quite misleading on this score with its text:

File names are stored in Unicode (encoded as UTF-16, although limited to the Basic Multilingual Plane in early versions before Windows 2000).

Well, I'll point out that whoever wrote this bit either confused NTFS with Active Directory (which actually was limited on this point until Windows XP/Server 2003, which is when surrogate code units first received weight) or simply doesn't understand NTFS and did not test creating such files on NT 4.0 or earlier.

In my ideal world, a future version of NTFS would actually (optionally) take into account both characters defined in Unicode and also Unicode normalization, but as far as I know there isn't anyone planning such a thing yet.

So if I absolutely had to describe NTFS in terms of a Unicode version, I'd say it uses a very early version of Unicode and allows anything it believes to be an unassigned code point, for forward compatibility. :-)

and these two too (encoded in UTF-16, of course):
file 1: U+10400
file 2: U+10428
I was hoping that the last example might work in Vista, because NTFS file "$UpCase" in Vista is different from the one in Windows XP.
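A rough Python sketch of why a pair like this would be expected to stay distinct, assuming (and this is only an assumption for illustration, not a dump of the real "$UpCase" file) that case folding is done one 16-bit code unit at a time. The helper names are hypothetical:

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text, as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def fold_per_code_unit(name):
    """Uppercase each code unit on its own (hypothetical model, not real NTFS data)."""
    folded = []
    for unit in utf16_code_units(name):
        if 0xD800 <= unit <= 0xDFFF:
            # A lone surrogate has no case mapping, so it passes through unchanged.
            folded.append(unit)
            continue
        upper = chr(unit).upper()
        # Keep only one-to-one mappings that still fit in a single code unit.
        if len(upper) == 1 and ord(upper) <= 0xFFFF:
            folded.append(ord(upper))
        else:
            folded.append(unit)
    return folded


# BMP letters fold together under this model.
print(fold_per_code_unit("file.txt") == fold_per_code_unit("FILE.TXT"))

# U+10400 and U+10428 are a case pair at the character level...
print("\U00010428".upper() == "\U00010400")

# ...but their low surrogates (DC00 and DC28) differ, so the model keeps them apart.
print(fold_per_code_unit("\U00010400") == fold_per_code_unit("\U00010428"))
```

Under that assumption the two names only compare equal if the folding step understands surrogate pairs, which a table indexed by single code units cannot do.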

I'm a bit surprised Wikipedia is used as a reference; don't you guys have an internal wiki or something? IIRC Ward Cunningham himself works for the company. If there's no wiki I would have thought that there would be an internal website that had info on it or something...