Conversation

I am using an AirPort Extreme shared disk. XBMC will not scrape any TV shows that have UTF-8 (non-ASCII) characters in the show name. This is because Apple disks store path and file names in decomposed UTF-8 form, which is not compatible with the installed scrapers.
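To make the mismatch concrete, here is a minimal Python sketch (Python is used only for illustration; the XBMC code under discussion is C++). It uses the standard `unicodedata` module to show that the precomposed (NFC) and decomposed (NFD) spellings of the same word are different code point and byte sequences, so a plain string comparison treats them as different names. The sample word is taken from the directory name mentioned later in this thread.

```python
import unicodedata

title_nfc = unicodedata.normalize("NFC", "M\u00fctter")  # 'ü' as a single code point
title_nfd = unicodedata.normalize("NFD", title_nfc)       # 'u' followed by U+0308

# Same text to a human reader, different strings to a program.
same_text = title_nfc == title_nfd                 # False
len_nfc, len_nfd = len(title_nfc), len(title_nfd)  # 6 and 7 code points
bytes_nfc = title_nfc.encode("utf-8")              # b'M\xc3\xbctter'
bytes_nfd = title_nfd.encode("utf-8")              # b'Mu\xcc\x88tter'
```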

Perhaps it would be more useful to do the conversion in the file classes? You'll notice we already handle it in the utf8 to wide conversion functions for display in the UI. By handling it at the filesystem level, where we know the encoding, we don't have to handle it in places where we don't know the encoding.

What happens with NFS/SMB shares from AirportExpress to Linux machines? Does it transfer as 'normal' UTF8 at that point?

@Memphiz: what happens if Apple Express/OSX is serving over NFS to a Linux box? Do you get "corrupt" UTF8 filenames, or is Apple Express/OSX smart enough to serve stuff over NFS/SMB using UTF8 that others will understand?

I looked through iconv and could not find the UTF8-MAC feature anymore, so this is not an option. If you access HFS+ drives directly or via the AFP protocol, Apple takes care that your filenames are decomposed, however you specify them. No problem here. However, if you copy files from HFS+ to ext4, their names stay decomposed. ext4 basically does not care: the same directory can hold two files whose names look identical, one precomposed and one decomposed. So the decomposed file cannot be accessed with a precomposed file name. It is a big mess, introduced by Apple. Under no circumstances would I manipulate strings that are used to access files. Path strings should only be converted for queries elsewhere (scrapers) or for display, since XBMC renders decomposed strings incorrectly.
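A minimal sketch of that rule, in Python for brevity: the path used for file access is never touched, and only a copy of the name is normalized for display or for a scraper query. The function name `display_label` and the `/media/tv/` path are hypothetical and not part of XBMC.

```python
import os
import unicodedata

def display_label(path: str) -> str:
    """Return an NFC copy of the last path component; the path itself is not modified."""
    name = os.path.basename(path.rstrip("/"))
    return unicodedata.normalize("NFC", name)

# Decomposed name, exactly as it is listed on disk (hypothetical location).
stored_path = "/media/tv/Unsere Mu\u0308tter, unsere Va\u0308ter"

label = display_label(stored_path)  # precomposed copy for UI and scraper search
# stored_path is still the decomposed original and is what gets passed to open().
```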

We use UTF-8-MAC on darwin for utf8 to foo conversion, so it's certainly there (see utils/CharsetConverter.cpp). Thus, things should be displayed correctly in the UI already, right? You could thus easily patch in an iconv conversion method. Indeed, you could use CCharsetConverter::utf8To() directly if you changed it to check for UTF8_SOURCE rather than "UTF-8".

I missed the UTF-8-MAC implementation in CharsetConverter.cpp, so we do not need my routines. By the way, I spent half a day searching the internet for routines that do the job and did not find anything suitable...

I have just checked on my Mac. Yes, the decomposed file names are displayed correctly, yet the scraper of course does not work on them. I use the very same AirPort Extreme disk on my RaspberryPi, and there the file names show up incorrectly. I would say that converting from NFD to NFC for display is not an OSX issue, but an issue on any host where OSX drives are mounted. So we should always compose (NFD to NFC) for display and searching.

As I mentioned, I copied files with those filenames from my AirPort drive via AFP to ext4, and the filenames on ext4 stay decomposed. So I infected my ext4 drive as well. Basically you have to expect them everywhere. All my music comes originally from iTunes / OSX, with lots of accented characters in there. I moved all of that down to my server on ext4.

@jmarshallnz: I have looked into both CharsetConverter.cpp and the libiconv source. UTF-8-MAC in CharsetConverter is only a define used on Darwin build targets; it is not available anywhere else. Basically I cannot find a routine that will convert UTF-8-MAC to UTF-8. By the way, decomposed characters are also valid characters in UTF-8. So if you convert Mac-style UTF-8 to ISO-Latin, it will work everywhere as such, no special facilities required.

Converting strings from UTF-8 to ISO-Latin and then back to UTF-8 would do the job, but is certainly not an option, because ISO-Latin is only good for European languages and will fail on other content.
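The limitation is easy to show with a small Python sketch. It uses Python's own `latin-1` codec, not the iconv converters XBMC uses, so it only illustrates the character-set limit, not the behaviour of the actual conversion routines. The helper `fits_latin1` is made up for this example.

```python
import unicodedata

cyrillic_nfd = "\u0438\u0306"                               # 'и' + combining breve
cyrillic_nfc = unicodedata.normalize("NFC", cyrillic_nfd)   # 'й' (U+0439)

def fits_latin1(text: str) -> bool:
    """True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

latin1_ok_german = fits_latin1("M\u00fctter")    # True: 'ü' exists in Latin-1
latin1_ok_cyrillic = fits_latin1(cyrillic_nfc)   # False: 'й' cannot be represented
```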

Basically we have two problems here:

People using Mac shared disks (an AirPort Extreme NAS is a common example), or files copied from Mac shared disks to local filesystems, will experience unpredictable and hard-to-explain problems with scrapers (content not found).

People using Mac content in the XBMC GUI on platforms other than Mac will have character display problems in file lists and other places.

utf8To() will work fine as long as you remove the check (or better, replace the check) that the "to" charset isn't UTF-8. Instead, compare it to UTF8_SOURCE. That will convert from UTF-8-MAC to UTF-8 then on darwin builds, which is a start.

As far as I remember, I tried the first variant on Linux/Debian without success. I will double-check this. Converting to wide may also work, but since decomposed characters are well-formed UTF-8, I am afraid the conversion routine sees no need to convert from decomposed to composed while converting to wide Unicode and back. I will check this as well and give feedback. Thanks for your support on this.
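The concern can be illustrated with a short Python sketch. It round-trips a decomposed string through UTF-16 as a stand-in for a "wide" representation; this uses Python's codecs, so it shows the general principle (an encoding conversion maps code points one to one) rather than what XBMC's iconv-based converters do.

```python
import unicodedata

nfd = unicodedata.normalize("NFD", "V\u00e4ter")   # 'a' followed by U+0308

# Re-encoding changes the byte layout only; the code point sequence is preserved.
round_tripped = nfd.encode("utf-16-le").decode("utf-16-le")

unchanged = round_tripped == nfd
still_decomposed = round_tripped != unicodedata.normalize("NFC", round_tripped)
```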

This is a log from an SSH session on my RPi. "Test" is a directory that resides on an ext4 partition.
The content was copied over from the AirPort Extreme NAS to this place. It is a directory called "Unsere Mütter, unsere Väter".

I tried to cd into it.

For the first cd, I typed the path in. It did not work.
For the second cd, I copied and pasted the path from the ll listing. It did not work.
For the third cd, I used file name completion. It worked.

Conclusion: Linux / ext4 is picky about how you specify accented characters. You need to match the NFD/NFC version exactly, otherwise it will not work. AFP/HFS+ is not picky about this: you can enter paths in either NFC or NFD and it will work.

@jmarshallnz: I have double-checked both suggestions you made. Both conversions leave the NFD / NFC forms as they are. I am not able to compile the OSX version, so I cannot verify the conversion there. Anyhow, on OSX the path names are displayed correctly, but scraper searching fails. You can check in the log what was really searched, because strings are dumped URL-encoded. I believe that on Darwin the native display is simply able to render NFD correctly and no conversions are ever made. Maybe you can verify this. Thanks a lot, Dezi
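For anyone reading such log lines, a small Python sketch of how the two forms differ once URL-encoded. The helper `is_decomposed` is hypothetical and only checks whether the percent-decoded text changes under NFC; the sample strings are made up, not copied from a real XBMC log.

```python
from urllib.parse import quote, unquote
import unicodedata

def is_decomposed(url_encoded: str) -> bool:
    """True if the percent-decoded text is not already in NFC."""
    text = unquote(url_encoded)
    return text != unicodedata.normalize("NFC", text)

query_nfc = quote("M\u00fctter")    # 'M%C3%BCtter'  : one code point for 'ü'
query_nfd = quote("Mu\u0308tter")   # 'Mu%CC%88tter' : 'u' plus combining diaeresis
```

A `%CC` or `%CD` directly after a plain letter in a logged search string is therefore a strong hint that a decomposed name was sent to the scraper.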

The NFD_NFC_Tupel array is pretty large; we could move it to some static area later on. It is just presented as a working draft here. If someone is interested in how the table was generated, just drop me a note.

The only way it's really going to be solved is using ICU or similar, but that's a large change.

As an interim stop-gap, the current solution might do if it was cleaned up for efficiency (the initial check might be optimised a little by jumping by utf8 characters perhaps, and the LUT could be done in O(1), as there's only a range of 40*64*2 possibilities for the codewords).
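A rough Python sketch of those two optimisations, under stated assumptions: the table below is a hypothetical four-entry miniature (the real NFD_NFC_Tupel array is far larger), and a hash-based `dict` stands in for the directly indexed lookup table suggested above. The pre-check relies on the fact that the combining diacritical marks U+0300 to U+036F always begin with byte 0xCC or 0xCD in UTF-8.

```python
# Hypothetical miniature table: (base letter, combining mark) -> precomposed letter.
COMPOSE = {
    ("a", "\u0308"): "\u00e4",
    ("o", "\u0308"): "\u00f6",
    ("u", "\u0308"): "\u00fc",
    ("e", "\u0301"): "\u00e9",
}

def may_contain_combining(utf8: bytes) -> bool:
    """Cheap, conservative pre-check on the raw bytes (false positives are harmless)."""
    return b"\xcc" in utf8 or b"\xcd" in utf8

def compose(utf8: bytes) -> bytes:
    if not may_contain_combining(utf8):
        return utf8                            # common case: nothing to convert
    out = []
    for ch in utf8.decode("utf-8"):
        if out and (out[-1], ch) in COMPOSE:
            out[-1] = COMPOSE[(out[-1], ch)]   # constant-time (average) lookup
        else:
            out.append(ch)                     # unknown pairs are left untouched
    return "".join(out).encode("utf-8")
```

For example, `compose("Mu\u0308tter".encode("utf-8"))` yields the bytes of the precomposed word, while plain ASCII input is returned without being decoded at all.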

For display and passing to 3rd parties (scrapers etc.) only.

I presume that if you have a file stored as NFD (on some HFS+ disk for example) and request it using NFC (i.e. the filename is stored in the database for example in NFC) then things go wrong, or is the disk/filesystem/OS smart enough to figure things out?

The Linux file system ext4 is able to store file names either way, but needs the file name "literally" when accessing those files. That means: say you have files with umlauts copied via Mac/OSX and Samba down to an ext4 file system. A directory listing on the Linux box shows "correct" file names. When you try to access a file with a "typed in" file name, the file is not found. When you use file name completion in bash, it works. Bottom line: thank you, Apple developers. That is all I have to say.
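One way to cope with this without rewriting stored paths is to look the name up in the directory listing and compare normalized forms, then use the spelling found on disk. The Python sketch below shows the idea; `find_entry` is a hypothetical helper, not an XBMC function, and the file is created in a temporary directory just for the demonstration.

```python
import os
import tempfile
import unicodedata

def find_entry(directory: str, wanted: str):
    """Return the on-disk name that matches `wanted` up to normalization, else None."""
    key = unicodedata.normalize("NFC", wanted)
    for name in os.listdir(directory):
        if unicodedata.normalize("NFC", name) == key:
            return name   # use this exact spelling to open the file
    return None

with tempfile.TemporaryDirectory() as tmp:
    on_disk = "Va\u0308ter.txt"                    # decomposed name, as copied from a Mac
    open(os.path.join(tmp, on_disk), "w").close()
    found = find_entry(tmp, "V\u00e4ter.txt")      # caller types the precomposed form
    opened_ok = found is not None and os.path.exists(os.path.join(tmp, found))
```

This mirrors what bash completion does in the session described above: it takes the name from the listing instead of from the keyboard.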

This patch was a workaround because this thing drove me crazy. I generated the table programmatically by testing all possible diacritical characters in NFD and converting them to NFC. All characters for which the conversion succeeded are part of the list.
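The generator itself is not shown in this thread. As a rough reconstruction of the described idea, and not the author's actual tool, a table of this kind could be built in Python from the standard `unicodedata` module: walk the Basic Multilingual Plane, keep every character whose NFD form contains a combining diacritical mark (U+0300 to U+036F) and which NFC maps back to the original character.

```python
import unicodedata

def build_table() -> dict:
    """Map decomposed sequences to their precomposed character (BMP only)."""
    table = {}
    for cp in range(0x00A0, 0x10000):
        if 0xD800 <= cp <= 0xDFFF:          # skip surrogate code points
            continue
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        if nfd == ch:
            continue                         # character has no decomposition
        has_mark = any(0x0300 <= ord(c) <= 0x036F for c in nfd[1:])
        # Keep only entries that survive the round trip NFD -> NFC.
        if has_mark and unicodedata.normalize("NFC", nfd) == ch:
            table[nfd] = ch
    return table

TABLE = build_table()
```

The exact number of entries depends on the Unicode version shipped with the Python interpreter, so it will not necessarily match the array in the patch.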

I know that problem; it's not only mac/linux. If you create some .zip/.rar on mac and unpack it on win32, then you'll have decomposed chars on NTFS (or FAT32/exFAT).
Let's divide problem to smaller pieces and solve them one-by-one.

1. Correctly display any kind of chars (composed/decomposed) in GUI on any platform

2. Get scrapers to work properly with decomposed chars

3. Access files with decomposed chars correctly on all platforms

1 and 2 can be solved together; for 3 we need to carefully inspect our code and remove unwanted charset conversions so that chars are stored in their original form.
I will dig deeply into 1 and 2 after the merge window.

It would be nice if someone shared a .zip/.rar with problem chars from a mac.

@MartijnKaijser This PR does not cover all possible decomposed chars, the decomposed->composed conversion is not optimized (as @jmarshallnz suggested), and the conversion must not live in the scraper, as the UI does not support decomposed chars either.
The PR has not been updated for a long time, so close it.