This appeared fine in preview (same font as the rest of the text). But when I hit Post, the character in the answer had changed into the pre-composed 'n with dot below' (U+1E47), which was ugly.

Why this transformation? Implementing the current behaviour seems like it would take more effort than leaving the characters as they were, so I'm not even sure whether this is a StackExchange-specific bug or part of some Unicode implementation. Anyway, why does this change happen?

Edit: Browser: Google Chrome 8.0.552.237 on Mac OS X 10.6.6. It just occurred to me that Chrome may be submitting 'ṇ' (the pre-composed U+1E47) when I type 'ṇ' (n followed by U+0323), in which case it wouldn't be a StackExchange-specific bug, but I'd like to hear from someone who knows the issue better. (I can't thoroughly test it myself on any Stack Exchange site without bumping up posts. :p)

Edit2: It has happened repeatably in this post itself: in the first "test word", where I type the two-character version using the combining character (n followed by U+0323), the text turns into the pre-composed character (U+1E47) after posting. The second "test word", which I type using the HTML entity, is not affected.

Edit3: In the default font used for meta.SO (Arial), the two forms look alike. Please try a font like Georgia (the default on english.SE) to see the difference. The link to the original post is here, in case it helps.

I think you need to screencap this so we can see the differences. At the moment I just see the same 'n' with a dot underneath both in preview and the posted question. A link to the question would be useful too.
– Kev, Jan 31 '11 at 11:18

@Kev: I've added screenshots. They both look the same in the font here (which is probably Arial), but different in the font on english.SE (which is probably Georgia). In any case, how they look isn't important: the issue is that characters one types are being converted into other ones. :-)
– ShreevatsaR, Jan 31 '11 at 13:29

@ShreevatsaR - Thanks for the screenshots. I was initially excited because I thought someone had figured out a way to use <center> or a Markdown equivalent. Clicked edit, and saw &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;. No <center>. :(
– Kevin Vermeer, Mar 20 '12 at 5:19

2 Answers
2

It seems that either the browser or the backend (Stack Overflow) that receives the data performs Unicode NFC normalization, because from Unicode's point of view the two forms are canonically equivalent, and it is a general recommendation to normalize all input when dealing with Unicode.
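To make the equivalence concrete, here is a minimal sketch in Python using the standard `unicodedata` module. It only illustrates what NFC does to this particular pair of characters; it is not a claim about how Chrome or Stack Exchange implement normalization.

```python
import unicodedata

decomposed = "n\u0323"   # 'n' followed by U+0323 COMBINING DOT BELOW
precomposed = "\u1e47"   # U+1E47 LATIN SMALL LETTER N WITH DOT BELOW

# NFC composes the two-code-point sequence into the single pre-composed one.
nfc = unicodedata.normalize("NFC", decomposed)
# NFD goes the other way and splits the pre-composed character again.
nfd = unicodedata.normalize("NFD", precomposed)

print(len(decomposed), len(nfc))  # 2 1
print(nfc == precomposed)         # True
print(nfd == decomposed)          # True
```

The two strings compare as different before normalization and as identical after it, which is why normalizing input is commonly recommended before comparing or storing text.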

Ah thanks, +1. I suspected there was some Unicode specification behind this. :-) Unfortunately, canonically equivalent does not imply visually equivalent because of font issues, but yes, normalisation seems like a good idea and not necessarily a "bug". The lesson is that if it really matters which character is used, it's better to use explicit HTML entities than to type the character directly. I'm curious to know whether it's the browser or StackExchange that's doing this, but it's not really important.
– ShreevatsaR, Jan 31 '11 at 13:45

Wireshark for Chrome on a Mac shows that the n followed by U+0323 is sent as %E1%B9%87, which are the percent-encoded hexadecimal values for a UTF-8 encoded U+1E47.

So it's indeed Chrome that's doing this.

(And it seemed to me that Chrome was doing some double encoding, but apparently this is expected for the application/x-www-form-urlencoded form data that Chrome sends.)
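As a rough check of the values above, the following Python sketch reproduces the percent-encoding with the standard library. The form field name `body` is a made-up placeholder, not the field Stack Exchange actually uses.

```python
from urllib.parse import quote, urlencode

precomposed = "\u1e47"   # what Wireshark showed being sent
decomposed = "n\u0323"   # what was typed

# UTF-8 encoding of U+1E47 is three bytes: E1 B9 87.
utf8_hex = precomposed.encode("utf-8").hex().upper()
print(utf8_hex)                   # E1B987

# Percent-encoding writes each byte as '%' plus two hex digits,
# so three bytes become nine ASCII characters.
percent = quote(precomposed)
print(percent, len(percent))      # %E1%B9%87 9

# Had the decomposed form been sent unchanged, it would look different:
unnormalized = quote(decomposed)
print(unnormalized)               # n%CC%A3

# Form data (application/x-www-form-urlencoded) uses the same escaping.
# 'body' is a hypothetical field name used only for illustration.
form = urlencode({"body": precomposed})
print(form)                       # body=%E1%B9%87
```

Seeing `%E1%B9%87` rather than `n%CC%A3` on the wire is what indicates that the text was already normalized before it left the browser.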

Next, Mac OS X renders the character using Times, not using Georgia. To see which font is used, simply paste the sentence into TextEdit, and select the specific character:

Maybe Georgia does not include the character, or maybe OS X does not like the way the character is included in Georgia. In Safari (but not in Firefox; I don't know about Chrome), conflicts between Microsoft's OpenType and Apple's AAT are known to cause problems with Arial when Microsoft Office (or a trial version of it) installs its own OpenType version of the font next to Apple's AAT one.

Chrome uses Times because that is listed as the second fallback in the CSS:

font-family: Georgia, 'Times New Roman', Times, serif;

(Apparently Times New Roman does not include the character either.) Changing that CSS to read:

font-family: Georgia, Arial, 'Times New Roman', Times, serif;

...will make Chrome (and TextEdit) use Arial for that ṇ. On the revisions page, you can test by right-clicking the text, selecting "Inspect Element", then finding the font-family declaration in the right pane and double-clicking it to add Arial:

(So, if you have some font that looks like Georgia, but does include the special characters, then you could create some user script to change the CSS for that site.)
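The fallback behaviour described above can be modelled with a small Python sketch. It is a deliberate simplification of real browser font matching: the coverage table is an assumption taken from what this thread reports (Georgia and Times New Roman apparently lacking U+1E47, Times and Arial having it), and the helper names `parse_font_family` and `pick_font` are invented for illustration.

```python
from typing import Dict, List, Optional

# Assumed glyph coverage for U+1E47, as reported in this thread (not measured here).
COVERS_U1E47: Dict[str, bool] = {
    "Georgia": False,
    "Times New Roman": False,
    "Times": True,
    "Arial": True,
}

def parse_font_family(value: str) -> List[str]:
    """Split a font-family value into family names, dropping quotes."""
    return [name.strip().strip("'\"") for name in value.split(",")]

def pick_font(font_family: str, coverage: Dict[str, bool]) -> Optional[str]:
    """Return the first listed family assumed to contain the character."""
    for name in parse_font_family(font_family):
        if coverage.get(name, False):
            return name
    return None  # would fall through to a generic family such as serif

original = pick_font("Georgia, 'Times New Roman', Times, serif", COVERS_U1E47)
modified = pick_font("Georgia, Arial, 'Times New Roman', Times, serif", COVERS_U1E47)
print(original)  # Times
print(modified)  # Arial
```

The point is that fallback happens per character: the rest of the text still uses Georgia, and only the missing glyph is taken from the first later font that has it.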

Thank you so much; that answers it! (Of course it's possible that SE does it too, since it's a recommended practice.) What is double encoding and how can it matter, BTW?
– ShreevatsaR, Jan 31 '11 at 14:17

@ShreevatsaR, U+1E47 is hexadecimal 0xE1B987 when UTF-8 encoded. I figured it was a bit odd that this was not just sent as 3 bytes, but as 9 characters in %E1%B9%87. But, according to Wikipedia, this is fine for application/x-www-form-urlencoded form data (which is indeed used). (So, I edited that part from my answer.)
– Arjan, Jan 31 '11 at 14:24

WOW! Many many thanks for showing me how to figure out the font a character is being rendered in… I've in many instances wanted to know such a thing, but it never occurred to me that something as simple as copying the text into TextEdit would work. Got to love the Mac! :-) And yes, the character is indeed missing from Georgia: if you open the Character Palette (now Character Viewer) and display the character U+1E47, it shows "fonts containing selected character" below — and Georgia is not one of the fonts.
– ShreevatsaR, Jan 31 '11 at 18:35

Ah thanks, that part I knew. :-) The thing is, Georgia is a serif font and Arial is a sans-serif font, so it's unlikely that they'll go well together or that it's a good idea to put Arial as fallback to Georgia in CSS… we saw that even a serif font like Times seems out of place relative to Georgia. Besides, this is a very rare case and there's a simple workaround, so I'm too lazy for a user script to be worth it. :-)
– ShreevatsaR, Feb 1 '11 at 16:49

And, @ShreevatsaR, certainly a user script would not help on SE sites, as you (we) want everyone to see the nicely formatted post of course. So: just something to keep in mind for other websites. (I'm suddenly wondering how the original post looked on Windows. I might boot an old machine to check that!)
– Arjan, Feb 1 '11 at 16:52