Encoding Detection Revised

In KDE releases up to version 4.4, Kate unfortunately often selected the wrong encoding. As a result, characters such as the German umlauts (öäü) show up as cryptic signs in the text editor. What I’ve seen many times is that people then start fixing those characters manually throughout the entire document. In other words: they do not realize that the text document was simply opened with the wrong encoding. In fact, users usually do not even know what an encoding is. While this is of course kind of sad, it certainly won’t change…

Given this fact, the only correct “fix” is a very good automatic encoding detection, such that the encoding is usually chosen correctly. In the rewrite of Kate’s text buffer for KDE 4.5, Christoph also rewrote the file loader including the encoding detection. The detection now works as follows:

1. Try the encoding selected by the user (through the open-file-dialog or the console).
2. Try encoding detection (some intelligent trial & error method).
3. Use the fallback encoding.

In step 1, Kate tries to use the encoding specified in the open-file-dialog or the one given when launching Kate from the console. On success, we are done.

The encoding detection in step 2 first checks for a Unicode encoding by looking for a Byte Order Mark (BOM). If one is found, it is certain that the text document is Unicode encoded. If there is no BOM, Kate next uses a tool from kdelibs (KEncodingProber) to detect the correct encoding. This is basically trial & error: try encoding A; if the document contains byte sequences that are not valid in that encoding, try encoding B, then C, and so on… Unfortunately, this also doesn’t always work, because a byte sequence might be valid in several encodings and represent different characters. This is why it’s more or less impossible to get the encoding always right. There is simply no way…
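To make the two ideas in step 2 concrete, here is a minimal Python sketch. It is not Kate's or KEncodingProber's code: the helper names `sniff_bom` and `encodings_that_fit` and the candidate list are made up for illustration. It checks for a BOM, and then shows why plain trial & error is ambiguous: the same bytes decode without error in more than one encoding.

```python
import codecs

# BOM signatures. The UTF-32 entries come first because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a BOM, else None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None


def encodings_that_fit(data: bytes, candidates):
    """Return every candidate encoding that decodes data without error."""
    fits = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        fits.append(name)
    return fits


sample = "Käse".encode("utf-8")  # the bytes 4B C3 A4 73 65

print(sniff_bom(codecs.BOM_UTF8 + sample))  # BOM present
print(sniff_bom(sample))                    # no BOM
print(encodings_that_fit(sample, ["ascii", "utf-8", "iso-8859-1", "iso-8859-15"]))

# Decoding succeeds in ISO-8859-1 too, but the umlaut turns into two
# unrelated characters, which is exactly the "cryptic signs" effect.
as_latin1 = sample.decode("iso-8859-1")
```

Only `ascii` is rejected here. The three remaining encodings all accept the bytes, so a decoder that merely checks for validity cannot tell which one the author meant.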

If the encoding detection fails, Kate uses a fallback encoding. You can configure this fallback encoding in the editor component settings in the “Open/Save” category. If the fallback encoding fails as well, the document is marked as read-only and a warning is shown.
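The whole loading sequence can be sketched in a few lines of Python. This is only an illustration of the three steps described above, not the actual loader: the function names, the candidate list and the default fallback are assumptions, the BOM check covers only UTF-8 and UTF-16 to keep it short, and a strict decode stands in for the real detection.

```python
import codecs


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if data is not valid in encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


def load_text(data, selected=None, candidates=("utf-8",), fallback="iso-8859-1"):
    """Return (text, encoding, read_only) following the three steps."""
    # Step 1: the encoding the user asked for, if any.
    if selected is not None:
        text = try_decode(data, selected)
        if text is not None:
            return text, selected, False

    # Step 2a: a BOM settles the question. Python's "utf-8-sig" and
    # "utf-16" codecs consume the BOM themselves.
    for bom, name in ((codecs.BOM_UTF8, "utf-8-sig"),
                      (codecs.BOM_UTF16_LE, "utf-16"),
                      (codecs.BOM_UTF16_BE, "utf-16")):
        if data.startswith(bom):
            text = try_decode(data, name)
            if text is not None:
                return text, name, False

    # Step 2b: trial & error over a list of candidate encodings.
    for name in candidates:
        text = try_decode(data, name)
        if text is not None:
            return text, name, False

    # Step 3: the configured fallback encoding.
    text = try_decode(data, fallback)
    if text is not None:
        return text, fallback, False

    # Everything failed: show what can be shown, but mark it read-only.
    return data.decode(fallback, errors="replace"), fallback, True


utf8_bytes = "Käse".encode("utf-8")
latin1_bytes = b"K\xe4se"  # the same word stored as ISO-8859-1

print(load_text(utf8_bytes)[1:])                       # detected in step 2
print(load_text(latin1_bytes)[1:])                     # rescued by the fallback
print(load_text(latin1_bytes, fallback="ascii")[1:])   # fallback fails, read-only
```

In this sketch "fails" simply means that a strict decode raised an error. The last call shows the read-only case: neither UTF-8 nor the ASCII fallback accepts the byte 0xE4, so the text is returned with a replacement character and flagged.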

What about Kile and KDevelop?

One of the applications that suffered heavily from the wrong encoding detection in the past was the LaTeX editor Kile. The same probably holds for KDevelop (although it’s usually less critical with source code). The good news is that with KDE >= 4.5 the problems with respect to wrong encoding should be gone. So it’s certainly worth updating if you are affected by this issue.

“Unfortunately, this also doesn’t always work, because a byte sequence might be valid in several encodings and represent different characters. This is why it’s more or less impossible to get the encoding always right. There is simply no way…”

Well, there’s always the possibility to evaluate the resulting text against a dictionary of all known words of all languages for each encoding, to see which encoding results in the largest number of recognized words… And why stop there? Include a neural network algorithm that trains itself to recognize how much “sense” the contents of opened files make under each encoding… As you can see, there’s always a way!

(Just kidding. Encoding detection works great for me in KDE 4.5, great work…)

It isn’t really clear to me if Kate has a more intelligent way to detect the encoding besides using KEncodingProber or KEncodingDetector. I tried both of the latter, and I even failed to detect UTF-8. Is there a trick/code that you could share or “port” to kdecore? You even say that the fallback encoding could “fail”. I don’t understand how I can actually check if using a specific encoding “works” or “fails”. I would like to fix bug 228172. Many thanks (for the best editor out there, it just misses CygnusEd-like macro recording 😉)

As I understand it, the only additional stuff that might really be of interest is the BOM detection (if it is not already there in KEncodingProber). BOMs are optionally used at the beginning of a file. So if there is no “file-support” in KEncodingProber (no idea), then it’s not really useful for you.

I really appreciate that 🙂
It was really annoying in the past when you had some iso-8859-1 and some utf-8 files. I upgraded to KDE SC 4.5 yesterday and tried out the encoding detection just a moment ago. Works like a charm, thank you guys 🙂

Yes, that can be done, and I’d assume that KEncodingProber already does something like that — at least that’s the place where it belongs 🙂
We could write thousands and thousands of lines for encoding detection and it would still not always work. So it is better to have something simple that works in almost all cases than to maintain thousands of lines of code just to support one more corner case. In other words: We’ll most likely not implement it in Kate directly :^)

As good as the detection code is, it is not infallible, so it would be great if there was a visible clue when the detection was not fully trustworthy (either the fallback was used, or the statistics show too many possible encodings). A very neat way to do it would be a bar appearing at the top of the edit area saying “This looks like [a href=encoding docs]encoding foo[/a] but I may be wrong. You can select another encoding now [button: list of statistically-possible encodings] or change the encoding at any later time using menu>foo>bar. [button: encoding is fine]”

The term “not fully trustworthy” does not really make sense: If we found a suitable encoding, we’re simply done. That’s it.
If we did not find a suitable encoding, you get an error message in the form of a modal dialog that warns you, and the document is marked as read-only.

It indeed makes sense imo to show a bar on the top, that’s a nice idea. Do you volunteer? 🙂

It really should be possible to disable encoding detection completely, as it cannot understand files with “invalid” encoding(s). For example, files that use different encodings in different parts.

It would be really nice if there were an option to enable “use only the selected encoding and stay with it even if it is invalid”. Maybe there could be big red letters stating that everyone who keeps this checkbox checked might lose (and probably loses) everything.

I’ve just managed to waste about half an hour of my life just to realize that the “select encoding” menu is broken, killed by some stupid encoding enforcer that tries to be smarter than the user. Artificial intelligence might be a requirement in some computer games, but in a text editor there really should be an option to disable it.

Using Kate, I struggled with files with “unknown” or “broken” encoding many times. What I think is needed is a way to enforce a certain encoding, and have Kate show the characters that are “wrong” (at least which line they are in), or simply replace them by a given character, or even just drop them. Often, these are just a few and manually fixing them is not a big deal.

I just had a file that had a NUL character in it (obtained it from a Windows user who uses LyX with some encoding unknown to me – don’t know why the NUL was there).

I spent about an hour with “iconv”, “grep” and “kate” trying to find the problem. Kate basically kept saying something about invalid characters, but iconv did not detect and remove the NUL.
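Outside of Kate, this kind of hunt can be shortened with a small script. The following Python sketch is only an illustration, with made-up helper names `report_problems` and `force_decode`. It assumes an ASCII-compatible encoding such as UTF-8 or ISO-8859-1, so that lines can be split on the newline byte and a NUL byte is never part of a normal character. It lists the lines that contain a NUL byte or a byte that is invalid in the enforced encoding, and then decodes the data anyway by replacing or dropping the bad bytes.

```python
def report_problems(data: bytes, encoding: str):
    """Yield (line_number, description) for suspicious bytes in each line."""
    for number, line in enumerate(data.split(b"\n"), start=1):
        if b"\x00" in line:
            yield number, "NUL byte at byte %d" % (line.index(b"\x00") + 1)
        try:
            line.decode(encoding)
        except UnicodeDecodeError as err:
            # Only the first invalid byte of the line is reported.
            yield number, "invalid byte 0x%02x at byte %d" % (
                line[err.start], err.start + 1)


def force_decode(data: bytes, encoding: str, drop: bool = False) -> str:
    """Decode with the given encoding no matter what.

    Invalid bytes become U+FFFD, or vanish if drop is True.
    NUL characters are always removed.
    """
    text = data.decode(encoding, errors="ignore" if drop else "replace")
    return text.replace("\x00", "")


data = b"first line\nbad \xff byte\nhidden\x00nul\n"

for number, problem in report_problems(data, "utf-8"):
    print("line %d: %s" % (number, problem))
# line 2: invalid byte 0xff at byte 5
# line 3: NUL byte at byte 7

cleaned = force_decode(data, "utf-8", drop=True)
```

Note that a NUL byte is perfectly valid UTF-8, so a strict conversion alone does not reject it. That is why the sketch checks for it separately.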