Can Using UTF-8 Degrade Performance in Some Languages?

joebert

Fart Bubbles

Posts: 13506

Loc: Florida

3+ Months Ago

I've been reading up a bit more on UTF-8 lately and one of the disadvantages I came across was UTF-8 using 2 and in some cases 3 times as much space to represent data as the native encoding for the language would use.

Quote:

UTF-8 encoded text is larger than the appropriate single-byte encoding except for plain ASCII characters. In the case of languages which used 8-bit character sets with non-Latin alphabets encoded in the upper half (such as most Cyrillic and Greek alphabet code pages), letters in UTF-8 will be double the size. For some languages such as Hindi's Devanagari and Thai, letters will be triple the size (this has caused objections in India and other countries).
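The ratios in that quote are easy to check. Here is a minimal Python sketch that encodes the same short string once in a legacy single-byte code page and once in UTF-8. The sample words, the `samples` table, and the `size_ratio` helper are made up for illustration; the codec names are ones shipped with Python's standard library.

```python
# Compare the encoded size of the same text in a legacy single-byte
# code page and in UTF-8. Sample strings and names are illustrative only.
samples = {
    "English": ("hello", "ascii"),
    "Russian": ("привет", "cp1251"),
    "Greek": ("γεια", "iso8859_7"),
    "Thai": ("สวัสดี", "tis_620"),
}


def size_ratio(text, legacy_codec):
    """Return (legacy_bytes, utf8_bytes, utf8_bytes / legacy_bytes)."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return legacy, utf8, utf8 / legacy


for language, (text, codec) in samples.items():
    legacy, utf8, ratio = size_ratio(text, codec)
    print(f"{language:8} {codec:10} legacy={legacy:2} utf8={utf8:2} ratio={ratio:.1f}")
```

This only measures strings made up entirely of letters from one script. Spaces, digits, punctuation, and markup are ASCII and stay at one byte each in UTF-8, so real documents usually land below these worst-case ratios.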

My first thought was: big deal, storage space is pretty cheap these days. But then I thought about the RAM it must be using to work with this UTF-8 encoded data in an application. If you're using 2 and 3 times as much memory to work with the same data as you would use if you used the native encoding for the language, that's a big deal when you think about it. It means the application is only 50% as efficient as it would be by simply using the encoding designed for the language, or in the case of the languages mentioned in that quote, 33% as efficient.

Basically, by enforcing UTF-8 in an application in an attempt to have a multi-lingual selling point for that application, you're in some cases requiring the buyer to use 2-3X as much hardware to get the same performance as anyone else using the exact same software. Which is probably the exact opposite of a selling point. It's probably enough to defeat the purpose of deciding to use UTF-8 in the first place.

Am I crazy? Have I been reading too many Global Warming and Green Energy headlines in the news?

mk27

Proficient

Posts: 334

3+ Months Ago

joebert wrote:

It means the application is only 50% as efficient as it would be by simply using the encoding designed for the language, or in the case of the languages mentioned in that quote, 33% as efficient.

I don't think the consequences will add up that way literally, either for the RAM or the program efficiency, because you are working with UTF-8 values, but it is true you will be dealing with twice as much space with disk files.

joebert

mk27 wrote:

I don't think the consequences will add up that way literally, either for the RAM or the program efficiency, because you are working with UTF-8 values, but it is true you will be dealing with twice as much space with disk files.

Are you suggesting it takes less RAM to store multi-byte characters than it does disk space?

Why would it take any less space to store the character in RAM than it would on disk?

mk27

Proficient

Posts: 334

3+ Months Ago

joebert wrote:

Are you suggesting it takes less RAM to store multi-byte characters than it does disk space?

Why would it take any less space to store the character in RAM than it would on disk?

No, I'm suggesting that, by and large, most applications hold an amount of human-readable text which is very minor relative to the other components maintained in RAM. E.g., the average web page contains an amount of text which would only account for a few percent (or less) of the memory held by the processes at either end -- i.e., in simple terms, the server and the browser occupy much, much more memory than the trivial amount of text they present.
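To put rough numbers on that argument, here is a small Python sketch. Everything in it is hypothetical: the `page` string is an invented HTML snippet, `utf8_overhead` and `share_of_process` are illustrative helpers, and `ASSUMED_PROCESS_BYTES` is an arbitrary stand-in for a process footprint, not a measurement of any real server or browser.

```python
# Hypothetical page: ASCII markup wrapped around a short Russian sentence.
page = '<html><body><p class="intro">Привет, мир!</p></body></html>'

# Assumption for illustration only, not a measured value.
ASSUMED_PROCESS_BYTES = 50 * 1024 * 1024


def utf8_overhead(text, legacy_codec="cp1251"):
    """Return (extra bytes UTF-8 needs, UTF-8 size / legacy size)."""
    legacy = len(text.encode(legacy_codec))
    utf8 = len(text.encode("utf-8"))
    return utf8 - legacy, utf8 / legacy


def share_of_process(extra_bytes, process_bytes):
    """Fraction of the assumed process footprint taken by the extra bytes."""
    return extra_bytes / process_bytes


extra, ratio = utf8_overhead(page)
print(f"extra bytes: {extra}, size ratio: {ratio:.2f}")
print(f"share of assumed footprint: {share_of_process(extra, ASSUMED_PROCESS_BYTES):.8f}")
```

Only the Cyrillic letters double in size; the markup, space, and punctuation stay at one byte each, so the page as a whole grows by much less than 2x. Whether the extra bytes matter depends on how much of a given application's memory is actually text, which would need to be measured for the real workload.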