I mean, UTF-8 is backward-compatible with ASCII anyway. So even if you have to support legacy stuff, UTF-8 would work just fine with no other changes needed.
–
PacerierJul 30 '11 at 14:13


Maybe you've got to interoperate with a system that packs 8 ASCII characters into 7 bytes? People did crazy stuff to fit things in.
–
Donal FellowsJul 31 '11 at 11:42

Call me nuts, but I'd say security and stability. A character set without multi-byte sequences is a lot harder to break. Don't get me wrong, when human language support is important ASCII won't cut it. But if you're just doing some basic programming and can squeeze yourself into the native language the compiler and operating system were written for, why add the complexity? @Donal Fellows: Last I checked... ASCII is 7 bits. (anything with that extra bit just isn't ASCII and is asking for trouble)
–
ebyrobApr 1 '14 at 13:37

@ebyrob I think Donal Fellows means bit-packing 8 ASCII symbols into 7 bytes, since each symbol uses 7 bits: 8*7 = 56 bits = 7 bytes. It would mean a special encode and decode function, just to save 1 byte of storage out of every 8.
–
dodgy_coderFeb 26 at 7:08

5 Answers

In some cases it can speed up access to individual characters. Imagine the string str = 'ABC' encoded in UTF-8 and in ASCII (and assume that the language/compiler/database knows about the encoding).

To access the third character ('C') of this string using the array-access operator featured in many programming languages, you would do something like c = str[2].

Now, if the string is ASCII encoded, all we need to do is fetch the third byte of the string.

If, however, the string is UTF-8 encoded, we must first check whether the first character is a one-byte or multi-byte character (a UTF-8 character can occupy one to four bytes), then perform the same check on the second character, and only then can we access the third character. The longer the string, the bigger the performance difference.
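To make the difference concrete, here is a minimal sketch in Python. The `utf8_char_at` helper is hypothetical, written just for this illustration: it walks the bytes from the start, classifying each lead byte, which is the linear scan described above; ASCII indexing, by contrast, is a single byte fetch.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at `index` in UTF-8 bytes.

    Illustrative helper: we must walk from the start, because each
    character may occupy 1-4 bytes depending on its lead byte.
    """
    pos = 0
    for _ in range(index):
        b = data[pos]
        if b < 0x80:        # 0xxxxxxx: 1-byte (ASCII) character
            pos += 1
        elif b < 0xE0:      # 110xxxxx: start of a 2-byte sequence
            pos += 2
        elif b < 0xF0:      # 1110xxxx: start of a 3-byte sequence
            pos += 3
        else:               # 11110xxx: start of a 4-byte sequence
            pos += 4
    b = data[pos]
    length = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return data[pos:pos + length].decode("utf-8")

# ASCII: character i is simply byte i -- constant time.
ascii_data = "ABC".encode("ascii")
print(chr(ascii_data[2]))          # C

# UTF-8: we must scan past the variable-width characters first.
utf8_data = "äöC".encode("utf-8")  # 'ä' and 'ö' take 2 bytes each
print(utf8_char_at(utf8_data, 2))  # C
```

Real implementations either accept this O(n) scan, keep a separate character-offset index, or use a fixed-width internal representation.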

This is an issue, for example, in some database engines: to find the beginning of a column placed 'after' a UTF-8 encoded VARCHAR, the database needs to check not only how many characters are in the VARCHAR field, but also how many bytes each of them uses.

@DeanHarding How does the character count tell you where the second character starts? Or should the database hold an index for each character offset too? Note: it isn't just 2 bytes per character, but could be up to 4 (formerly even 6) stackoverflow.com/questions/9533258/…. (I think it's only UTF-16 that had the really long abominations that could destroy your system)
–
ebyrobApr 1 '14 at 13:44

If you're going to use only the US-ASCII (or ISO 646) subset of UTF-8, then there's no real advantage to one or the other; in fact, everything is encoded identically.

If you're going to go beyond the US-ASCII character set and use (for example) characters with accents, umlauts, etc., that are used in typical western European languages, then there's a difference -- most of these can still be encoded with a single byte in ISO 8859, but will require two or more bytes when encoded in UTF-8. There are also, of course, disadvantages: ISO 8859 requires that you use some out-of-band means to specify the encoding being used, and it only supports one of these languages at a time. For example, you can encode all the characters of the Cyrillic (Russian, Belarusian, etc.) alphabet using only one byte apiece, but if you need or want to mix those with French or Spanish characters (other than those in the US-ASCII/ISO 646 subset) you're pretty much out of luck -- you have to switch to a completely different character set to do that.
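A quick Python sketch of the trade-off described above (the sample string is just an illustration): accented western European characters cost one byte in ISO 8859-1 but two in UTF-8, while a single ISO 8859 code page simply cannot represent characters from another alphabet.

```python
text = "déjà vu"  # accented Latin characters, typical of western European text

latin1 = text.encode("iso-8859-1")  # one byte per character
utf8 = text.encode("utf-8")         # accented characters take two bytes

print(len(latin1))  # 7
print(len(utf8))    # 9 ('é' and 'à' cost two bytes each)

# But ISO 8859-1 has no Cyrillic at all -- mixing alphabets fails:
try:
    "Привет".encode("iso-8859-1")
except UnicodeEncodeError:
    print("Cyrillic is not representable in ISO 8859-1")
```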

ISO 8859 is really only useful for European alphabets. To support most of the scripts used for Chinese, Japanese, Korean, Arabic, etc., you have to use a completely different encoding. Some of these (e.g., Shift JIS for Japanese) are an absolute pain to deal with. If there's any chance you'll ever want to support them, I'd consider it worthwhile to use Unicode just in case.

yes i'm talking about the 7-bit ASCII set. can you think of one advantage we would ever get from saving something as ASCII instead of UTF-8? (since the 7 bits would be stored as 8 bits anyway, the file size would be exactly the same)
–
PacerierJul 30 '11 at 14:13


If you have characters with Unicode values above 127, they cannot be saved in ASCII.
–
user1249Jul 30 '11 at 14:47


@Pacerier: Any ASCII string is a UTF-8 string, so there is no difference. The encoding routine might be faster depending on the string representation of the platform you use, although I wouldn't expect significant speedup, while you have a significant loss in flexibility.
–
back2dosJul 30 '11 at 16:04

@Thor that is exactly why i'm asking if saving as ASCII has any advantages at all
–
PacerierJul 30 '11 at 17:06


@Pacerier, if you save XML as ASCII you need to use e.g. &#160; for a non-breaking space. This is more verbose, but it makes your data more resistant to ISO-Latin-1 vs UTF-8 encoding errors. This is what we do, as our underlying platform does a lot of invisible magic with characters. Staying in ASCII makes our data more robust.
–
user1249Jul 30 '11 at 17:29

First of all: your title used ANSI, while in the text you refer to ASCII. Please note that ANSI does not equal ASCII. ANSI incorporates the ASCII set, but the ASCII set is limited to the first 128 numeric values (0 - 127).

If all your data is restricted to ASCII (7-bit), it doesn't matter whether you use UTF-8, ANSI or ASCII, as both ANSI and UTF-8 incorporate the full ASCII set. In other words: the numeric values 0 up to and including 127 represent exactly the same characters in ASCII, ANSI and UTF-8.

If you need characters outside of the ASCII set, you'll need to choose an encoding. You could use ANSI, but then you run into the problems of all the different code pages. Creating a file on machine A and reading it on machine B may (or will) produce funny-looking text if the machines are set up to use different code pages, simply because numeric value nnn represents different characters in those code pages.
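The code-page problem is easy to demonstrate: the very same byte decodes to a different character under different Windows code pages (cp1252, cp1251 and cp1253 here stand in for machines configured for western European, Cyrillic and Greek locales).

```python
# The same byte value, 0xE9, means different characters
# depending on which code page the reading machine uses.
raw = bytes([0xE9])

print(raw.decode("cp1252"))  # 'é' -- western European code page
print(raw.decode("cp1251"))  # 'й' -- Cyrillic code page
print(raw.decode("cp1253"))  # 'ι' -- Greek code page
```

Nothing in the byte stream itself says which interpretation is correct; that is exactly the out-of-band guessing game Unicode was designed to end.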

This "code page hell" is the reason the Unicode standard was defined. UTF-8 is but a single encoding of that standard; there are many more. UTF-16 is also widely used, as it is the native encoding for Windows.

So, if you need to support anything beyond the 128 characters of the ASCII set, my advice is to go with UTF-8. That way it doesn't matter, and you don't have to worry about which code page your users have set up their systems with.

if i do not need to support beyond 128 chars, what is the advantage of choosing ASCII encoding over UTF-8 encoding?
–
PacerierJul 30 '11 at 17:16

Besides limiting yourself to those 128 chars? Not much. UTF-8 was specifically designed to cater for ASCII and most western languages that "only" need ANSI. You will find that UTF-8 encodes only a relatively small number of the higher ANSI characters with more than one byte. There is a reason most HTML pages use UTF-8 as a default...
–
Marjan VenemaJul 30 '11 at 18:57


@Pacerier, if you don't need anything above 127, choosing ASCII may be worthwhile when you use some API to encode/decode, because UTF-8 needs additional bit checks to recognize continuation bytes as part of the same character; that can take extra computation compared with pure ASCII, which just reads 8 bits without verification. But I'd only recommend ASCII if you really need a high level of optimization in very large computations and you know what you're doing in that optimization. If not, just use UTF-8.
–
LucianoDec 19 '12 at 13:03

Yes, there are still some use cases where ASCII makes sense: file formats and network protocols. In particular, for uses where:

You have data that's generated and consumed by computer programs, never presented to end users;

But which it's useful for programmers to be able to read, for ease of development and debugging.

By using ASCII as your encoding you avoid the complexity of multi-byte encoding while retaining at least some human-readability.

A couple of examples:

HTTP is a network protocol defined in terms of sequences of octets, but it's very useful (at least for English-speaking programmers) that these correspond to the ASCII encoding of words like "GET", "POST", "Accept-Language" and so on.

The chunk types in the PNG image format consist of four octets, but it's handy if you're programming a PNG encoder or decoder that IDAT means "image data", and PLTE means "palette".
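Both examples above are easy to see from Python (the byte values are taken from the protocol/format definitions themselves): the octets on the wire or on disk are defined by the spec, but because they fall in the ASCII range, a programmer staring at a hex dump or packet capture can read them directly.

```python
# A PNG chunk type is four octets; because the spec chose ASCII
# letters, the raw bytes are self-describing in a hex dump.
chunk_type = bytes([0x49, 0x44, 0x41, 0x54])  # as read from a PNG file
print(chunk_type.decode("ascii"))  # IDAT -- "image data", obvious at a glance

# Likewise, an HTTP/1.1 request line is just octets on the wire,
# yet perfectly readable in a packet capture:
request = b"GET /index.html HTTP/1.1\r\n"
print(request.decode("ascii"), end="")
```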

Of course you need to be careful that the data really isn't going to be presented to end users, because if it ends up being visible (as happened in the case of URLs), then users are rightly going to expect that data to be in a language they can read.

Well said. It's a little ironic that HTTP, the protocol that transmits the most Unicode on the planet, only needs to support ASCII. (Actually, I suppose the same goes for TCP and IP: binary support, ASCII support... that's all you need at that level of the stack)
–
ebyrobApr 1 '14 at 13:58