Answered by:

ASCII values for extended characters

Question

I am building a T-SQL script where I need to get the ASCII value of a character. That was easy enough using the ASCII() function. When analyzing the text data, I found that it contains some extended characters, for example, the "a" with an
accent: "á". This is ASCII char 160:

However, when I get the ASCII in T-SQL, it returns 225:

Why is this? I am currently looking for more information that will explain what is going on. I suspect that it might have to do with character sets. On this hunch I got the character set that the database is using:

Answers

To be pedantic, ASCII values are in the range from 32 to 127. That is what is covered by the ASCII standard. Not more.

Today, there are a lot of character sets in use, but about all of them agree with ASCII with the range 32 to 127. However, for the code points beyond 127 they may disagree.

In Windows terms, a character set is called a code page. It we restrict ourselves to code pages used for character sets west of India (i.e. Latin, Cyrillic, Greek, Hebrew and a few more scripts), these code pages encompass code points up to 255, that is
as much as you can fit into a single byte.

Obviously, code pages for Latin and Greek will be different, since they will include different characters. A code page for Greek includes a glyph for alpha which is missing from a Latin code page, which on the other hand may include á.

To make things more complicated, when these eight-bit sets were introduced in the end of the eighties, different vendors invented their own "extended ASCII". I don't know the exact history, but it was the definition used by Digital that prevailed
as the standard known as Latin-1 which later became the foundation for Unicode. In Latin-1 code point 160 is no-break-space, and lowercase a with acute accent is found at position 225.
The chart you are looking at is for a different set, I would guess one of CP437 and CP850 which were the code pages that IBM devised and used for their PC. Both of these are intended for Western Europe, but CP850 is better for multi-national use. Windows still
use any of these in the command-line window for backwards compatibility.

SQL Server does have some collations that support these legacy code pages, and you can see which code page a collation is for by running

SELECT collationproperty('collationname', 'CodePage')

The collation you have in your screen shot uses code page 1252, which is the same as Latin-1.

I don't know what reference you used but "extended ASCII" is a misnomer. ASCII characters are in the 0-127 code point range and the remaining 128-255 code point characters are determined by the code page. This should have been called out in the
reference.

You are on the right track that the different characters and code points are related to the collation since the collation determines the code page SQL Server uses for char/varchar data. SQL_Latin_General_CP1_CI_AS
uses Windows code page 1252 whereas the reference might use code page 437. To wit:

UNICODE is a character set for example but you have several Encoding for this character set like UTF-16LE, UTF-16BE, UTF-7, UTF-8, and so on. The Code Page is the name of the Encoding, for example the above encoding has the code pages: 1200,
1201, 65000, 65001, and so on. So we can see that a single character set (UNICODE) have multiple Code Points, since it has multiple Character Encoding.

In short here are the main terms:

> "Character" is a logical entity used by people.
> “Encoded Value” is set of zero and one which is what Computers use
> “Code Points” are simply a range of numbers, which used to map between characters and the encoded values.
> "Character set" is a group of characters, mapped to specific range of code points. Same character SET name can be used for different groups of characters like in the case of Extended ASCII, but these will be mapped to the same range of Code Points.
> “Character Encoding” is the algorithm we use to encode the code points into Encoded values
> "Code page” is a number/name which represent the “Character Encoding”

The terms are a bit different from one documentation to the other (unfortunately) which lead to a lot of confusion.

As Erland said, different vendors invented their own Encoding for the character set "extended ASCII". The above image shows Microsoft code points but there are other code points for IBM Encoding for example. For example Code page 862 is a
code point for IBM encoding for Hebrew, which by the way it is used under DOS for Hebrew.

ASCII standard character set was initially composed of 128 characters (7-bit code). The first 32 characters are control characters but as Dan said
they are part of ASCII code points range which is 0-127.

SQL Server uses a parameter name COLLATE to represent multiple properties which configure the rules of how to use the data. One of these properties
together with the data type configures the Code Page. For example using COLLATE with data type NVARCHAR will always be UTF16/UCS2 encoding, but with data type VARCHAR different collate can have different Code Points. In SQL 2019 we simply got
more COLLATE which support the encoding UTF8.

In order to get the different properties of each COLLATE we can use the function COLLATIONPROPERTY

In order to get the COLLATE we have, we can use the function sys.fn_helpcollations()

Putting these two together the following query will return some of the properties of the COLLATE in the system:

If you execute this query you can get the code points for the encoding for different COLLATE if we use data types like CHAR and VARCHAR.

As Erland said code point 1252 is used to represent LatinQuote from Wikipedia: "Windows-1252 or CP-1252 (code page 1252) is a single-byte
character encoding of the
Latin alphabet, used by default in the legacy components of Microsoft Windows for English and
some other Western languages".

All replies

To be pedantic, ASCII values are in the range from 32 to 127. That is what is covered by the ASCII standard. Not more.

Today, there are a lot of character sets in use, but about all of them agree with ASCII with the range 32 to 127. However, for the code points beyond 127 they may disagree.

In Windows terms, a character set is called a code page. It we restrict ourselves to code pages used for character sets west of India (i.e. Latin, Cyrillic, Greek, Hebrew and a few more scripts), these code pages encompass code points up to 255, that is
as much as you can fit into a single byte.

Obviously, code pages for Latin and Greek will be different, since they will include different characters. A code page for Greek includes a glyph for alpha which is missing from a Latin code page, which on the other hand may include á.

To make things more complicated, when these eight-bit sets were introduced in the end of the eighties, different vendors invented their own "extended ASCII". I don't know the exact history, but it was the definition used by Digital that prevailed
as the standard known as Latin-1 which later became the foundation for Unicode. In Latin-1 code point 160 is no-break-space, and lowercase a with acute accent is found at position 225.
The chart you are looking at is for a different set, I would guess one of CP437 and CP850 which were the code pages that IBM devised and used for their PC. Both of these are intended for Western Europe, but CP850 is better for multi-national use. Windows still
use any of these in the command-line window for backwards compatibility.

SQL Server does have some collations that support these legacy code pages, and you can see which code page a collation is for by running

SELECT collationproperty('collationname', 'CodePage')

The collation you have in your screen shot uses code page 1252, which is the same as Latin-1.

I don't know what reference you used but "extended ASCII" is a misnomer. ASCII characters are in the 0-127 code point range and the remaining 128-255 code point characters are determined by the code page. This should have been called out in the
reference.

You are on the right track that the different characters and code points are related to the collation since the collation determines the code page SQL Server uses for char/varchar data. SQL_Latin_General_CP1_CI_AS
uses Windows code page 1252 whereas the reference might use code page 437. To wit:

UNICODE is a character set for example but you have several Encoding for this character set like UTF-16LE, UTF-16BE, UTF-7, UTF-8, and so on. The Code Page is the name of the Encoding, for example the above encoding has the code pages: 1200,
1201, 65000, 65001, and so on. So we can see that a single character set (UNICODE) have multiple Code Points, since it has multiple Character Encoding.

In short here are the main terms:

> "Character" is a logical entity used by people.
> “Encoded Value” is set of zero and one which is what Computers use
> “Code Points” are simply a range of numbers, which used to map between characters and the encoded values.
> "Character set" is a group of characters, mapped to specific range of code points. Same character SET name can be used for different groups of characters like in the case of Extended ASCII, but these will be mapped to the same range of Code Points.
> “Character Encoding” is the algorithm we use to encode the code points into Encoded values
> "Code page” is a number/name which represent the “Character Encoding”

The terms are a bit different from one documentation to the other (unfortunately) which lead to a lot of confusion.

As Erland said, different vendors invented their own Encoding for the character set "extended ASCII". The above image shows Microsoft code points but there are other code points for IBM Encoding for example. For example Code page 862 is a
code point for IBM encoding for Hebrew, which by the way it is used under DOS for Hebrew.

ASCII standard character set was initially composed of 128 characters (7-bit code). The first 32 characters are control characters but as Dan said
they are part of ASCII code points range which is 0-127.

SQL Server uses a parameter name COLLATE to represent multiple properties which configure the rules of how to use the data. One of these properties
together with the data type configures the Code Page. For example using COLLATE with data type NVARCHAR will always be UTF16/UCS2 encoding, but with data type VARCHAR different collate can have different Code Points. In SQL 2019 we simply got
more COLLATE which support the encoding UTF8.

In order to get the different properties of each COLLATE we can use the function COLLATIONPROPERTY

In order to get the COLLATE we have, we can use the function sys.fn_helpcollations()

Putting these two together the following query will return some of the properties of the COLLATE in the system:

If you execute this query you can get the code points for the encoding for different COLLATE if we use data types like CHAR and VARCHAR.

As Erland said code point 1252 is used to represent LatinQuote from Wikipedia: "Windows-1252 or CP-1252 (code page 1252) is a single-byte
character encoding of the
Latin alphabet, used by default in the legacy components of Microsoft Windows for English and
some other Western languages".

Erland, Dan,
and Ronen - thank you for your replies. This is exactly what I was looking for, to understand the reason behind the "odd" result I got with ASCII('á').

I guess the function name "ASCII" is a bit misleading since it only returns the ASCII of the char for those whose value is in
the 0 to 127 range. For those chars above 127, it will return the value as per the code page being used. I understand that the "extended ASCII" set is an IBM invention for their IBM PC and that it has been adopted as an unofficial "standard"

Erland, not pedantic at all! There are so many details that, like the telephone game, can result in different interpretations. You'll get
no argument from me, as I prefer not to delve too deeply into the fine details; however, I suspect that you will get an argument from Wiki, since the definition presented is as follows:

Erland, Dan,
and Ronen - thank you for your replies. This is exactly what I was looking for, to understand the reason behind the "odd" result I got with ASCII('á').

I guess the function name "ASCII" is a bit misleading since it only returns the ASCII of the char for those whose value is in
the 0 to 127 range. For those chars above 127, it will return the value as per the code page being used. I understand that the "extended ASCII" set is an IBM invention for their IBM PC and that it has been adopted as an unofficial "standard"

Erland, not pedantic at all! There are so many details that, like the telephone game, can result in different interpretations. You'll get
no argument from me, as I prefer not to delve too deeply into the fine details; however, I suspect that you will get an argument from Wiki, since the definition presented is as follows:

First, thanks for the thanks 👍✔ I appreciate it.
Unfortunately most people here forget to say thanks and I hate it.

Back to the discussion...

I think that you still have a confusion regarding the function ASCII.
This function is NOT "only returns the ASCII
of the char for those whose value is in the 0 to 127 range".

The function ASCII is actually Extended ASCII function, but for short name Microsoft choose to use the name ASCII. Basically in SQL Server when we speak about ASCII then we probably mean Extended ASCII.

The function ASCII returns the extended ASCII Code Point between 0 and 255. The encoded values of Code Points in the range 0-127 for Extended ASCII are the same as fro ASCII.

Another function which we must to mention is the opposite direction function CHAR which get Code Point as input and the returns the character value in Extended ASCII