Oracle Tip: Create accented characters with COMPOSE and UNISTR

Oracle9i introduced the COMPOSE function, which allows you to take a sequence of Unicode characters and normalize the text. Find out how you can enter the Unicode sequence in pure ASCII text with the help of the UNISTR function.

This article originally appeared in the Oracle e-newsletter. Click here to subscribe automatically.

Many languages, including English, use
accented characters. Since these characters aren't part of the
ASCII character set, it can be difficult to write code that uses
these characters without looking up Unicode values or using a
Unicode editor and converting to a known character set.

Oracle9i introduced the COMPOSE
function, which allows you to take a sequence of Unicode characters
and normalize the text. This means that it will take a letter such
as 'a' (Unicode character 0097) and a combining mark, such as grave
accent (Unicode character 0300), and create a single character that
is the combination of the two marks. COMPOSE uses special combining
marks, which are a part of the Unicode
standard, rather than the ASCII equivalent punctuation marks.
The result should be Unicode character 00E0 (little latin letter
'a' with a grave accent).

The most common combining characters in ANSI
are:

U+0300: grave accent ( ` )

U+0301: acute accent ( ' )

U+0302: circumflex accent(^)

U+0303: tilde (~)

U+0308: umlaut

It may be difficult for you to enter Unicode
characters 0097 and 0300 at a keyboard without special software or
keyboard drivers. Therefore, one way to enter the Unicode sequence
in pure ASCII text is with the help of the UNISTR function. This
function takes a string of ASCII characters and creates a sequence
of Unicode characters in the national character set (which are
usually installed as either 16-bit Unicode or UTF-8 character set).
It uses hex escape sequences to map any non-ASCII characters in a
way that's similar to Java.

To enter the sequence for a, followed by grave accent
combining character, you can use UNISTR('a\0300') instead of trying
to enter the character directly in the code. This function will
work correctly under any character set and any database that has a
Unicode-based national character set. You can put more than the
combining characters inside the functions—you can include a mix of
ASCII and Unicode escapes in the UNISTR function. For example, you
could write:

select COMPOSE(UNISTR('Unless you are nai\0308ve,
meet me at the cafe\0301 with
your re\0301sume\0301.')) from dual;

When you combine the output of this function
with COMPOSE, you can generate a Unicode character without looking
up each individual value. For example:

select 'it is true' if compose(unistr('a\0300')) =
unistr('\00e0');

The COMPOSE function returns an NVARCHAR2
string, which is usually Unicode-based. When using these characters
locally, the database will attempt to map the Unicode characters to
your local character set when it does an implicit TO_CHAR on the
results. Not all characters can be mapped, and there are some
character combinations that don't work in COMPOSE because the
Unicode consortium hasn't defined them at the level used by the
Oracle database.

To
run a quick check of how characters look in a particular
environment, you can run a script similar to the following to see
how the combined characters will be mapped on output. You may need
to investigate NLS_LANG settings to make sure these characters
return correctly:

create or replace type hexrange_tbl as table of
varchar2(4);
/
show errors;

For fun, here's a short PL/SQL script that uses
COMPOSE and UNISTR to create an effect which many SMS users,
hackers, and spammers around the world use to make readable English
text that is difficult to scan because it uses a random sequence of
accented versions of characters. I use DBMS_RANDOM to randomly
select a combining character that can be used with different
letters and lets SQL combine and convert back to produce
ANSI/Latin-1 output. This script is hard-coded against the ENAME
column of the EMP table.