July 2008 Archives

A relatively "hot" new addition to Unicode 5.1 is LATIN CAPITAL LETTER SHARP S (the capital form of Sharp S or ß) for German. I thought I'd write about this because it covers both policy and an important Unicode concept: casing.

About Sharp S (ß)

Many of you may already know about lowercase Sharp S (ß), which is used in German spelling as a replacement for "ss". For instance, the German word gross 'large' could also be spelled as groß, and Strasse 'street' can be spelled as Straße. The form itself is an old manuscript convention that was incorporated into modern typography.

So far so good, but what it means from a computing perspective is that any program working with German text has to know that gross and groß are essentially the same word, just with slightly different spellings. If you're searching a library database, for instance, you would want to see both sets of results. On an interesting side note, I entered groß and pulled up the English Wikipedia page on the "gross" unit of measure as the first result. Correct, but weird.
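As a rough illustration of that kind of matching, here is a minimal Python sketch. It assumes Python 3's built-in `str.casefold()`, which applies Unicode full case folding; the `find_matches` helper and the sample catalog are hypothetical and only stand in for a real search system.

```python
def find_matches(query, titles):
    """Return the titles that contain the query once both are case-folded."""
    folded_query = query.casefold()  # full case folding turns "ß" into "ss"
    return [title for title in titles if folded_query in title.casefold()]


# Hypothetical catalog entries, for illustration only
catalog = ["Der große Duden", "Gross und klein", "Strassenkarte"]

print(find_matches("groß", catalog))
print(find_matches("GROSS", catalog))
```

Both queries should return the same two titles, since the comparison happens on the folded forms rather than on the original spellings.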

But not capital ß

In official German spelling convention, there is NO CAPITAL SHARP S. First, no German word starts with "SS", so no word could ever begin with ß anyway. But even if a word is in all-caps or small caps, the convention should be to convert all ß to SS - thus groß should be GROSS in all caps.

Makes sense...except that people in Germany DO use capital Sharp S in some signs, gravestones and business names (similar to "Nite-Quil" instead of "Night-Quill"). The 2004 Proposal on Encoding Capital S Sharp (PDF) contains a variety of photographs of Capital S Sharp in use. You can see one of these on Wikipedia (Capital ß page). In other words, Unicode ultimately has to bow to social usage.

So finally we have the official Unicode announcement...

Official Unicode 5.1 Announcement

U+1E9E LATIN CAPITAL LETTER SHARP S

In particular, capital sharp s is intended for typographical representations of signage and uppercase titles, and other environments where users require the sharp s to be preserved in uppercase. Overall, such usage is rare. In contrast, standard German orthography uses the string "SS" as uppercase mapping for small sharp s. Thus, with the default Unicode casing operations, capital sharp s will lowercase to small sharp s, but not the reverse: small sharp s uppercases to "SS". In those instances where the reverse casing operation is needed, a tailored operation would be required.
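The asymmetry described in the announcement can be seen in any language that implements the default Unicode case mappings. The sketch below uses Python 3's `str.upper()` and `str.lower()`; the `upper_preserving_sharp_s` function is a hypothetical example of a "tailored operation", not something defined by Unicode or by Python.

```python
SMALL_SHARP_S = "\u00df"    # ß
CAPITAL_SHARP_S = "\u1e9e"  # ẞ


def upper_preserving_sharp_s(text):
    """Hypothetical tailored uppercasing: map ß to ẞ instead of "SS"."""
    # Swap in the capital form first; upper() then leaves it unchanged.
    return text.replace(SMALL_SHARP_S, CAPITAL_SHARP_S).upper()


print("groß".upper())                             # default: ß uppercases to "SS"
print(CAPITAL_SHARP_S.lower() == SMALL_SHARP_S)   # ẞ lowercases to ß
print(upper_preserving_sharp_s("Hans Straßer"))   # tailored: ß is preserved as ẞ
```

The tailoring is kept in a separate function so that the default behavior, which matches standard German orthography, stays untouched everywhere else.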

Now What?

First, fonts will have to be developed to include a capital ß variant. This may or may not be in your system yet. Here's a quick test. It wasn't looking good, even though I am on a Leopard Mac.

| Character Name | Unicode Number | Character |
|---|---|---|
| LATIN SMALL LETTER SHARP S | U+00DF | ß |
| LATIN CAPITAL LETTER SHARP S | U+1E9E | ẞ |
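Whether the glyph displays depends on your fonts, but you can separately check whether your software's character database knows the code point. A small sketch, assuming Python 3 and its standard `unicodedata` module (the `describe` helper is made up for this example):

```python
import unicodedata


def describe(ch):
    """Return the code point, official Unicode name, and the character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} {ch}"


for ch in ("\u00df", "\u1e9e"):
    print(describe(ch))
```

If the name prints but the character itself shows up as a box or question mark, the missing piece is the font, not the encoding.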

Next comes the "casing" question. Casing is the set of equivalences which match capital and lowercase letters as "the same" even though they are really two Unicode code points. For instance, capital A is encoded as U+0041 (ASCII 65) while lowercase a is U+0061 (ASCII 97). When you search Google and most databases, both A and a are treated the same (yet are kept distinct enough so that you can switch between A and a in your word processor). Note that English casing (technically "accent folding", added Mar 5, 2010) also conflates Á, Å, À, Ä as just A.

Update from 8 Aug: Technically this probably isn't "casing", but the principle is the same - you conflate certain variants as "one" character.
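To make the distinction concrete, here is one possible way to conflate both case and accents in Python 3. It is only a sketch of the general idea: it assumes that stripping combining marks after NFD normalization is an acceptable form of accent folding, which is not true for every language, and `fold_case_and_accents` is a hypothetical helper name.

```python
import unicodedata


def fold_case_and_accents(text):
    """Fold case, then drop combining marks (a simple accent-folding sketch)."""
    # NFD splits a letter like "á" into "a" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


print(fold_case_and_accents("A") == fold_case_and_accents("a"))
print({fold_case_and_accents(c) for c in "ÁÅÀÄ"})
```

The original text is never modified; the folded form is only used as a comparison key, which is how A and a can match in a search yet stay distinct in the document.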

As stated before, official German spelling does not recognize capital ß. Not surprisingly, though, there was a discussion on the Unicode list just this week about whether this too will change over time. I'll be staying tuned.

A Linguistic Closing Thought

Normally linguists talk about seeing a sound change or a grammar change in progress, but this appears to be a spelling change in progress. The Wikipedia Capital ß page claims that legal documents often use capital sharp S in all-caps names in order to avoid ambiguity (e.g. the defendant Hans Straßer or HANS STRAßER). And apparently the most notorious use of capital ß is the title page of Der Große Duden (The Great Duden dictionary), which was rendered as DER GROßE DUDEN. Clearly the capital sharp S was destined for permanent encoding.