Khitan Large Script (契丹大字)

The "Khitan large script" is a logographic/syllabic script derived from and imitating Chinese characters (see How Complex is Tangut ? for a discussion of the relative complexities of the Tangut, Jurchen and Khitan large scripts). It comprises several hundred characters that each have a distinct logographic meaning or syllabic pronunciation. These characters are not only similar in form and construction to Chinese characters, but up to 30% of the known Khitan large characters are borrowed directly from Chinese for use as phonetic borrowings or to represent borrowed words from Chinese (see for example the borrowed characters 皇帝 "emperor" and 囯 "country" in the Memorial of the Prince of the North shown below). Some of the characters in the later Jurchen script appear to be derived from Khitan large characters, and have a common pronunciation or meaning.

Part of a rubbing of the memorial stone of the Prince of the North (Khitan large script)

Khitan Small Script (契丹小字)

The "Khitan small script" is a mixed writing system that mostly comprises phonetic elements that are joined together in a rectangular phonographic block that represents the pronunciation of a word, together with a relatively small number of logographic characters that are used to represent frequently used vocabulary such as numbers and the cardinal directions. The phonetic elements that make up phonograms appear to be derived from Chinese characters, and some of them are the same as Khitan large characters, although none of the Khitan small script logographic characters are the same as the corresponding Khitan large script logographic characters.

Part of a rubbing of the memorial tablet of Emperor Daozong (Khitan small script)

The phonetic elements are arranged in groups of one through seven characters as shown below.

Arrangement of phonetic elements making up phonograms in the Khitan small script

There are about fifty known monumental inscriptions in the two Khitan scripts, of which about 17 are in the Khitan large script and about 33 are in the Khitan small script, which suggests that the small script was more widely used than the large script, but it is not known why the Khitan people used these two different scripts, or what determined the choice of which script to use. Japanese uses multiple different scripts (kanji, hiragana and katakana), but these are differentiated functionally, and are normally used in conjunction within the same text; whereas the two Khitan scripts appear to be mutually exclusive as they never occur together on the same monument or artefact. Why then are there two Khitan scripts ?

Hypothesis A : Chronological Variation

The first idea that springs to mind is the possibility that the two scripts were not used at the same time. Perhaps one script was used first, but was later displaced by the other script. According to the History of the Liao (see juan 2, 75 and 89), the Khitan large script was created by order of Emperor Taizu of Liao with the assistance of Yelü Tulübu 耶律突呂不 and Yelü Lubugu 耶律魯不古, and was introduced at the start of the year 920. The small script was reputedly devised four or five years later, influenced by the Uyghur script, by Yelü Diela 耶律迭剌, the younger brother of Emperor Taizu. We might therefore expect that the large script was used during the reign of Emperor Taizu, and the phonetic small script gradually became more widely used after the death of the emperor in 926, eventually displacing the more cumbersome large script. However, this is not borne out by the extant corpus of inscriptions.

Khitan Large Script Inscriptions by Date

Year | Inscription
986  | Memorial for Yelü Yanning 耶律延寧 (946–985)
1041 | Memorial for the Prince of the North 北大王 (Yelü Wanxin 耶律萬辛, 972–1041)
1056 | Memorial for an unknown person
1056 | Memorial for the Grand Preceptor (太師)
1058 | Stone inscription from Dornogovi Province, Mongolia
1062 | Memorial for Yelü Changyun 耶律昌允 (1000–1061)
1072 | Memorial at Jing'an Temple (靜安寺) erected by the Lady of Lanling Commandery (蘭陵郡夫人)
     | Record of the Younger Brother of the Emperor of the Great Jin Dynasty (Da Jin huangdi dutong jinglüe Langjun xingji 大金皇弟都統經略郎君行記)
1150 | Memorial for Xiao Zhonggong 蕭仲恭
1171 | Memorial for the Jin Dynasty Defense Commissioner of Bozhou 金代博州防禦使

The first noticeable feature of the above tables is that there are no dated Khitan inscriptions dating to the time of Emperor Taizu or any time soon after. Except for a single large Khitan inscription dating to 986, the earliest dated inscriptions only date back to the mid 11th century, over a hundred years after the recorded creation of both scripts. Clearly the large script was not displaced soon after the death of Emperor Taizu. On the contrary, the two scripts seem to have coexisted happily for at least a hundred years, from the mid 11th century, through the fall of the Liao dynasty (907–1125), and into the first half of the Jin dynasty (1115–1234). Both scripts seem to have continued in use up to at least the 1170s, with neither displacing the other, and it was only with the proscription of Khitan by the Jurchen court in 1191–1192 that both scripts finally fell out of use.

Distribution of Khitan Inscriptions by Date

Hypothesis B : Geographic Variation

If the Khitan scripts do not show any significant chronological variation, then perhaps they show a different geographical distribution, with the Khitan small script used in one part of the Khitan territory, and the large script in another part of the Khitan territory. But this does not appear to be supported by the distribution map shown below (click on the map to explore it in greater detail). Although there does seem to be some clustering of small script inscriptions, there is no obvious geographical distinction between the two scripts.

Hypothesis C : Different Functional Usage

Perhaps the two scripts had different functions, for example one for writing religious texts and one for writing secular texts, or one for writing official and court documents and one for writing private and personal documents ? But as both scripts were commonly used for exactly the same function (writing memorials for the dead) this theory seems to be a non-starter.

Hypothesis D : Different Social Usage

Maybe the two different scripts were used by two different sections of the Khitan population. Was one script used by men and the other script used by women ? This seems not to be the case, as both scripts are used to write memorials for both men and women. Was one script used by royalty and nobility, and the other script used by commoners ? Probably not, as there are memorials to princes and princesses in both scripts, although the only memorials to emperors and empresses found so far are in the small script. Were the scripts used by different clans ? Again, there is no evidence for this, as both scripts were used to write memorials for members of the Yelü 耶律 clan.

Hypothesis E : Different Linguistic Usage

A final possibility is that the two scripts were used to write two different languages or dialects. Although there is no evidence that the Khitans spoke more than a single language, it is a possibility that cannot be discounted. But it is a theory that is difficult to prove or disprove as most of the Khitan words that have been identified in the small script are borrowings from Chinese, and almost all the large Khitan script words for which a reading has been proposed are also borrowings from Chinese.

Having looked at and discounted the various possibilities outlined above, we seem to be none the wiser about why there were two completely different ways of writing the Khitan language. Both scripts are complex enough to require a considerable investment of time and effort to learn to read and write, so how is it possible that both scripts managed to coexist and flourish for so long ? Did the Khitan education system require students to learn both scripts, or were Khitan scholars only able to read and write one or other of the two scripts ? It makes no sense to me ...

Addendum A [2011-10-15]

Unbeknownst to me at the time I wrote this post, less than a month earlier, on the 29 November 2010, Viacheslav Zaytsev of the Institute of Oriental Manuscripts [IOM] in Saint Petersburg had announced his identification of a 100+ page manuscript codex as being written in the Large Khitan script. This manuscript had been held at the IOM for many years, but as it was written in a cursive hand no-one had been able to identify the script with certainty. Most experts who had seen the manuscript had thought it was probably written in the Jurchen script, but by carefully comparing the text of the manuscript with memorial inscriptions written in Large Khitan, Zaytsev had been able to identify stretches of text that occurred in both, and he was thereby able to prove for the first time that the manuscript was written in the Large Khitan script. This is the first and only manuscript written in either Large or Small Khitan to have been identified.

Addendum B [2011-10-21]

Viacheslav Zaytsev has drawn my attention to the fact that a fragment of a Khitan large script inscription was identified by Wang Ding in 2002. I discuss this fragment in Khitan Miscellanea 1.

Important Technical Note

BabelPad and BabelMap were scheduled for release on 11 October, to coincide with the release of Unicode 6.0 on that day, but their release was delayed due to a Blue Screen of Death crash that occurred with the beta versions of both BabelMap and BabelPad when the Windows function ExtTextOutW is called within a path bracket, and the selected font is Symbola font version 6.00, and the ETO_GLYPH_INDEX flag is set, and the glyph index passed to the function corresponds to U+1F5FD STATUE OF LIBERTY (this problem only occurs in BabelPad when in Simple Rendering mode, which bypasses Microsoft's Uniscribe rendering engine). The glyph for U+1F5FD in the Symbola font has a mega-complex glyph outline (which oddly enough is the glyph for an angel, whilst the glyph for the Statue of Liberty is actually at U+FFFED), which probably results in a buffer overrun somewhere within Windows GDI. In order to work around this problem I have had to rewrite, refactor and retest core sections of the source code.

The newly released versions of BabelPad and BabelMap fix the problem described above, and should be safe for use with the Symbola font under normal usage scenarios, but the glyphs for U+1F5FB (MOUNT FUJI) through U+1F5FF (MOYAI) are rendered very slowly because of their extreme complexity (several thousand points for each glyph), resulting in sluggish response in BabelMap when scrolling through the Miscellaneous Symbols And Pictographs block, and potentially extremely sluggish performance in BabelPad.

Moreover, the Symbola font may still cause a Blue Screen of Death crash (reporting an infinite loop) on some systems when rendering U+1F5FD STATUE OF LIBERTY at high point sizes with standard Windows applications such as Notepad (my test case is to set Notepad to use the Symbola font at 72 points, and then paste in a string comprising twelve instances of U+1F5FD — my XP machine then blue screens, although my Vista machine is OK). This general Windows-level vulnerability to Symbola version 6.00 means that BabelMap may still blue screen if you insert multiple instances of U+1F5FD into the BabelMap edit buffer, and BabelPad may still blue screen if you attempt to display a document with multiple instances of U+1F5FD at a large point size in Complex Rendering mode (i.e. using Uniscribe). For this reason, you are strongly advised not to install Symbola version 6.00, but if you do install this font I cannot be responsible for any loss or damage incurred due to a system crash when running either BabelPad or BabelMap.

BabelPad Enhancements

BabelPad now emulates the Alt-X functionality found in Microsoft Word and WordPad (position the caret after a hexadecimal code point value and hit Alt-X to convert it to the corresponding Unicode character; and position the caret after a Unicode character and hit Alt-X to convert it to its corresponding hexadecimal code point value)

Convert Unicode character names to their corresponding Unicode character (due to the difficulty of disambiguating strings such as "bell symbol for bell with cancellation stroke" where "bell", "bell symbol", "symbol for bell" and "bell with cancellation stroke" are all Unicode character names, the selected text must be an exact Unicode name or formal alias, and not a partial name or a longer text string containing a Unicode name; although you can use the contextual convert utility to convert structured data such as <UnicodeName>Vulgar Fraction Three Quarters</UnicodeName> to <UnicodeName>¾</UnicodeName>) ["Convert : Unicode Name to Character" from the main menu or the right-click menu]

Convert Han ideographs to their pinyin or jyutping readings (not perfect as characters with multiple readings are converted to a slash-separated list of readings, even when one reading is considerably more common than another, but this feature may be useful for some users in some situations)

Title casing options for either Script Neutral title casing (e.g. The Owl And The Pussy-Cat Went To Sea) or English title casing (e.g. The Owl and the Pussy-Cat Went to Sea) ["Options : Title Casing" from the menu]

The default script colours when colour coding by script is selected ["Options : Display Colours : Colour Code by Script" from the menu] have been harmonized with the default script colours used for BabelMap, and an option to reset all script colours to their default values has been added to the "Configure Script Colours" dialog (this needs to be selected for the new default colours to be used).
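As a rough illustration of three of the conversions listed above (Alt-X toggling, exact-name lookup, and the two title casing styles), here is a minimal Python sketch. It is not BabelPad's code: the function names `alt_x`, `name_to_char` and `title_case` and the `SMALL_WORDS` list are made up for this example, and the name lookup simply relies on Python's standard `unicodedata.lookup`, which, like the feature described above, only accepts an exact character name or alias.

```python
import re
import unicodedata
from typing import Optional

# Words left in lower case by the "English" style (an assumed, partial list).
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for",
               "in", "of", "on", "or", "the", "to"}

# A run of 4-6 hex digits at the end of the text, optionally prefixed by "U+".
HEX_AT_END = re.compile(r"(?:U\+)?(?<![0-9A-Fa-f])([0-9A-Fa-f]{4,6})$")


def alt_x(text: str) -> str:
    """Toggle the end of `text` between a hex code point and its character."""
    match = HEX_AT_END.search(text)
    if match and int(match.group(1), 16) <= 0x10FFFF:
        return text[:match.start()] + chr(int(match.group(1), 16))
    if text:
        # No usable hex value: turn the last character into its hex value.
        return text[:-1] + f"{ord(text[-1]):04X}"
    return text


def name_to_char(name: str) -> Optional[str]:
    """Return the character with exactly this name or alias, else None."""
    try:
        return unicodedata.lookup(name.strip().upper())
    except KeyError:
        return None  # partial names and longer strings are rejected


def title_case(text: str, english: bool = False) -> str:
    """Capitalize every word, or keep English small words in lower case."""
    result = []
    for index, word in enumerate(text.split(" ")):
        if english and index > 0 and word.lower() in SMALL_WORDS:
            result.append(word.lower())
        else:
            # Capitalize each hyphen-separated part ("pussy-cat" -> "Pussy-Cat").
            parts = [p[:1].upper() + p[1:].lower() for p in word.split("-")]
            result.append("-".join(parts))
    return " ".join(result)


print(alt_x("00BE"), alt_x("¾"))
print(name_to_char("Vulgar Fraction Three Quarters"))
print(title_case("the owl and the pussy-cat went to sea"))
print(title_case("the owl and the pussy-cat went to sea", english=True))
```

Note that the Alt-X sketch shares the ambiguity of any such feature: a word that happens to end in four or more hexadecimal letters (such as "cafe") is treated as a code point value.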

BabelMap Enhancements

Script colours when colour coding by script has been selected are now user configurable ["Options : Customize Colours..." from the menu]

When colour coding of characters has been selected, the character with focus is no longer highlighted in red

The character with focus in the character grid is now indicated by its cell having an inset appearance

Option to rotate or not rotate the glyphs for vertical scripts (Mongolian and Phags-pa) where the selected font has rotated glyphs for vertical layout (in previous versions of BabelMap the glyphs are always rotated) ["Options : Other Options : Rotate Vertical Scripts" from the menu]

The Export Font Glyphs utility has been improved to ensure glyphs are not accidentally clipped in some cases

The Han Radical Lookup utility has been updated to cover CJK-D (now covers all 74,616 CJK unified ideographs)

The Advanced Character Search utility now has an option to only give the total number of characters matching the selected criteria, and not list them all (this makes searches which return a large number of results, for example when querying how many characters were introduced in a particular version of Unicode, very fast)

Monday, 24 May 2010

Why Windows 7 No Longer Sucks [2011-03-01]

On 22nd February 2011 Windows 7 Service Pack 1 (SP1) was released, and I am very pleased to say that all the rendering issues discussed below are now solved.

Typing PUA Tangut under Windows 7 plus SP1

Internet Explorer 8 under Windows 7 plus SP1

Original Post [2010-05-24]

In many ways Windows 7 is a great improvement on Vista, but this is the sad story of why my children have the use of my shiny new Windows 7 laptop, and I am sticking to the old, not very user-friendly and not very reliable Vista laptop. I hope that one day I will be able to write a blog extolling the virtues of Windows 7, but given the contents of the forthcoming Service Pack 1 it seems very unlikely to happen any time soon, and at the current rate of (lack of) progress, I am afraid that Microsoft will lose more and more of the few remaining loyal customers like myself who find it impossible to do cutting edge Unicode stuff with an operating system that values gimmicks over functionality, and for every step forward takes two steps backwards.

Prototyping Tangut IMEs

In anticipation of the eventual encoding of the Tangut script in Unicode, I have been prototyping a couple of Input Methods for Tangut that use the table driven text service that is available in Windows Vista and Windows 7 (see Michael Kaplan's twelve-part series Behold the Table Driven Text Service for a tutorial).

Installing on my Windows Vista laptop I get the following results (the icons, StrokeCode.ico and Alphacode.ico, are a little degraded in the jpgs) :

Tangut Stroke Code IME under Windows Vista

Tangut Alphacode IME under Windows Vista

Hmm, the IMEs both work just fine, but the Tangut characters in the candidate list show up as little squares, which means that if two or more characters share the same alphabetic code sequence you have to guess which character to choose, and even if it is a unique alphabetic code sequence it would be nice to see what the character looks like. Unfortunately, for Vista there is no way to specify what font to use for the candidate window, but as explained by Michael Kaplan in Can't I pick the candidate list font if I don't speak fluent square box?, Windows 7 introduces new FontFaceName and FontSize parameters for the TableTextService file format. So let's install these two IMEs (with Unicode Tangut specified at 16 points) and the Unicode Tangut font on my Windows 7 laptop and see what happens.

Why Windows 7 Sucks

D'oh, that's one step forward and two steps backwards. The candidate window is now using the Unicode Tangut font as specified, but both in the candidate window and in BabelPad the Tangut characters (currently reserved code points) are displayed as little square boxes, in fact two square boxes per character, which suggests that surrogate code points are being rendered separately rather than combined as a single character. But perhaps this is a problem with BabelPad; let's see what it looks like with Notepad :

Now, at this point there will be some people who will be saying, "of course your so-called Tangut text doesn't display properly, because you are using unassigned Unicode codepoints". Ignoring the fact that it does display OK in Windows Vista, Windows XP and even Windows 2000 if the Unicode Tangut font is installed (as Tangut is not a complex script from a rendering perspective, it does not need support from Uniscribe to render correctly), let's take a look and see how Windows 7 copes with a recently-encoded script like Egyptian Hieroglyphs which does have officially assigned Unicode characters (NB Egyptian Hieroglyphs render fine under Windows Vista with a font like Aegyptus) :

The Egyptian hieroglyphs render OK in the character grid and in the popup window, because BabelMap does not use Uniscribe, but renders characters directly using their glyph ID, as read from the font's CMAP table. But the edit buffer is a standard Windows edit control, which uses Uniscribe, and the Egyptian characters render as square boxes. Let's try again with BabelPad, this time with "Simple Rendering" mode selected, which uses the same method as BabelMap to render characters :

Hmm, that's weird, it only renders the first character in each line correctly. And exactly the same problem is seen in Windows Vista (screenshot omitted), so it is almost certainly a bug in BabelPad (fixed in version 5.2.0.8 released 2010-06-06). But what we have learnt is that if you use Uniscribe under Windows 7 (whether in Notepad or in an edit control or in BabelPad), then you won't see any Egyptian Hieroglyphs. The bottom line is that Windows 7 proudly supports Unicode 5.1, but is not forwardly compatible with later versions of Unicode, including Unicode 5.2 which was released in the same month that Windows 7 was released to the general public. Thus, for example, Phaistos Disc symbols (encoded in Unicode 5.1) render OK under Windows 7 (as evidenced by the fact that they display in the edit buffer of BabelMap) :

All previous versions of Uniscribe have passively allowed text encoded in Unicode characters that it does not recognise to render OK as long as there is font support, but the version of Uniscribe that ships with Windows 7 appears to actively disallow Unicode text that it does not recognise ... or at least, characters in Unicode ranges that it does not recognise (post-Unicode 5.1 characters in existing Unicode blocks will be rendered OK under Windows 7 if there is font support). There is, however, one exception to this: CJK unified ideograph blocks added to the Supplementary Ideographic Plane (SIP) post Unicode 5.1 will render OK if there is font support (presumably Uniscribe treats the SIP as a single range) :

I wonder if Internet Explorer 8 does any better on Windows 7 than Notepad?

Internet Explorer 8 under Windows 7

Nope, just like in Notepad, Unicode 5.1 scripts and CJK Unified Ideographs Extension C render OK, but Egyptian Hieroglyphs and currently reserved character ranges come out as little square boxes. So there you have it, if you want to write in any of the fifteen new scripts added in Unicode 5.2 (Avestan, Bamum, Egyptian Hieroglyphs, Imperial Aramaic, Inscriptional Pahlavi, Inscriptional Parthian, Javanese, Kaithi, Lisu, Meetei Mayek, Old South Arabian, Old Turkic, Samaritan, Tai Tham, and Tai Viet) or any of the various new scripts and symbol blocks that will be added in the forthcoming Unicode 6.0 (Mandaic, Batak and Brahmi scripts, and Playing Cards, Miscellaneous Pictographic Symbols, Emoticons, Transport and Map symbols, and Alchemical Symbols), then my recommendation is to avoid Windows 7.
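The "two square boxes per character" symptom mentioned earlier is what you would expect if each half of a UTF-16 surrogate pair is drawn on its own. The following Python sketch shows how a supplementary-plane code point splits into two 16-bit code units. The helper name `to_surrogates` is made up for this illustration, and the sketch says nothing about how Uniscribe itself is implemented.

```python
def to_surrogates(code_point):
    """Return the UTF-16 code units (one or two) for a code point."""
    if code_point < 0x10000:
        return [code_point]            # BMP: a single 16-bit code unit
    offset = code_point - 0x10000      # 20 bits to share between the pair
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return [high, low]


# U+13000 is the first code point of the Egyptian Hieroglyphs block.
units = to_surrogates(0x13000)
print([f"{unit:04X}" for unit in units])   # ['D80C', 'DC00']

# Cross-check against Python's own UTF-16 encoder.
encoded = "\U00013000".encode("utf-16-be")
assert encoded == bytes([0xD8, 0x0C, 0xDC, 0x00])
```

A renderer that does not recombine D80C and DC00 into U+13000 before looking up a glyph has nothing to draw for either half, so it shows two missing-glyph boxes.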

Phags-pa Rendering on Windows 7

Whilst we are on the subject of Windows 7, let's have a quick look at the rendering of the Phags-pa script in Windows Vista and Windows 7.

Phags-pa is a complex script in rendering terms, and Windows Vista does not actively support the script. Nevertheless, under Windows Vista, Unicode Phags-pa text renders correctly in all respects (joining, contextual shaping and variation sequences) in BabelPad and Notepad using my BabelStone Phags-pa Book font :

Phags-pa text rendered in BabelPad with the BabelStone Phags-pa Book font under Windows Vista

However, under Windows 7, the font is next to useless, as no joining or shaping behaviour is applied :

Phags-pa text rendered in BabelPad with the BabelStone Phags-pa Book font under Windows 7

On the other hand, the same Phags-pa text does render correctly using the Microsoft PhagsPa font that ships with Windows 7 :

Phags-pa text rendered in BabelPad with the Microsoft PhagsPa font under Windows 7

Now, the Microsoft PhagsPa font is in many respects (and not coincidentally) very similar to my BabelStone Phags-pa Book font, but the one crucial difference between the two fonts is the set of OpenType features that are used to control the joining and shaping behaviour of characters. The BabelStone Phags-pa Book font uses the Contextual Ligatures <clig> and Glyph Composition Decomposition <ccmp> features to enable it to do all the joining and shaping stuff, including variation sequences, internally without any need for assistance from Uniscribe. On the other hand, the Microsoft PhagsPa font uses the Initial Forms <init>, Medial Forms <medi> and Terminal Forms <fina> features to do the joining behaviour, and these features rely on Uniscribe. For this reason, the Microsoft PhagsPa font won't work correctly under Windows Vista (no Uniscribe support for Phags-pa), and conversely, the BabelStone Phags-pa Book font won't work correctly under Windows 7 (too much Uniscribe support for Phags-pa). I can't really complain about this, as Microsoft support for Phags-pa would almost inevitably mean making Uniscribe instrumental in the rendering process and using a different set of OpenType features than I used (of necessity) in my font. What I will do, when and if I ever get some free time from Tangut, is release new versions of my Phags-pa fonts that use the same OpenType features as the Microsoft PhagsPa font does.

But there is one added complication. Starting with Windows 7, Microsoft now use the newly defined Format 14 cmap subtable (Unicode Variation Sequences) to process variation sequences, thus by-passing OpenType entirely. In Windows Vista and earlier, variation sequences would work without any special support from Uniscribe by defining glyph substitutions in the font under the Glyph Composition Decomposition <ccmp> OpenType feature. Thus, under Windows Vista it is possible to correctly render Mathematical Variation Sequences by using James Kass' Code2000 font, or Phags-pa variation sequences using my Phags-pa fonts. But under Windows 7, variation sequences no longer render correctly using these fonts. Instead, under Windows 7, Microsoft's Cambria Math font supports Mathematical Variation Sequences, and Microsoft PhagsPa supports Phags-pa variation sequences, by including variation sequence mappings in an additional Format 14 cmap subtable which is accessed by Uniscribe. In my opinion, the use of a cmap subtable to apply variation sequences rather than use simple OpenType features is a very bad idea, as it overcomplicates what is essentially a very simple task, and makes variation sequence support not backwards compatible with versions of Windows prior to Windows 7. Moreover (and from my perspective, more importantly), there is not yet widespread support for the new Format 14 cmap subtable, and the font editor that I use has no short term plans to add support for this subtable, which makes it difficult for amateur font developers like myself to create fonts that use the Windows 7 model for variation sequences.

Finally, the screenshot above shows a variation sequence <U+A86A U+A85E U+FE00> (ꡪꡞ︀) rendered correctly with the Microsoft PhagsPa font on BabelPad (NB this only works on BabelPad version 5.2.0.0 or later, as applications need to set an undocumented flag in Uniscribe [SCRIPT_CONTROL.fMergeNeutralItems = TRUE] for the Format 14 cmap substitutions to work), but take a look at what happens when we display the same text on Internet Explorer 8 under Windows 7 :

... the variation sequence (highlighted) is rendered incorrectly as two disconnected glyphs. Looks like Internet Explorer 8 does not yet support the new Format 14 cmap subtable for variation sequences; yet one more example of Microsoft's disconnected thinking across different development teams, and the appalling lack of testing that seems to be par for the course with Microsoft.
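To make the idea of a variation sequence concrete, here is a small Python sketch that groups each variation selector with the character before it, and then looks the (base, selector) pair up in a table, which is roughly the kind of mapping a Format 14 cmap subtable holds. The function names, the `UVS_TABLE` dictionary and the glyph name in it are all hypothetical; this is only a text-level model, not a description of Uniscribe or of any real font.

```python
def is_variation_selector(ch):
    """True for VS1-VS16 and the supplementary selectors VS17-VS256."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text):
    """Split text into units, keeping each selector with its base character."""
    units = []
    for ch in text:
        if is_variation_selector(ch) and units:
            units[-1] += ch      # attach the selector to the preceding base
        else:
            units.append(ch)
    return units


# Hypothetical (base, selector) -> glyph table; the glyph name is invented.
UVS_TABLE = {(0xA85E, 0xFE00): "glyph_for_A85E_variant_1"}

text = "\uA86A\uA85E\uFE00"      # the sequence <U+A86A U+A85E U+FE00>
for unit in split_variation_sequences(text):
    key = tuple(ord(ch) for ch in unit)
    glyph = UVS_TABLE.get(key, "default glyph")
    print(" ".join(f"U+{cp:04X}" for cp in key), "->", glyph)
```

The point of the sketch is that the selector never gets a glyph of its own. If a renderer treats U+FE00 as a separate character instead of as part of the unit <U+A85E U+FE00>, the variant form is not selected.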

2.1 The Sea of Characters

The mid 12th century monoglot Tangut rhyming dictionary, the "Sea of Characters" (Wén Hǎi 文海 in Chinese), provides a compositional analysis of each character

It explains each character in terms of other Tangut characters from which its constituent elements have been borrowed

E.g. explains Character A as being derived from the left side of Character X and the right side of Character Y

The source characters from which an element is said to be derived may have a phonetic or a semantic relationship with the target character (note it is the source character which has the semantic or phonetic function, not the element itself)

Creates a network of interrelated characters—a web of characters rather than a sea of characters

The four characters under the large head character give the character's structural composition, using the following terms (and sometimes others) to indicate what part of the source character is being referred to (as no more than four characters are ever used to describe a character's structural composition, often one or more of these terms are elided) :

𘓳 *ŋowr = "whole"

𘊱 *pha̱ = "left"

𘁝 *nji̱j = "middle"

𗡼 *bji̱r = "right"

𗥦 *ɣu = "head, top"

𗘡 *tśhjɨj = "bottom"

𘍞 *iọ = "surrounding, enclosing"

The description of a character's composition does not always make sense.

Questions :

Can the Sea of Characters analysis be relied on ?

Does its analysis of the structure of Tangut characters accurately reflect the principles by which the Tangut script’s creator or creators devised the individual characters of the script ?

Or is it a later, spurious attempt to explain and rationalize the structure of characters ?

2.2 A Case Study : The "Sun" Radical 𘤊

This component is Nishida Tatsuo's Radical No. 211, which he calls the "sun radical" 日部 (see Seikago no kenkyū 西夏語の研究 [A Study of the Hsi-Hsia Language] page 244). However, very few characters with this component are in any way related to the sun, and so Nishida's radical name is a misnomer (by far the largest semantic group of characters with this component is the "Bird-related" group, but Nishida already has a "bird" radical). As we shall see below, unlike most Chinese radicals, Tangut radicals do not have a single fixed meaning, and so giving names to them (as Nishida and others have done) is at best not very useful, and at worst misleading.

A total of 219 characters (in N3797) with this component as a primary component :

1 character with this component as its only component

1 character with this component at the bottom

119 characters with this component on the left hand side

51 characters with this component on the right hand side

47 characters with this component in the middle

The Sea of Characters dictionary has head entries for 102 of these characters :

The Chinese Radical model does not seem to apply to Tangut. In Chinese a particular radical has a single semantic determinative function, whereas for Tangut the same radical may have many different semantic functions depending upon the source character from which it is taken. Or, more accurately, a particular "radical" does not have an inherent semantic determinative function, but rather, its semantic function in any given character depends on the source character that the radical is derived from. This helps explain why there are so many characters with the "person" radical 𘢌 (about 20% of all Tangut characters include this element; see Marc Miyake's How Many People are in the Tangut Script?) — this element does not have an inherent sense of "person" in the same way that the Chinese ⼈ *rén "person" radical has, but can mean almost anything depending upon the character that it is derived from.

It is possible to group characters with the 𘤊 radical into several different categories (as shown below), based on its source character as given in the Sea of Characters. By far the largest category of characters are those related to birds and flying, but as this is only one of several semantic categories covered by this radical, you cannot assume that any character with this radical is related to birds or flying (unlike Chinese, for which you can assume that almost any character with the ⿃ *niǎo "bird" or ⾶ *fēi "flying" radical is related to birds or flying respectively). In fact, of the 102 characters in the above table, only about half of them can be grouped into the semantic categories shown below. The other half are idiosyncratic, and have to be considered one at a time.

A. Characters Related to Birds and Flying (𘤊 = 𗿼 *dźjwow "bird")

Of the 32 characters in this category, 24 have the radical on the left, 6 on the right, and 3 in the middle (one character has the radical on the left and in the middle), so although the left-hand side is the most common position for the radical, it is not fixed, and may occur on the right-hand side or in the middle of the character as well.
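The position counts quoted in this section can be checked with a few lines of Python. The figures are taken directly from the text above. The only thing added here is the arithmetic, including the correction for the one bird-related character that is counted under both "left" and "middle".

```python
# All characters with this component as a primary component (in N3797).
by_position = {"only": 1, "bottom": 1, "left": 119, "right": 51, "middle": 47}
total_characters = sum(by_position.values())
print(total_characters)          # 219

# Bird-related category: one character has the radical both on the left
# and in the middle, so it must not be counted twice.
left, right, middle = 24, 6, 3
left_and_middle = 1
bird_characters = left + right + middle - left_and_middle
print(bird_characters)           # 32
```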

As discussed in Part 1, modern Tangut dictionaries use systems of radical indexing that are based on arbitrary, artificial radicals. However, it should be possible to generate a list of natural radicals used in the Sea of Characters. The bone-related characters in this category all use the two right-hand elements of the character 𗥛 *rjɨr "bone", and so 𘤊𘠢 would form a single natural radical in a hypothetical Sea of Characters radical system.

Whilst we can classify some of the characters with the 𘤊 radical according to semantic categories, as shown above, we can more usefully classify characters according to the functions of the component elements that comprise each character.

A. Phonetic plus Semantic Constructions

These constructions are similar to Chinese Radical plus Phonetic constructions, but whereas Chinese phonetic elements usually have a narrow range of phonetic values, and the reading of an unknown character can often be guessed from its phonetic element, Tangut phonetic elements do not have a fixed phonetic value, but represent the phonetic value of the character from which the element is derived. Thus the element 𘤏 represents *źjiw in the character 𘀉, but represents *tser in the character 𘀕, and so if an unknown character were to include the element 𘤏, we could not guess what phonetic value it represented ... or even whether it had a phonetic function or a semantic function.

In some cases where a character has no intrinsic meaning, for example family or clan names, the character may be constructed from two homophonous phonetic elements.

𘛯 *gu̱ [a surname] =𘛴 *gu̱ "a spirit" +𘛲 *gu̱ "to patrol"

C. Semantic Constructions

Many characters do not have a phonetic element at all, but comprise two or more elements with a semantic function, which taken together explain the meaning of the character, either directly (e.g. "black" + "bird" = "crow") or indirectly (e.g. "bird" + "wing" = "to fly"). In some cases the semantic elements may help us better understand the meaning of a character. For example, the known meaning of 𗿍 is only "a type of bird", but as its middle component element comes from a character that means "to assemble", it is probable that the type of bird is one that is usually found in large flocks. As another example, the Lǐ Fànwén dictionary definition for the two characters 𗿷 *dźjij and 𘟣 *dju is the same (有 = "to have" or "to possess"), but the fact that the left side of 𗿷 is derived from the character 𘜶 *ljịj "big" suggests that it may mean something more like "to have much" or "to have everything", and so is subtly different in meaning to 𘟣 (this impression is strengthened by the fact that when reduplicated 𗿷 means "all, every").

Some constructions are difficult to understand, especially family names and place names. Do we lose something in translation ? Or maybe textual corruption in the Sea of Characters has resulted in the correct source character being replaced by a similar but different character ?

Based on this study of a single radical, it seems to me that Tangut does have "radicals", but that they are very different to Chinese radicals. Firstly, each component element of a Tangut character is a radical, so most Tangut characters have two or three radicals. Secondly, Tangut radicals do not have a fixed semantic meaning or phonetic value, but are used to connect one character to another character, so that each character is related through its radicals to two or three other characters, forming a network of interrelated characters. However, as the source character for any given radical is not explicit, but has to be looked up in the Sea of Characters (or some other long lost Tangut reference book), the radicals cannot be used to guess the meaning or pronunciation of an unknown character. At best — and this is perhaps their original intent — radicals can be used as mnemonic devices to help the learner read and write Tangut characters.

2.3 Characters with Unitary Composition

The vast majority of Tangut characters are composed of at least two distinct components, and their compositional analysis in the Sea of Characters describes them as the product of two or more other characters. However, there are a few characters with a unitary composition (30-40, depending upon how you count them), for example the character 𗾆 "waist", which is composed only of the radical 𘤊. In the Sea of Characters, the composition of this and other unitary characters with head entries is described in subtractive terms as deriving from a more complex element with one part removed :

Most of these characters are described using the formula "[component] X of [character] Y removed", and would seem to imply that the simpler unitary character is derived from the more complex character. However, in most of the cases where the source character also has a head entry in the Sea of Characters, we find that there is a circular derivation, for example the character 𗢨 "person" is stated to derive from the character 𘑘 "celestial being" by the removal of its top component, but 𘑘 "celestial being" is stated to derive from the character 𗢨 "person" plus the top of the character 𘑗 *ŋər "mountain". The latter analysis makes a lot of sense as it parallels the construction of the corresponding Chinese character, 仙 xiān "celestial being", which is constructed from the character 人 rén "person" plus the character 山 shān "mountain". On the other hand there is no obvious explanation why the Tangut character for "person" should be derived from "celestial being" minus the "mountain" top. As another example, the character 𘂪 *dzjij "single" is stated to derive from the character 𗄴 *twe̱ "pair, couple" with the left side removed, where the left side of 𗄴 is defined as the left side of the character 𗅋 *mji "not". This is a very confused definition ("single" = "pair" minus "not"), but from it we can reconstruct a very sensible compositional definition for the character 𗄴 *twe̱ "pair, couple" (which does not have a head entry in the Sea of Characters) : 𗄴 *twe̱ "pair, couple" = left side of 𗅋 *mji "not" plus whole of 𘂪 *dzjij "single" (i.e. "pair, couple" = "not" + "single"). In at least these two cases, it seems to me that the compositional definition of the unitary character is simply the inverse of the compositional definition of a complex character that is composed from the unitary character. 

Other compositional descriptions of unitary characters are also very contrived and unbelievable, such as those for the characters 𘌢 *zu "belt, band" and 𘔧 *gjụ "seat, post, stick", which in both cases derive the unitary character from two halves of the component corresponding to this unitary character in two different complex characters. Thus, I think that unitary characters are not secondary derivations from complex characters as implied by their analysis in the Sea of Characters, but that the compositional definitions of unitary characters are secondary back-formations from the compositional definitions of other characters, perhaps simply intended as mnemonics rather than as a true analysis of the derivation of such characters.
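
The inversion suggested above (reading "X = Y with part Z removed" backwards as "Y = Z + whole of X") can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the names Part, invert_subtractive and describe are hypothetical, characters are represented by their English glosses rather than by code points, and the only data used is the "single" / "pair, couple" example discussed above.

```python
from typing import List, NamedTuple, Tuple


class Part(NamedTuple):
    """One element of a compositional definition (hypothetical structure)."""
    position: str  # which part of the source character is used, e.g. "left", "top", "whole"
    source: str    # English gloss standing in for the source character


def invert_subtractive(unitary: str, complex_char: str, removed: Part) -> Tuple[str, List[Part]]:
    """Read 'unitary = complex_char with `removed` taken away' backwards,
    giving 'complex_char = removed + whole of unitary'."""
    return complex_char, [removed, Part("whole", unitary)]


def describe(target: str, parts: List[Part]) -> str:
    """Format a compositional definition as a single line of text."""
    return f"{target} = " + " + ".join(f"{p.position} of {p.source}" for p in parts)


# Sea of Characters: "single" = "pair, couple" with its left side removed,
# where that left side is defined as the left side of "not".
target, parts = invert_subtractive("single", "pair, couple", Part("left", "not"))
print(describe(target, parts))
```

The sketch does nothing more than swap the roles of the two characters, which is the point being made: the subtractive definition carries no information beyond the additive definition of the complex character.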

2.4 Characters with Circular Composition

The compositional analysis in the Sea of Characters often includes pairs of characters with circular derivations, as in the following two examples :

Left side of 𗿼 *dźjwow "bird" ⇔ left side of 𗿤 *dźjwow "mating [of birds]"

Top and left side of 𗿞 *djị "to mate" ⇔ top and left side of 𗿯 *djị "to tread"

These circular constructions are disturbing, as they appear to suggest that the Sea of Characters analyses are flawed.
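
Given the analyses in machine-readable form, circular pairs of this kind could be found mechanically. The following Python sketch is one possible approach, not a description of any existing tool: the function name circular_pairs is hypothetical, characters are represented by their English glosses, and the table lists only the source characters mentioned in this post rather than the full Sea of Characters analyses.

```python
def circular_pairs(derivations):
    """Return sorted pairs (a, b) where a names b as a source and b names a as a source."""
    pairs = set()
    for char, sources in derivations.items():
        for src in sources:
            # A pair is circular when the source character's own analysis
            # points back at the character being analysed.
            if char != src and char in derivations.get(src, ()):
                pairs.add(tuple(sorted((char, src))))
    return sorted(pairs)


# Partial, illustrative data: each key is analysed in terms of the listed sources.
derivations = {
    "bird": ["mating [of birds]"],
    "mating [of birds]": ["bird"],
    "to mate": ["to tread"],
    "to tread": ["to mate"],
    "celestial being": ["person", "mountain"],
    "person": ["celestial being"],
}

for a, b in circular_pairs(derivations):
    print(f"{a} <=> {b}")
```

Note that this only detects direct two-character loops; longer cycles passing through three or more characters would need a general cycle search.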

2.5 Mapping the Web of Characters

In conclusion, the compositional analysis of Tangut characters given in the Sea of Characters does help to understand how characters were constructed, and I think that in most cases the analyses are correct, and do reflect the mechanisms by which the characters were created. However, there are problems with the analyses of unitary characters, and for many characters with circular derivations, which suggest that the analyses in the Sea of Characters cannot be relied on uncritically.

To better understand the mechanisms by which Tangut characters were created, it would be necessary to fully map out the structural relationships between all the characters in the Sea of Characters. This would not only allow us to better understand the process by which the Tangut script was created, but it might also enable us to better understand the meanings of individual characters, or in some cases even to reconstruct or correct the pronunciations of characters.
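
As a rough indication of what such a mapping might look like, the sketch below builds a small network from compositional definitions and lists every character connected to a given one. All names (build_web, related) are hypothetical, characters are represented by English glosses, and the three definitions are taken from examples in this post (the one for "pair, couple" is my reconstruction, not a head entry in the Sea of Characters).

```python
from collections import defaultdict, deque


def build_web(derivations):
    """Build an undirected adjacency map linking each character to its sources."""
    web = defaultdict(set)
    for char, sources in derivations.items():
        for src in sources:
            web[char].add(src)
            web[src].add(char)
    return web


def related(web, start):
    """Return all characters reachable from `start` by following derivation links."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in web.get(current, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - {start}


# Illustrative data only: character -> source characters of its components.
derivations = {
    "celestial being": ["person", "mountain"],
    "pair, couple": ["not", "single"],
    "[a surname]": ["a spirit", "to patrol"],
}

web = build_web(derivations)
print(sorted(related(web, "person")))
```

A full mapping would also need to record which part of each source character is used (left, right, top and so on), which this sketch leaves out for brevity.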

Appendix I: Note on the Positional Forms of Components

Many or most Tangut components are not positionally fixed, but can occur at the left, middle, right, top or bottom of a character. For example the most common Tangut component, 𘢌 (Nishida's 'person' radical) occurs in all these positions :

L2603 𗣦 (left)

L4593 𗡋 (middle)

L2386 𗁑 (right)

L1883 𗬌 (top)

L1148 𗕌 (bottom)

L3779 𗫾 (left and under)

The 'person' component takes basically the same form in whatever position it occurs, but some components have slightly different forms for the left hand side (or middle) and for the right hand side, with the right hand form often having a bent vertical stroke and sometimes with an additional short slanting stroke at the end :

L1273 𘄟 (left hand form = 𘤒) and L5599 𘂱 (right hand form = 𘤸)

L5064 𘁗 (left hand form = 𘤜) and L3569 𗤣 (right hand form = 𘦙)

L3940 𘞣 (left hand form = 𘫉) and L4447 𘜹 (right hand form = 𘫊)

L5629 𗰈 (middle form = 𘠌) and L5167 𗰆 (right hand form = 𘠴)

In some cases it is difficult to see that different components are actually different positional forms of the same basic component, and it is only by studying the compositional analysis in the Sea of Characters that we are able to equate seemingly different components. For example, Nishida's 'water' radical 𘠣 occurs in the form 𘠅 on the right hand side of a character, and in the form 𘡍 on top of a character (see Marc Miyake's Which Way Water?), as demonstrated by the Sea of Characters analyses of L2931 and L5809 :

L2931 𗕆 = left hand side of L2699 𗄻 + middle of L3898 𘀼 + left hand side of L2414 𗋒

L5809 𗕆 = left hand side of L3073 𗊉 + bottom of L4845 𗒿

[Revised: 2010-05-03 and 2010-05-21]

Last modified: 2017-01-01 (updated with Unicode Tangut characters)

If Tangut characters do not display correctly, please download and install the Tangut Yinchuan font.