Introduction

This document focuses on those issues of the software localization process that must be weighed specifically when you develop or localize a software product for China and the Chinese language. It is aimed at Western readers who are used to working only in a computing environment of one-byte ASCII or ISO 8859 characters. The paper also includes general information about problems in Chinese information processing, not only in localization.

The most important information processing problems discussed in this paper are character sets, fonts, input methods and the national standards required in software development for the Chinese market. In addition, the available tools for software development are studied. Internationalization and localization as specific problems are discussed only briefly. Problems faced in Chinese word-processing software, such as word segmentation, are much more complicated and are not studied here.

This is not a how-to guide to localization, but it contains information that one should be aware of when localizing software into Chinese. It gives IT staff information about the common requirements of Chinese software development and helps them decide which tools can be used.

The first chapter presents basic facts about the Chinese language to help the reader understand the diversity of problems related to the language. The second chapter gives general information about software globalization. Chapter three describes the problems faced in Chinese software, and chapter four gives information about products. The product section is not very detailed because the features of products change all the time. Several products mentioned in this paper have added the required GB standard support during the past twelve months.

1 A brief introduction to Chinese language

The Chinese language is a group of languages that are mostly spoken in Mainland China and Taiwan.
It is spoken by more than one billion people, and approximately 95 percent of the population of China speak Chinese. Other languages spoken in China, e.g. Tibetan, Mongolian, Lolo, Miao and Tai, are spoken by minority groups. The majority of the Chinese-speaking population is in Mainland China and Taiwan, but there are also Chinese-speaking groups in Singapore, Indonesia, Malaysia and Thailand.

When entering the Chinese market it is important to understand the diversity of the Chinese language. It has seven major language groups, and each of them has several dialects. Fortunately there are only two written forms, if the languages of minority groups are left aside. The combination of spoken language and written form depends on the location. Because Mandarin is the official language of China and the character set in use is Simplified Chinese, software companies that support these two can cover almost the whole country. The following sections introduce some characteristics of spoken and written Chinese.

Export HIS - Localization 3

1.1 The spoken Chinese

The three biggest spoken Chinese language groups are Mandarin, Wu and Yue. Mandarin is based on the dialect of the Beijing area and is the most widely spoken, covering at least 80 % of the area of the country and 74 % of the population. Because the area where Mandarin is spoken is so large, the language varies to some extent from place to place. Mandarin is the official language of China and also one of the four official languages of Singapore.

The second biggest group is Wu (吴), whose dialects are spoken by about 80 million people in the Zhejiang and Jiangsu provinces, in Shanghai, and in Hong Kong and Taiwan. Its major dialects are Shanghainese and Suzhounese.

The third biggest group is Yue (粤), also called Cantonese. Its dialects are spoken by about 65 million people in Guangdong, the eastern part of Guangxi, Hong Kong, parts of Macau and elsewhere.

Because Mandarin, Wu, Cantonese and so on are spoken languages, these names should be used when one needs an interpreter, not when one needs a written translation.

Mandarin has only a limited number of differently pronounced syllables, built from about 400 basic pronunciations with four or five tones for most of them. As a result there are many homophones, which written Chinese distinguishes by using a different character for almost each one.

China also has other official languages, e.g. Tibetan, Mongolian, Lolo, Miao and Tai, which are spoken by minorities.

1.2 The written Chinese

Characters

Chinese does not build words from a phonetic alphabet but from characters, each of which represents a whole word or a part of one. There are tens of thousands of characters altogether, but many of them have overlapping meanings or are variants of other characters. The commonly used characters can be combined into a very large number of words. New words are built from existing characters. For example, computer is 电脑 (dian4nao3) in Chinese, where 电 (dian4) means electricity and 脑 (nao3) means brain.
A number after the syllable expresses its pronunciation tone.

The four types of characters

There are four types of Chinese characters: pictographs, ideographs, compound characters and semantic-phonetic compounds.

Pictographs were originally pictures of things, and sometimes the original shape can still be recognized. For example, 川 (chuan1) means a river and 山 (shan1) means a mountain.

Ideographs are graphical representations of an abstract idea. For example, 三 (san1) means three and 中 (zhong1) means middle.

Compound ideographs (compound characters) are built by combining two or more pictographs or ideographs to form a new character. All component parts contribute to the meaning of the compound character. For example, 明 (ming2) consists of a sun and a moon, and its meaning is bright.

Phonetic ideographs (semantic-phonetic compounds) consist of two parts, where the first one gives a hint of the meaning of the character and the second one gives an idea of the pronunciation. This is the biggest group, more than 90% of all characters. For example, 钟 (zhong1) means a bell, and the character is a combination of 钅 (referring to metal) and 中 (zhong1), which refers to its pronunciation.

Most words consist of two or sometimes more characters, and characters are written in a sentence without a space or any other separator between words. Therefore, when reading, in addition to understanding what the characters mean and knowing how to pronounce them, you must also know which adjacent characters belong together.

The structure of a character

Characters are composed of smaller units called radicals and of other elements that cannot serve as radicals. One radical in a character can be used for indexing purposes. There are some 200 radicals altogether; the exact number depends on the radical system, and several radicals can also stand alone as characters. As an example, 明 (ming2) consists of two elements, 日 and 月, where the first one is the radical used for indexing.

Radicals and other elements are composed of one or more strokes. Each character has a defined stroke order that is followed when drawing it. That stroke order, as well as the number of strokes, can also be used for indexing purposes. The character 明 is composed of eight strokes, where the first stroke is a vertical bar and the second one is a hook.

When writing Chinese, every character is given exactly the same amount of space, no matter how many strokes it contains.
It means that radicals and other elements are stretched and squeezed so that they fit into the square reserved for each character. For example, the character 一 (yi1) has only one stroke but 齉 (nang4) has 36 strokes, and both have to fit into a square of the same size. This document normally uses font size 10, but for those of us who are not used to looking at Chinese characters the size must be raised to 24 to really see the latter character: 一 (yi1), 齉 (nang4)

Writing direction

In Simplified Chinese (see The simplification) almost all publications, and in Traditional Chinese most books and newspapers, are printed horizontally from left to right. The vertical direction, starting from the right side, is used only in advertising and in some special cases.

The simplification

There are two main versions of written Chinese: Simplified Chinese and Traditional Chinese. The main objective of simplification was to raise the literacy rate. The simplified writing system reduces the number of strokes per character and the number of characters in common use. The most frequently used complex characters are written with fewer strokes in Simplified Chinese. There are approximately 2500 characters that have a simplified form.

Simplified Chinese was introduced in the 1950s and is used in China and Singapore. Traditional Chinese is used in Taiwan, Hong Kong, Macao and overseas Chinese communities.

The transliteration

Because the Chinese writing system is based on logographic symbols, there is no way of knowing the pronunciation of a character only by looking at it. To handle pronunciation, a few transliteration systems are used. Some of these systems use the Latin alphabet, e.g. Pinyin and Wade-Giles. These are called "romanization systems". There are also transliteration systems that do not use the Latin alphabet, e.g. Zhuyin Fuhao.

The Pinyin system was developed in China in 1958, and it is now the only transliteration system used in China. It uses 25 Latin letters (v is not used). Pinyin has been a United Nations standard since 1977 and has also been adopted by ISO (the International Organization for Standardization).

The Wade-Giles system was first published by Thomas Francis Wade in 1859, and it was earlier almost the only system used in English-speaking countries. It is still popular in Taiwan. A frequently met example is the name of the capital of China: Peking is the Wade-Giles transliteration and Beijing is the Pinyin transliteration.

The Zhuyin system does not use the Latin alphabet but elements of Chinese characters to express pronunciation. It is also known as "bopomofo" after the pronunciation of the first four characters in its character set. It consists of 37 characters, was first introduced in the early 20th century, and is still used at least in Taiwan.

1.3 The language summary

To summarize the previous sections: in Mainland China, Simplified Chinese is used in writing, and Mandarin, Wu, Cantonese or some other dialect is used in speech. In Hong Kong, Traditional Chinese is used in writing and Cantonese is mostly used in speech.
In Taiwan, Traditional Chinese is used in writing, and Mandarin or Wu is used in speech.

A rough table of the spoken language, transliteration and characters used in different areas:

Transliteration, spoken and written languages in different geographic areas
Columns: Spoken language (Mandarin, Wu, Cantonese); Transliteration (Pinyin, Wade-Giles, Zhuyin); Written language (Simplified, Traditional)
Mainland China: yes yes yes yes yes
Hong Kong: yes yes yes yes
Taiwan: yes yes yes yes yes

The written language in use influences the character set that is needed, and the transliteration system influences the input methods. Because Mandarin is the official spoken language, Simplified Chinese is its written form, and Pinyin is the only transliteration system used in Mainland China, software products that conform to these three will cover most of China. Keep in mind, however, that such software will not cover Taiwan, Hong Kong or the minorities of China.

2 The software globalization

Software globalization is the process of building software for use in the global market. The first part of building global software is to build an international version that can be localized to different languages and cultures with minimal effort. Localization is the process of adapting the internationalized software to the features of a specific language and region.

2.1 The internationalization

Internationalization is the phase of software globalization in which linguistically and culturally dependent parts are isolated from the core. This requires extracting all elements that depend on language, culture or country, and making no assumptions about the target language or region while creating the core software.

It is possible to localize a software product that has not been internationalized at all. This requires a lot of work that has to be repeated for every new version of the product, and it is not recommended under any circumstances.

Another option is to first internationalize the existing product and then localize it. This method must be thought through carefully, because multilingual support has such a deep impact on the software architecture that problems are difficult to avoid.

If a product has been developed for international use from the start, and the cultural and linguistic parts were isolated in the first step, it will be easier to build localized versions for different cultural and linguistic regions later.

Because internationalization leads to a more modular product, support and maintenance of local versions of the software will be simpler. Getting new versions of internationalized software to different markets will also be much faster and less expensive.

If there is only one code base to be used everywhere, it saves money at all levels: corrections are made in only one place, there is one compiled version of the software, debugging is done in only one version, and much less testing is needed. All this means fewer hours spent on the software and less hardware to provide.
Technical support is easier with one code base, and international customers get better service if the product does not differ between countries.

There are two models for internationalization:

The locale model implements a set of attributes for specific locales, and one of those attributes is a character set. The character set is specific to a given culture, region or locale, and it cannot be changed unless the locale is changed.

In the multilingual model the system can use a character set that contains all characters necessary for several cultures or regions. Multilingual software can be implemented by using the Unicode character set.

In the internationalization process the following parts of the software product should be isolated from the core:
window titles
text strings, also text that is embedded in graphics
menus
dialogues
presentation information (position, size, font, colour, intensity, text orientation)
error and confirmation messages
help text
report layouts

symbols
graphics
icons
sounds
cursor shape
hot keys
function calls, input arguments and output data
strings communicated between components
folder names
constants
formulas (tax, pension, ...)
functions to change data from internal to external form and vice versa
all references to devices
all references to the operating system
all software services that have user interaction

The code of a localizable product has three basic requirements:
The code supports localization with no modification to executable code.
The code operates properly with all target character systems, e.g. double-byte or multibyte systems.
The code operates correctly with texts in several different languages.

Outside the basic software product there is also a need for a localizable documentation and help system.

Microsoft and Sun Microsystems each have a useful website for checking internationalization features.

2.2 Linguistic tools in software localization

In addition to software development products, you can ease your localization work by using linguistic tools. Linguistic tools are designed to meet the challenges of translating the textual parts of a software product.

Terminology management systems

Terminology is the foundation of good documentation and translation. The purpose of a terminology management system is to collect and organize terminological data by building a terminological database that contains entries in different languages. It can help communication between the original writers and those who carry out the localization.

Translation memories

A translation memory is a database in which previous translations, both the source and the target language text, are collected. When the subsequent version of the

same source is compared with the original, the previous translation is inserted into the new target text.

Machine translation

A machine translation system performs linguistic analysis on the source text and translates it into the target language. Because machines cannot deal with ambiguity in the way that humans can, the result is not comparable to human translation, but it can be used after post-editing. Machine translation works best on unambiguous texts, but even then it requires good dictionaries in the translation system.

Globalization Management systems

Globalization management systems are tools for translating large and constantly changing websites. Such a system consists of an engine that monitors site content and a component that passes content to translators or other linguistic tools for further processing. It also manages the workflow and the synchronization of translated content with the source-language website.

Translation workbenches

A translation workbench is an integrated set of tools that support localization.

Many companies offer the products mentioned above, either as a whole translation workbench or as a smaller part of one. Besides tools they often also offer globalization services.

3 Localization problems

This chapter discusses the problems faced in Chinese software. The biggest issue is the character set and the problems that follow from it. In addition, there are linguistic and cultural aspects to handle as well as technical problems to solve.

3.1 Characters

The first question most Western people ask when discussing Chinese software is the question of characters. It is a big question and leads to others, e.g. character sets, encodings, fonts and sorting.

Character sets and encodings

A character set is a collection of characters that have been grouped together for some reason, for example because they belong to a language or a group of languages. Some character sets are non-coded, some are coded. Characters in a coded character set have been mapped to numeric values and can be used by computers.

An encoding is the mapping that assigns numeric values to characters. Different encodings usually assign different numeric values to each character. Sometimes, when an encoding is defined as an add-on to an existing encoding, the old values are kept where possible.

One byte can represent only 256 different values, and there are tens of thousands of Chinese characters. Although not all of them belong to any coded character set, it is obvious that more than one byte must be used for encoding. Even though a Chinese character is composed of two or more bytes, it must be treated as one character in all operations. In other words, when the user handles one character, all bytes that belong to that character must be handled together. For example, when a character is deleted, all of its bytes must be deleted, otherwise the data is corrupted.

From 1981, GB 2312 was the official encoded character set of China. GB in the name stands for GuoBiao, national standard. It includes almost 7000 Simplified Chinese characters in two sets. The first set, 3755 frequently used characters, is arranged by pronunciation (Pinyin transliteration).
Another set, 3008 less frequently used characters, is arranged by radical and then by number of strokes. GB 2312 also includes Zhuyin symbols, Pinyin vowels with tone marks, the Latin alphabet, numerals in various series, punctuation, Japanese kana, and the Greek and Cyrillic alphabets as well as some other symbols.

GB 13000.1 is compatible with Unicode 2.1 and has code points for CJK (Chinese, Japanese and Korean) characters.

GBK is an extension to GB 2312. It is not a standard but an encoding specification for Hanzi (the characters used in Chinese). The final letter K is the first letter of Kuozhan, which means extension. It consists of the GB 2312 characters, the GB 13000.1 characters and some others. It also has a mapping to Unicode 2.1.

The latest encoding standard, GB 18030, was released in March 2000 and updated later. Its official name is Chinese National Standard GB 18030-2000: Information Technology - Chinese ideograms coded character set for information interchange - Extension

for the basic set (资讯技术 - 资讯交换用汉字编码字元集 - 基本集的扩充). GB 18030 was created as an update to the earlier GB encodings for Unicode 3.0.

GB 18030 has, among others, the following properties:
It is backward compatible with the previous official encoding standard GB 2312, having the same numeric value for each character of GB 2312.
It includes all GBK characters.
It includes all characters that are in Unicode 3.0 (6582 Chinese characters more).
It includes characters of the Tibetan, Mongolian, Yi and Uyghur languages.
Characters are encoded in one-, two- or four-byte sequences.
It has 1.6 million byte sequences, of which about 500,000 are currently unassigned.
It provides code space for all used and unused code points of Unicode's plane 0 (BMP) and its 16 additional planes.
It has a mapping table to Unicode 3.0 code points.

All language-related products in China must be able to use all the characters in GB 18030, and any product released on or after 1 September 2001 must be certified in one of the following classes:
A+ The product supports the input, output, editing and display of all characters in GB 18030, including minority scripts.
A The product supports the input, output, editing and display of all characters in GB 18030, excluding minority scripts. The product must not corrupt minority characters even if it does not have fonts to display them.

GB 18030 applies to the processing, interchange, storage, transmission, display, input and output of graphical character information.

GB 18030 support in products is discussed further in chapter 4, Software development products.

Fonts

It is not enough to define the character set and its encodings; there must also be a digital definition of the printable form of each character. The same character can be printed in different fonts. The following examples of Chinese fonts are from the web page of URW++ Design & Development Company from Germany.

Fang Song, Hei, Song, Yuan, Hupo

Kai, Lishu, Wei Bei, Zong

The number of characters in Chinese character sets is a constraint on font providers. Only a few companies have the resources to create fonts for thousands of characters. To mention a few of them, there are at the moment Bitstream, Founder Group, Changzhou SinoType Technology Co and Zhong Yi Electronics.

Bitmapped fonts

In a bitmapped font every character is constructed as a dot matrix. Bitmapped fonts do not scale well to larger sizes, and a provider who wants to offer readable characters in several sizes must design a separate set of bitmaps for each size.

Outline fonts

Outline font characters are constructed from outlines, which means that each character is described mathematically as a sequence of line segments and curves. The outline is scaled to the requested size, then filled and converted to a bitmapped image for the output device. An outline font character can be used at any size, and the designer needs to design the characters at only a single point size.

PostScript is a page-description language developed by Adobe Systems. It supports both text and graphics and provides built-in support for fonts. It has several font formats, of which Type 1 is the most widely used. In PostScript each glyph in a character set has a unique CID (character ID), a numeric value that is independent of any encoding. Encodings are associated with CIDs in CMap files, where an encoding range is associated with a CID range. The ranges can be short or long, and a row in the CMap file consists of three numbers: the first and the last value of the encoding range, and the CID at which the mapping of that range starts.

The TrueType font format was originally developed in the 1980s by Microsoft and Apple. Later they developed TrueType separately, and now the two font formats are incompatible with each other. All TrueType fonts contain "cmap" tables that map the glyphs to encodings.
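
The three-number rows described above for CMap files can be illustrated with a small sketch. It is not a CMap parser; the two rows in RANGES and the function name code_to_cid are hypothetical values chosen only to show how an encoded value inside a range is turned into a CID by adding its offset to the starting CID of the range.

```python
from bisect import bisect_right

# Hypothetical rows in the form (first code, last code, first CID).
# The list must be sorted by first code and the ranges must not overlap.
RANGES = [
    (0x20, 0x7E, 1),
    (0xB0A1, 0xB0FE, 1000),
]

def code_to_cid(code):
    """Return the CID for an encoded value, or None if no range covers it."""
    starts = [row[0] for row in RANGES]
    # Find the last range whose first code is not greater than the value.
    index = bisect_right(starts, code) - 1
    if index < 0:
        return None
    first, last, cid_start = RANGES[index]
    if first <= code <= last:
        # The offset inside the encoding range equals the offset in the CID range.
        return cid_start + (code - first)
    return None

print(code_to_cid(0xB0A3))
```

Because whole ranges are stored instead of single pairs, a few rows can cover thousands of characters, which matters for character sets of the size discussed here.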
In 1996 PostScript and TrueType were merged into the OpenType standard, which was developed by Adobe Systems and Microsoft.

The following companies have developed GB 18030 fonts that have been certified for use in software products in China.

Agfa Monotype Corporation has two fonts, Hei Bold and Sung Light, that were approved in 2002 by the Committee on Information Technology Standards (CITS) and the State Language Committee (SLC) for distribution within China. They come from Agfa Monotype's WorldType multilingual font library and include full support for the Chinese character set standard GB 18030.

Font samples: Agfa Monotype Hei Bold, Agfa Monotype Sung Light

Agfa Monotype specializes in fonts and font technologies for graphic professionals, software developers and manufacturers of printers and display devices. Agfa Monotype is a subsidiary of Agfa Corporation and is part of Agfa's Graphic Systems business unit. Agfa is the U.S. subsidiary of the Agfa-Gevaert Group, one of the world's leading imaging companies.

Bitstream Inc. from the United States has the GB 18030 font Hei, approved on June 28, 2005 by the CITS and the SLC for distribution within China.

Bitstream Inc. is a software development company that enables customers worldwide to render high-quality text, browse the Web on wireless devices, select from the largest collection of fonts online, and customize documents over the Internet. Its core competencies include browsing, font and publishing technologies. Its library includes over 1,000 high-quality fonts in OpenType, TrueType and PostScript Type 1 formats for Windows, the Macintosh, Unix and Linux.

Beijing Founder Electronic Co., Ltd has developed the TrueType GB 18030 fonts ShuSong, Hei, Kai, FangSong and SongYi. In 2001 those fonts passed the national certification and were accepted for distribution in China.

Font samples: ShuSong, SongYi, Kai, FangSong

Hei

Founder Group is a leading provider of advanced information technology, software products, collaborative business solutions and other value-added services. Beijing Founder Electronics Co., Ltd. (Founder Electronics), a subsidiary of the Founder Group, has grown from a small peripheral R&D department of Beijing University into one of today's largest technology companies in China. Founder Electronics provides its products and services to customers in a wide range of industries worldwide, including newspapers, commercial publishing, printing, broadcasting, TV, Internet, libraries and government administration.

Beijing Founder Electronic Co., Ltd. has also developed Founder Super Font, which includes Chinese characters. Committed to promoting the technology of Chinese fonts and to developing fonts for Chinese character printing, Founder Group offers more than ten new fonts every year. The company also actively participates in formulating the relevant Chinese national and international ISO standards, such as GBK, GB and ISO/IEC standards.

Changzhou SinoType Technology Co., Ltd has developed the STSong Light font, which was certified in 2002 by the Press and Publication Administration of China, the CSLC and the National Typeface Committee.

Hunan Huatian Information Industry Co., Ltd has four GB 18030 TrueType fonts: Song, Fang Song, Kai and Hei. There is no information about the year of their approval by CITS or SLC for distribution in China.

Hunan Huatian Information Industry Co., Ltd. is a high-technology enterprise whose goal is to develop the national information industry and to research and develop internationally advanced information technology and products.

ZhongYi Electronic Ltd has Song, Hei, Kai and FangSong GB 18030 fonts that have been approved by the China State Language Committee, China State Press Publication and the National Printing Font Committee.

ZhongYi Electronic Ltd from Beijing is an affiliated company of Chinese Standard Technology Ltd.
The core technologies of the company include an input method and a full-text search engine that supports a very large character set. ZhongYi provides font customization services to generate bitmap, TrueType, PostScript and OpenType fonts in accordance with customers' requirements.

Sorting and indexing

Chinese written text is often sorted by Pinyin reading. Because Pinyin transliteration uses Latin characters, the text is then sorted according to the English alphabet. Sorting can also be done according to character structure.
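
As a rough illustration of sorting by Pinyin reading: the first set of GB 2312 is arranged by pronunciation, so for frequently used characters a comparison of GB 2312 byte values approximates Pinyin order. The sketch below relies on Python's built-in gb2312 codec; the helper name gb2312_sort_key is hypothetical. It is only an approximation, because it ignores characters with multiple readings, does not give Pinyin order for the second set (which is ordered by radical and strokes), and fails for characters outside GB 2312.

```python
def gb2312_sort_key(text):
    """Return the GB 2312 byte sequence of text for use as a sort key.

    Raises UnicodeEncodeError if text contains a character outside GB 2312.
    """
    return text.encode("gb2312")

# Readings: 中 zhong, 字 zi, 汉 han, 电 dian.
words = ["中", "字", "汉", "电"]
ordered = sorted(words, key=gb2312_sort_key)
print(ordered)
```

A production system would more likely use a collation library with explicit Pinyin rules; the byte-value shortcut is shown only because it follows directly from the way the character set is arranged.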

Indexing is essential when trying to find the right character. It can be done in various orders, e.g. phonetic order, radical plus number of remaining strokes, and total number of strokes.

In the phonetic order, characters are ordered according to their pronunciation in the Pinyin transliteration system. Because there are a lot of homophones in Chinese, every item in the index has several entries. And because some Chinese characters have multiple readings, a phonetic index has multiple entries for such a character. There are also characters that do not have any well-known reading. Such a character has no entry in the phonetic index.

If a character does not have an entry in the phonetic index, it may have one in a radical index. Radicals and radical-like elements are the basic building blocks of a character. A radical index has two levels. The first table consists of the radicals with their identifying numbers. The second table lists, under each radical, the characters that have that radical as their indexing radical. The characters under a radical are ordered according to the number of remaining strokes in the character. When using a radical index, one must first recognize the radical of the character and then count the number of remaining strokes.

Sometimes it is difficult to recognize the radical of a character, and in that case one can use the third method of indexing. It uses the total number of strokes, including the strokes of the radical. If characters were grouped only by this number, there would be too many characters in a group, which is why a second level is also used. The most common way is to order the characters within a stroke count by radical. The other method is to order them according to the shape of their first stroke, where the system offers five different stroke shapes. It is usually enough to use the first stroke, which gives a sufficiently small number of characters to find the requested one.
If some stroke count includes a large number of characters, the first two strokes are used. There are also characters with ambiguous stroke counts, because there are multiple ways to write their components. A good indexing system takes this into account and has multiple entries in the index for those characters.

3.2 Linguistic aspects

Text Input

Text input is quite complicated in Chinese. You must be able to enter both Chinese characters and the Latin alphabet, because Latin letters can appear in the middle of Chinese text, e.g. in common abbreviations. Because the Chinese language has a huge number of characters, it is not possible to have a keyboard with a separate key for each character. Therefore there are a number of methods for entering characters with a QWERTY keyboard. Some methods also allow characters to be entered with a device that has far fewer keys, e.g. a cellphone when sending SMS messages.

Input software applications are called IMEs (input method editors), and they run as a separate process that is integrated with the application. Text is input in two phases. In the first phase the user types keyboard input, which the computer interprets, depending on the input method, as a list of candidate characters that are mapped to the input string. In the second phase the user selects one candidate from the list. If there are more candidate characters than can be shown at a time, the user can request more candidates.

An input method can be based on pronunciation, character structure or encoding, or it can use multiple criteria. Pinyin method is based on pronunciation and a person who is familiar with

17 pinyin will learn the pinyin method very quickly. Wubizixing method is based on a character structure and it is not easy to learn but, once learned, it is much faster than any phonetic method. There are also methods that use both pronunciation and structure. In this paper I will introduce only a few input methods Input by pronunciation There are two units by which input can be converted to Chinese characters, a single character at a time or a string of two or more characters. User enters text in two phases: the first phase is typing a keyboard input that the computer interprets to a list of candidates and the second phase is selecting one candidate from the list or requesting for more candidates. If we convert input to characters one by one there will be a lot of candidates to choose for each character but if we enter input for two or more characters at a time there will be much less candidates and entering will be much faster. For example, if we want to write 汉 字 (han4zi4) that means Chinese character, the difference is obvious. Selecting chracters one by one; first han, then zi. han gives you 21 candidate to choose: 撼 喊 酣 蚶 函 涵 汉 干 汗 罕 鼾 韩 翰 瀚 颌 寒 含 旱 捍 悍 焊 from where you choose the seventh character. zi gives 27 candidates: 子 字 仔 籽 渍 自 滓 髭 龇 姿 咨 恣 资 赀 訾 紫 兹 滋 孳 缁 淄 辎 锱 姊 梓 孜 吱 from where you choose the second character. They both would have less choices if tone numbers were used after pinyin. han4 would give 12 candidates and zi4 would give 4. If you are in a mode where you can enter input for two or more characters at a time, entering hanzi gives the right choice 汉 字 at once. This example is produced with NJStar Chinese WP 5.01 product. Input methods by pronunciation Input by pronunciation is most frequentely used input method. It can use Pinyin or Zhuyin. Pinyin is based on Latin alphabet and Zhuyin has its own symbols that are elements of Chinese characters. 
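The two-phase candidate lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two tables are tiny hand-made samples rather than a real IME dictionary, and the names CHARS_BY_TONED_SYLLABLE, WORDS_BY_PINYIN and candidates are hypothetical, not part of NJStar or any other product.

```python
# Toy candidate table: toned Pinyin syllable -> characters.
# Only a handful of characters are listed; a real IME dictionary is far larger.
CHARS_BY_TONED_SYLLABLE = {
    "han2": ["寒", "含"],
    "han3": ["喊"],
    "han4": ["汉", "汗"],
    "zi1": ["资"],
    "zi3": ["子", "紫"],
    "zi4": ["字", "自"],
}

# Toy word table: toneless Pinyin string -> multi-character words.
WORDS_BY_PINYIN = {"hanzi": ["汉字"]}


def candidates(user_input):
    """Return the candidate list for one piece of keyboard input."""
    # Input for several characters at once: look for whole words first.
    if user_input in WORDS_BY_PINYIN:
        return list(WORDS_BY_PINYIN[user_input])
    # Input with a tone number: only characters with that tone qualify.
    if user_input in CHARS_BY_TONED_SYLLABLE:
        return list(CHARS_BY_TONED_SYLLABLE[user_input])
    # Toneless input: collect the characters of every tone of the syllable.
    result = []
    for syllable, chars in CHARS_BY_TONED_SYLLABLE.items():
        if syllable[:-1] == user_input:
            result.extend(chars)
    return result
```

Even with this small sample, candidates("han") returns more characters than candidates("han4"), and candidates("hanzi") returns a single word, which is the effect the example with 汉字 describes.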
Pinyin Input Method

There are three types of Pinyin input: Full Pinyin, Double Pinyin and Half Pinyin.

Full Pinyin means writing the pronunciation of a Chinese character in Pinyin just as it is. This requires one to six keystrokes per character.

In Double Pinyin certain letter combinations of Pinyin are replaced by one letter according to specific rules. Double Pinyin first divides the Pinyin reading into two parts and then replaces each letter combination with its assigned letter. This requires one or two keystrokes per character. For example, SHUANG in Full Pinyin is IH in Double Pinyin: SHUANG is divided into two parts, SH and UANG. The Double Pinyin letter combination definitions give SH=I and UANG=H, so SHUANG can be entered with the two keystrokes I and H.
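The Double Pinyin rule can be sketched as a split followed by two table lookups. The sketch below is a minimal illustration, not a complete scheme: the tables hold only the two definitions quoted above (SH=I, UANG=H), the function names are hypothetical, and syllables that begin with a vowel are not handled.

```python
# Replacement tables containing only the two definitions quoted in the text
# (SH=I, UANG=H). A real Double Pinyin scheme defines many more entries.
INITIAL_KEYS = {"SH": "I"}
FINAL_KEYS = {"UANG": "H"}


def split_syllable(syllable):
    """Split an upper-case Pinyin syllable into (initial, final)."""
    for initial in ("ZH", "CH", "SH"):  # two-letter initials are tried first
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    return syllable[:1], syllable[1:]  # otherwise assume a one-letter initial


def double_pinyin(syllable):
    """Return the Double Pinyin keystrokes for one Full Pinyin syllable."""
    initial, final = split_syllable(syllable.upper())
    # Parts without a replacement rule are typed as they are.
    return INITIAL_KEYS.get(initial, initial) + FINAL_KEYS.get(final, final)
```

With these tables, double_pinyin("SHUANG") yields the two keystrokes IH, matching the example in the text.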

Half Pinyin is between Full Pinyin and Double Pinyin, and it requires one to three keystrokes per character. For example, SHUANG in Full Pinyin is UUH in Half Pinyin: SHUANG is divided into two parts, SH and UANG. The Half Pinyin letter combination definitions give SH=U and UANG=UH, so SHUANG can be entered with the three keystrokes U, U and H.

Zhuyin Input Method

The Zhuyin system uses elements of Chinese characters to express pronunciation. It is also known as "bopomofo" after the pronunciation of the first four symbols in its character set. The set consists of 37 characters, and it is related to Pinyin in the sense that the same sounds can be expressed by both systems.

[Figure: Zhuyin keyboard as shown in the NJStar word-processing software]

The pronunciation of a character is not always known, and in that case one must use some other input method.

Input by structure

Data input can also be based on the structure of a character, e.g. a radical, the number of strokes, stroke shapes or corners. Nowadays there is also software that can recognize a whole character.

Input by whole character shape

In software where the whole character is recognized, a character is drawn e.g. with a mouse or an electronic pen. If the pronunciation of a character is not known, this method is very fast. Here is an example of the handwriting input of NJStar Chinese WP 5.01 and Chinese Pen. The character 高 has been written on the screen by moving the mouse.

As soon as I finish drawing, the target character 高 appears at the place where the cursor was.

Input by radical

Characters are composed of smaller units: radicals and other elements that cannot serve as radicals. One radical in each character is used for indexing purposes. There are altogether 214 radicals, and they all have names, numbers and a certain number of strokes. Characters are entered in two phases: in the first phase one enters the name or number of the radical of the character, and the computer returns a list of characters that have the requested radical. In the second phase one selects one candidate from the list or requests more candidates. Sometimes the indexing radical is not obvious. In that case one can count the number of strokes and use another method.

Input by number of strokes

Except for a few characters, every character has an unambiguous number of strokes. In this method the user enters the number of strokes of a character and the computer returns a list of candidates.

Input by stroke shapes

The Wubizixing method is based on a character's overall structure and its first, second and final stroke. It is not easy to learn but, once learned, it is a very fast method because every character can be written with at most 5 keystrokes.

The Wubihua method is based on the first five strokes; it needs only five keys and can be used on a numeric keypad. It is easy to learn but very slow to use, because every input gives too many candidates to choose from.

Input by encoding

The input-by-encoding method is based on a fixed encoded value for each character. One enters a character code and gets the single matching character, without a candidate list. This method is very fast if you remember the codes, which are most often given as hexadecimal values.

Text orientation

In Simplified Chinese almost all publications are printed horizontally from left to right. The vertical direction, starting from the right side, is used only in advertising and some special cases. A software user interface has text in its titles, menus, input property names and input areas, action buttons and user messages.
Besides the user interface, a product also has printed or digital documents such as an installation guide, system and maintenance documents, training material etc. Both the user interface and the other material of a software product can be written horizontally.

Word separation, word wrapping and hyphenation

In Western languages words are separated by a space. The Chinese language does not separate words at all; it is up to the reader to understand which adjacent characters belong to one word. One can only trust that characters on different sides of a punctuation mark do not form a word. Hyphenation is not a problem in Chinese because the language does not have inflections, as is the case in Western languages and e.g. in Japanese.

There are also a few characters that should not begin or terminate a line. Most of them are punctuation and enclosing characters. Characters that may not begin a line are punctuation marks (e.g. , . : ; - ? !), symbols that need to be close to the preceding text (e.g. %), closing quotes, closing parentheses, closing brackets etc. ( } ) ] ). Characters that may not terminate a line are opening quotes, symbols that need to be connected to the following text (e.g. $ #), opening parentheses, opening brackets etc. ( { ( [ ).

Punctuation

In Chinese some punctuation marks are the same as those used in Western languages, but they occupy a square of the same size as the characters around them. The following list has punctuation marks that are used both in Chinese and, for example, in English:

a comma ,
a semicolon ;
a colon :
an exclamation mark !
a question mark ?
parentheses and brackets ( ) [ ]

There are also some differences:

A period: In Chinese a sentence usually ends not with a period (.) but with a small circle (。). The period is sometimes used in technical texts.
Quotation marks: There are four kinds of quotation marks: “ ” and ‘ ’ as in English, and besides those also 「 」 and 『 』.
A caesura: It expresses a short pause and is marked (、). It is used as a serial comma between equal items in a list.
An ellipsis: In Chinese an ellipsis is expressed with six dots (……), taking the space of two characters (two groups of three dots).
A dash: A dash is a horizontal line that takes the space of two characters.
Hyphens: There are four kinds of hyphens, which take the space of 0.5 to 2 characters: one takes the space of one character, one takes the space of two characters, (-) takes the space of half a character and (~) takes the space of one character.
Separating dot: A separating dot (·) is used to separate the parts of a foreigner's name, or the volume and chapter names of a book. For example, 列奧納多·達·芬奇 is Leonardo da Vinci and 中国大百科全书·物理学 is the Encyclopedia of China, Physics.
Book title marks: While Western languages use quotes around book titles, Chinese has special marks for that: 《 》 and 〈 〉.
A proper noun mark: If a noun is underlined, it is a proper noun. This mark is used occasionally in teaching material.
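The line-start and line-end restrictions described under word wrapping can be sketched as a simple wrapping routine. This is a minimal sketch under stated assumptions: the two character sets are small illustrative samples rather than a complete rule set, the function name wrap is hypothetical, and every character is treated as having the same width. When a break would violate a rule, the break point is moved to the left, so the offending character is pushed onto the next line together with its neighbour.

```python
# Characters that may not begin a line and characters that may not end one.
# Both sets are small illustrative samples, not a complete rule set.
NO_LINE_START = set(",.:;-?!%)]}，。：；？！、”’》」』）")
NO_LINE_END = set("$#([{“‘《「『（")


def wrap(text, width):
    """Break text into lines of at most `width` characters, obeying the rules."""
    lines = []
    start = 0
    while start < len(text):
        end = min(start + width, len(text))
        # Move the break point left while the next line would start with a
        # forbidden character or this line would end with one.
        while start + 1 < end < len(text) and (
            text[end] in NO_LINE_START or text[end - 1] in NO_LINE_END
        ):
            end -= 1
        lines.append(text[start:end])
        start = end
    return lines
```

For example, wrapping 我喜欢汉字，也喜欢拼音。 at a width of five characters breaks before 字 instead of before the comma, so the comma does not begin the second line.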

3.3 Cultural aspects

Numerals

Both Arabic numerals and Chinese characters for numerals are used in China. Arabic numerals are used in mathematics and on many other occasions, but Chinese characters are mostly used when numerals appear within text. If Chinese characters are used, there are no spaces separating the numbers from the text. There are also specific characters that are used for numerals in financial documents.

Zero is expressed with the Arabic numeral 0, the Chinese zero 〇 (ling2) or the Chinese character 零 (ling2). Two can be expressed with the Arabic numeral 2 or with two different Chinese characters, 二 (er4) or 两 (liang3). Their usage has specific rules.

The most difficult difference between Chinese numerals and those of Western languages is the way units are grouped. In Western languages a unit gets a new name after each thousand; in Chinese the unit changes after each ten thousand: after 千 (qian1) = thousand come 万 (wan4) = ten thousand (10^4), 亿 (yi4) = a hundred million (10^8) and 兆 (zhao4) = a million million (10^12).

Cardinal numbers in Chinese characters:

〇 ling2 zero
零 ling2 zero
一 yi1 one
二 er4 two
两 liang3 two
三 san1 three
四 si4 four
五 wu3 five
六 liu4 six
七 qi1 seven
八 ba1 eight
九 jiu3 nine
十 shi2 ten
百 bai3 hundred
千 qian1 thousand
万 wan4 ten thousand
亿 yi4 a hundred million (wan4 wan4)
兆 zhao4 a million million (wan4yi4)

Ordinal numbers are formed by adding the character 第 (di4) before the numeral, e.g. 第三 (di4san1) means the third. There are also nouns with which one uses ordinals in English and Finnish but cardinals in Chinese, e.g. the 3rd floor is 三楼 (san1lou3) in Chinese.

Twenty and thirty are normally written 二十 and 三十, but they also have shorthand characters:

廿 nian4 twenty
卅 sa4 thirty
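The grouping by ten thousand can be illustrated with a short Python sketch. It only splits a number into four-digit groups and attaches the unit names 万, 亿 and 兆 given in the text; it does not produce full Chinese numerals. The names MYRIAD_UNITS and myriad_groups are hypothetical.

```python
# Unit names for successive powers of ten thousand, as listed in the text.
MYRIAD_UNITS = ["", "万", "亿", "兆"]


def myriad_groups(number):
    """Split a non-negative integer below 10**16 into (value, unit) pairs."""
    groups = []
    index = 0
    while number > 0:
        # Peel off the lowest four digits; they belong to the current unit.
        number, value = divmod(number, 10_000)
        if value:
            groups.append((value, MYRIAD_UNITS[index]))
        index += 1
    return groups[::-1]  # biggest unit first
```

The number written 123,456,789 in Western grouping becomes 1 亿, 2345 万 and 6789 here, which shows why three-digit grouping does not line up with the Chinese unit names.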

There are also different characters used for numbers in financial documents:

零 ling2 zero
壹 yi1 one
贰 er4 two
叄 san1 three
參 san1 three
肆 si4 four
伍 wu3 five
陆 liu4 six
柒 qi1 seven
捌 ba1 eight
玖 jiu3 nine
拾 shi2 ten
佰 bai3 hundred
仟 qian1 thousand

Digits in telephone numbers are in groups of three.

If the number is a year, it can be expressed in several different ways. For example, the year 2005 has the following forms: 2005, 两〇〇五, 二零零五, 两零零五. If it is not a year, 2005 is 两千零五 and 2000 is 二千 or 两千.

The negative sign is a hyphen (-) and the decimal separator is a period (.). When Arabic numerals are used, the thousands separator is a comma (,). Fractions and percentages can also be expressed in Chinese characters.

Date and Time

In Chinese culture things are usually expressed from a bigger unit to a smaller one. That is why dates are always in the form year-month-day. A date can be expressed in Arabic numerals or Chinese characters. If Arabic numerals are used, the separator can always be a hyphen (-). If Chinese characters are used, the year, month and day can be separated by the Chinese characters that mean a year, 年 (nian2), a month, 月 (yue4), and a day, 日 (ri4) or 号 (hao4).

A year in Arabic numerals can be expressed in two or four digits. With Chinese characters all four digits are written out. If a year has zeros, the year expressed in Chinese characters uses a Chinese zero, not the Arabic one. The year 2005 can be expressed in several ways, e.g. 05, 2005, 2005年, 二〇〇五年, 二零零五年, 两零零五年.
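Writing a year digit by digit, as described above, is a plain character-for-character mapping. The following Python sketch assumes a four-digit year and lets the caller choose between the two Chinese zeros; the names CHINESE_DIGITS and year_in_chinese are hypothetical.

```python
# Chinese digit characters indexed by their numeric value.
CHINESE_DIGITS = "〇一二三四五六七八九"


def year_in_chinese(year, zero="〇"):
    """Write a year digit by digit in Chinese characters, followed by 年."""
    digits = "".join(CHINESE_DIGITS[int(d)] for d in f"{year:04d}")
    # Either 〇 or 零 may stand for zero; an Arabic 0 is never mixed in.
    return digits.replace("〇", zero) + "年"
```

Here year_in_chinese(2005) gives 二〇〇五年 and year_in_chinese(2005, zero="零") gives 二零零五年, two of the forms listed above.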

A month in Arabic numerals can be expressed in one or two digits. Months can also be expressed with their Chinese names, which consist of the number of the month and the character that means a month, 月 (yue4): 一月, 二月, 三月, 四月, 五月, 六月, 七月, 八月, 九月, 十月, 十一月, 十二月.

A day in Arabic numerals can be expressed in one or two digits. Days can also be expressed with Chinese characters that consist of the number of the day and a character that means a day, 日 (ri4) or 号 (hao4), e.g. 一日, 二日, …, 三十日, 三十一日.

Morning and afternoon, if they need to be expressed, are expressed with 上午 (shang4wu3) for morning and 下午 (xia4wu3) for afternoon. They can be part of an expression of time, or stand alone to express the whole morning or afternoon.

As a conclusion, the date of September the 17th in 2005 can be written in Chinese in the following ways:

2005年9月17日
二〇〇五年九月十七日
二零零五年九月十七日
两零零五年九月十七日

Time can be expressed with Arabic numerals or Chinese characters, depending on the situation. If time is expressed in Arabic numerals, the separator between hours, minutes and seconds is always a colon (:). If it is expressed with Chinese characters, the hour separator is 点 (dian3), the minute separator is 分 (fen1) and the second separator is 秒 (miao3). The expressions 半 (ban4), which means a half, and 刻 (ke4), which means a quarter, can also be used. As an example, the time 18:12 can be expressed in several different ways, e.g.

18:12
下午 6:12 (xia4wu3)
六点十二 (liu4dian3shi2er4)
六点十二分 (liu4dian3shi2er4fen1)
下午六点十二 (xia4wu3liu4dian3shi2er4)
十八点十二分 (shi2ba1dian3shi2er4fen1)

But you can never say 6:12 PM, with the afternoon marker after the time. If all units are expressed they must be in the following order: year, month, date, day of the week, morning or afternoon, hours:minutes:seconds.
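The ordering rules for dates and times can be sketched with Python's standard datetime module. The sketch below uses Arabic numerals with the Chinese separators and puts the morning/afternoon word before the hour; the function names are hypothetical, and special wording for midnight, half hours (半) and quarters (刻) is left out.

```python
from datetime import datetime


def format_date(moment):
    """Biggest unit first: year, month, day, with 年/月/日 as separators."""
    return f"{moment.year}年{moment.month}月{moment.day}日"


def format_time_12h(moment):
    """12-hour time; the morning/afternoon word precedes the hour."""
    period = "上午" if moment.hour < 12 else "下午"
    hour = moment.hour % 12 or 12  # 0 and 12 are both shown as 12
    return f"{period}{hour}点{moment.minute}分"


def format_time_24h(moment):
    """24-hour time in Arabic numerals with a colon as the separator."""
    return f"{moment.hour}:{moment.minute:02d}"
```

For datetime(2005, 9, 17, 18, 12) these functions give 2005年9月17日, 下午6点12分 and 18:12 respectively.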

3.3.3 Currency

The Chinese currency is the Renminbi (RMB), whose unit is called 元 (yuan2). Its symbol is ¥. It has two smaller units called 角 (jiao3) and 分 (fen1). One yuan is ten jiao and one jiao is ten fen. In everyday speech the yuan is called 块 (kuai4) and the jiao is called 毛 (mao2). A period (.) separates the smaller units from the yuan, and the number of decimal digits is two. The symbol ¥ is set before the amount of money. A negative sum of money is expressed with a hyphen (-) in front of the symbol. If the amount is bigger than 999 yuan, the thousands separator is a comma (,), as is the case with other numbers. The amount of money can also be expressed with Chinese characters representing numerals, e.g. 100元 is the same as 百元.

Measurement Units

The SI (metric) system is used in China. The mandatory GB standard equivalent to ISO 1000:1992 consists of the SI units and recommendations for their usage. The GB standard equivalent to ISO 31-0:1992 gives general principles concerning quantities, units and symbols in scientific and educational documents. Chinese measurement units also exist, and if they are used there might be a need for conversion tables. The following are conversion tables for units of length, area, weight and capacity.

Units of Length:

1 寸 (cun4) = 3.333 cm
1 尺 (chi3) = 33.33 cm = 10 cun
1 丈 (zhang4) = 3.333 m = 10 chi
1 引 (yin3) = 33.33 m
1 里 (li3) = 500 m
1 公里 (gong1li3) = 1 km

Units of Area:

1 平方尺 (ping2fang1chi3) = 1 square chi = 1/9 m² = 11.111 dm²
1 亩 (mu3) = 60 square zhang = 1/15 hm² = 6.667 ares = 2000/3 m²
1 顷 (qing3) = 100 mu = 6.667 hectares
1 平方里 (ping2fang1li3) = 1 square li = 25 ha

Units of Weight:

1 钱 (qian2) = 5 g
1 两 (liang3) = 50 g = 10 qian
1 斤 (jin1) = 500 g = 10 liang
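A conversion table like the ones above translates directly into a lookup table in code. The following Python sketch covers a few of the length units only and uses exact fractions, because the table values are rounded forms of thirds of a metre (1 chi = 1/3 m); the names METRES_PER_UNIT and to_metres are hypothetical.

```python
from fractions import Fraction

# Metres per unit, taken from the length table (1 chi = 33.33 cm = 1/3 m).
METRES_PER_UNIT = {
    "cun": Fraction(1, 30),
    "chi": Fraction(1, 3),
    "zhang": Fraction(10, 3),
    "li": Fraction(500),
}


def to_metres(amount, unit):
    """Convert an amount given in a Chinese length unit to metres."""
    # Exact fractions avoid accumulating the rounding error of 0.3333...
    return float(Fraction(amount) * METRES_PER_UNIT[unit])
```

With exact fractions, to_metres(3, "chi") is exactly 1.0 metre, which the rounded table value of 33.33 cm would not give.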
