A string that includes every Pinyin printable character or punctuation
mark (including whitespace). This can be used as a whitelist for Pinyin text.
Constant format: characters listed individually
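A whitelist check of this kind can be sketched as follows. The PRINTABLE_SAMPLE string is a small hypothetical stand-in so that the snippet runs without Zhon installed; in practice you would substitute zhon.pinyin.printable. The is_whitelisted helper is likewise illustrative and not part of Zhon.

```python
# Hypothetical stand-in for zhon.pinyin.printable: a deliberately tiny
# whitelist covering only the characters this example needs.
PRINTABLE_SAMPLE = 'abcdefghijklmnopqrstuvwxyzāáǎàēéěèīíǐìōóǒòūúǔùǖǘǚǜü .,!?'


def is_whitelisted(text, allowed=PRINTABLE_SAMPLE):
    """Return True if every character of text appears in allowed."""
    allowed_set = set(allowed)  # set lookup avoids rescanning the string
    return all(ch in allowed_set for ch in text)


print(is_whitelisted('nǐ hǎo!'))      # True: only whitelisted characters
print(is_whitelisted('nǐ hǎo 你好'))  # False: contains Han characters
```

Converting the string to a set first keeps each lookup cheap when the text being checked is long.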

zhon.pinyin.syllable or zhon.pinyin.syl

A regular expression pattern that matches a valid Pinyin syllable (accented or
numbered). Use with the re.I flag if you want to match uppercase
letters as well.
Constant format: regular expression pattern
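The effect of the re.I flag can be illustrated with a toy pattern. TOY_SYLLABLE below is a hypothetical, heavily simplified stand-in that recognizes only three numbered syllables; it is not Zhon's pattern. With Zhon installed you would pass zhon.pinyin.syllable instead.

```python
import re

# Hypothetical stand-in for zhon.pinyin.syllable. It is written in
# lowercase only, which is why re.I matters for capitalized input.
TOY_SYLLABLE = r'(?:ni|hao|ma)[1-5]'

text = 'Ni3 hao3 ma5'

lowercase_only = re.findall(TOY_SYLLABLE, text)
any_case = re.findall(TOY_SYLLABLE, text, re.I)

print(lowercase_only)  # ['hao3', 'ma5']
print(any_case)        # ['Ni3', 'hao3', 'ma5']
```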

zhon.pinyin.sentence or zhon.pinyin.sent

A regular expression pattern that matches a valid Pinyin sentence (accented or
numbered). Use with the re.I flag if you want to match uppercase
letters as well.
Constant format: regular expression pattern

zhon.cedict.traditional

A string containing characters considered by CC-CEDICT to be Traditional
Chinese characters. Some of these characters are also present in
zhon.cedict.simplified because many characters were left untouched by
the simplification process.
Constant format: characters listed individually

zhon.cedict.simplified

A string containing characters considered by CC-CEDICT to be Simplified
Chinese characters. Some of these characters are also present in
zhon.cedict.traditional because many characters were left untouched by
the simplification process.
Constant format: characters listed individually
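The overlap between the two constants can be illustrated with set operations. The two sample strings below are small, hand-picked, hypothetical stand-ins for zhon.cedict.traditional and zhon.cedict.simplified; the real constants are far larger.

```python
# Hypothetical hand-picked samples standing in for the CC-CEDICT constants.
TRADITIONAL_SAMPLE = '我了一買輛車'
SIMPLIFIED_SAMPLE = '我了一买辆车'

traditional = set(TRADITIONAL_SAMPLE)
simplified = set(SIMPLIFIED_SAMPLE)

shared = traditional & simplified            # untouched by simplification
traditional_only = traditional - simplified  # have a distinct simplified form
simplified_only = simplified - traditional   # exist only in simplified text

print(sorted(shared))
print(sorted(traditional_only))
print(sorted(simplified_only))
```

Because of the shared characters, a membership test against either constant on its own cannot establish that a string is written in one script rather than the other.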

Using Zhon’s Constants

Using the constants listed above is simple. For constants that list the
characters individually, you can perform membership tests or use them in
regular expressions:

>>> import re
>>> import zhon.cedict
>>> '车' in zhon.cedict.traditional
False
>>> # This regular expression finds all characters that aren't considered
... # traditional according to CC-CEDICT.
... re.findall('[^%s]' % zhon.cedict.traditional, '我买了一辆车')
['买', '辆', '车']

For constants that contain character code ranges, you’ll want to build a
regular expression:

>>> import zhon.hanzi
>>> re.findall('[%s]' % zhon.hanzi.punctuation, '我买了一辆车。')
['。']

For constants that are regular expression patterns, you can use them directly
with the regular expression library, without formatting them.
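The difference between the two usage styles can be sketched as follows. Both constants below are hypothetical stand-ins so that the snippet runs without Zhon: CHARS_SAMPLE mimics a constant that lists characters individually, and PATTERN_SAMPLE mimics a constant that is already a complete pattern, such as zhon.pinyin.syllable.

```python
import re

# Hypothetical stand-ins, not Zhon's actual values.
CHARS_SAMPLE = '。！？'                  # characters listed individually
PATTERN_SAMPLE = r'[^。！？]+[。！？]'   # already a full regular expression

text = '我买了一辆车。很好吃！'

# A character-list constant must be wrapped in a character class first.
marks = re.findall('[%s]' % CHARS_SAMPLE, text)

# A pattern constant is passed to the re module unchanged.
sentences = re.findall(PATTERN_SAMPLE, text)

print(marks)      # ['。', '！']
print(sentences)  # ['我买了一辆车。', '很好吃！']
```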

Identifying Text as Chinese

Identifying a character, word, or sentence as Chinese is not a simple
undertaking. Zhon’s hanzi module includes Han ideographs, which are not the
same thing as Chinese characters. Chapter 12 of The Unicode Standard has some
useful information about this:

There is some concern that unifying the Han characters may lead to confusion because they are sometimes used differently by the various East Asian languages. Computationally, Han character unification presents no more difficulty than employing a single Latin character set that is used to write languages as different as English and French. Programmers do not expect the characters “c”, “h”, “a”, and “t” alone to tell us whether chat is a French word for cat or an English word meaning “informal talk.” Likewise, we depend on context to identify the American hood (of a car) with the British bonnet. Few computer users are confused by the fact that ASCII can also be used to represent such words as the Welsh word ynghyd, which are strange looking to English eyes. Although it would be convenient to identify words by language for programs such as spell-checkers, it is neither practical nor productive to encode a separate Latin character set for every language that uses it.

In other words, don’t expect Zhon constants to identify a string as Chinese as
opposed to Japanese or Korean. Zhon’s hanzi.characters constant represents all
Han characters, not Chinese characters.
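A short sketch of why this matters: the same Han-range test accepts characters from Japanese text. HAN_BLOCK below covers only the CJK Unified Ideographs block (U+4E00 to U+9FFF) and is an assumed, narrower stand-in for zhon.hanzi.characters.

```python
import re

# Assumed stand-in: the basic CJK Unified Ideographs block only.
HAN_BLOCK = '\u4e00-\u9fff'
han = re.compile('[%s]' % HAN_BLOCK)

chinese = '我买了一辆车'
japanese = '日本語のテキスト'

print(han.findall(chinese))   # every character is a Han ideograph
print(han.findall(japanese))  # ['日', '本', '語']: kanji match, kana do not
```

Telling the two languages apart needs additional signals, such as the presence of kana or a language-detection step, which is outside what these constants provide.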

Name

Zhon is short for ZHongwen cONstants. It is pronounced like the name ‘John’.