On Unicode and the Chinese Writing System
Copyright © 1995-2006 Chih-Hao Tsai
The Evolution of Character Coding Systems
ASCII, a 7-bit coding system that can code 128 characters, was the first standard character coding system. It handles the English alphabet quite well, but cannot code all the alphabets of European languages. As a result, the 8-bit coding system Latin-1 (ISO 8859-1), which can code 256 characters, was developed in the 1980s to include most Western European alphabets. However, 8 bits are certainly not enough to code Chinese characters or the symbols of other logographic or morphosyllabic writing systems. Therefore, in 1983, ISO 10646, which used 32 bits to code over 4 billion characters, emerged. One of the problems of ISO 10646 is that its 32-bit code is large: every character takes four bytes, which severely reduces the data transmission rate. In 1987, the 16-bit Unicode, which can code 65,536 characters, was developed. Since then, the Unicode Standard has undergone several changes and has attracted much attention.
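The capacities quoted above follow from the code width: an n-bit code can represent 2^n distinct values. The short Python sketch below checks that arithmetic and illustrates the size cost of a fixed 32-bit code. The helper name `capacity` is hypothetical, and Python's `utf-16-be` and `utf-32-be` codecs are used only as convenient stand-ins for fixed 16-bit and 32-bit codes; neither appears in the original discussion.

```python
def capacity(bits: int) -> int:
    """Number of distinct codes an n-bit coding system can represent."""
    return 2 ** bits


# Code widths mentioned in the text above.
for name, bits in [("ASCII", 7), ("Latin-1", 8),
                   ("16-bit Unicode", 16), ("32-bit ISO 10646", 32)]:
    print(f"{name}: {bits} bits -> {capacity(bits):,} codes")

# Storage cost of the same two-character text under a 16-bit and a 32-bit code.
# The "-be" codec variants are used so that no byte-order mark is added.
text = "中文"
size_16 = len(text.encode("utf-16-be"))
size_32 = len(text.encode("utf-32-be"))
print(f"16-bit: {size_16} bytes, 32-bit: {size_32} bytes")
```

For characters that fit in 16 bits, the 32-bit form is twice as large, which is the transmission cost the paragraph refers to.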
Is Unicode Really Universal?
In Unicode, the Chinese characters used to write Chinese, Japanese, and Korean are consolidated into a single common repertoire. This is achieved by assigning only one code point to each Chinese character. This raises a problem: how should the characters be ordered? In the GB and JIS coding schemes, the characters are ordered by their pronunciations. In the BIG5 coding scheme, the characters are ordered by frequency first: 5,401 commonly used characters followed by 7,652 less commonly used characters. Within each block, the characters are ordered by number of strokes. Although the ordering of the Chinese characters in Unicode is claimed to be culturally neutral, it is nevertheless different from that of GB, BIG5, or JIS. To map the Unicode Chinese characters to the different major encoding standards, different mapping tables are provided. However, those mapping tables and the additional processing they require make Unicode no longer simple and universal: one does not need a mapping table when processing the English alphabet, but one does need them when processing Chinese characters or other non-English alphabets. Many people think that the proposal for "elimination of duplicate Chinese characters" reflects the egocentric thinking of speakers of the English language.
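The role of the mapping tables can be seen with Python's built-in codecs, which carry such tables internally. The sketch below is only an illustration: the helper name `legacy_bytes` is hypothetical, and the choice of the `big5`, `gb2312`, and `shift_jis` codecs (Shift_JIS being one common encoding of the JIS character set) is an assumption made for the example.

```python
def legacy_bytes(ch: str) -> dict:
    """Encode one character in several legacy encodings, as hex strings."""
    result = {}
    for codec in ("big5", "gb2312", "shift_jis"):
        try:
            result[codec] = ch.encode(codec).hex()
        except UnicodeEncodeError:
            result[codec] = None  # character is not in that standard
    return result


# One Chinese character: a single Unicode code point, but a different
# byte sequence in each legacy standard, so a table lookup is required.
print(hex(ord("中")), legacy_bytes("中"))

# One English letter: the same value everywhere, so no table is needed.
print(hex(ord("A")), legacy_bytes("A"))
```

The letter A has the same code in all three encodings and in Unicode, while the Chinese character has a different code in each, which is the asymmetry described above.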
The Impact of Unicode On the Chinese Writing System
At present, Unicode supports over 70,000 CJK characters. This may seem like a lot, but the number of new characters is expected to keep increasing, because Chinese people are inventing new characters every day. They do not invent new characters to express new concepts. Rather, the new characters are used in personal names. The number of different characters used in personal names is far greater than the number of characters that are actually used in recording spoken Chinese. If people continue to invent new characters, the code space provided by Unicode will be used up very soon. In fact, the original 16-bit code space is already used up; 27,487 (20,902+6,585) is perhaps the maximum number of Chinese characters it can hold.
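The figures in this paragraph can be checked with a few lines of Python. This is a minimal sketch under one assumption that the text does not state explicitly: that "code space" here means the 65,536 code points of a 16-bit code. The variable names and the helper `fits_in_16_bits` are hypothetical.

```python
CODE_SPACE_16_BIT = 2 ** 16  # 65,536 code points in a 16-bit code

# Character counts quoted in the paragraph above.
first_block = 20_902
second_block = 6_585
cjk_total = first_block + second_block

# Share of the 16-bit space taken by Chinese characters alone.
share = cjk_total / CODE_SPACE_16_BIT
print(f"{cjk_total:,} characters = {share:.1%} of a 16-bit code space")


def fits_in_16_bits(ch: str) -> bool:
    """True if the character's code point can be stored in 16 bits."""
    return ord(ch) < CODE_SPACE_16_BIT


print(fits_in_16_bits("中"))           # U+4E2D
print(fits_in_16_bits(chr(0x20000)))  # a code point beyond 16 bits
```

Chinese characters alone take up roughly two fifths of a 16-bit space that must also hold every other script, so any further characters have to be given code points that no longer fit in 16 bits.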
If all other languages, except Chinese, are satisfied with Unicode, then Chinese characters will be in trouble. If Chinese people want to conform to the Unicode Standard, they must prevent new characters from being invented and use the Unicode Standard to regulate character use. This will be a kind of writing reform. It may not be as radical as character simplification or romanization, but it may well encounter a certain amount of resistance.
More Reflections
This little article is oriented toward language rather than technology. I am more concerned with language and the writing system than with a universal coding system. Not too long ago, many Chinese people thought that their writing system was inferior and was not 'scientific' enough. Shocked by the technical advances of European and American countries, they disliked their own writing system and thought that it needed to be reformed. People first proposed to abolish Chinese characters completely. People also proposed to simplify the characters. Their reasons were: (1) the literacy rate was low, because the characters were too hard to learn; and (2) the characters could not be computerized.
However, the literacy rate in Taiwan (where people use the traditional Chinese writing system) is just as high as that of Japan or the United States. This demonstrates that the characters have not prevented the people who use them from reaching high literacy. The economy, the quality of education, and other non-linguistic factors are more responsible. Also, with the advance of technology, the Chinese writing system can now be computerized quite well.
I view Unicode from the same perspective. If Unicode distorts some aspects of the Chinese writing system, then we should improve Unicode. A writing system should not be changed solely for technical reasons.
I also want to stress that the Chinese writing system itself needs to be reformed, too. Eliminating unnecessary characters is good for the standardization of the Chinese writing system. The limited coding space of Unicode sheds light on the problem of dead characters. Personally, I am on the Unicode side.
References
Sheldon, K. M. (1991, July). ASCII goes global. Byte, 108-116.
Unicode Inc. (1993). Unicode glossary [On-Line]. Available http://www.unicode.org/glossary/
Unicode Inc. (1994). Han unification [On-Line]. Available http://www.unicode.org/book/appA.pdf