Frequency and Stroke Counts of Chinese Characters
Copyright © 1996-2006 Chih-Hao Tsai
This list gives the frequency of usage and the number of strokes for each of the 13,060 Chinese characters defined in the BIG5 coding scheme.
About the Corpus
The corpus consists of all the BIG5 Chinese characters that appeared in Usenet newsgroups during 1993-1994, a total of 171,882,493 characters. This is perhaps the largest Chinese corpus in the world.
These frequency statistics were first reported separately for 1993 (80KB zip archive; 407KB HTML [UTF-8]) and 1994 (101KB zip archive; 538KB HTML [UTF-8]) (note) by Shih-Kun Huang at the Department of Computer Science and Information Engineering, National Chiao-Tung University, Taiwan. (He is now at the Institute of Information Science, Academia Sinica, Taiwan.)
Number of Strokes
The number of strokes was extracted from the Chinese Character Database (CCDB) developed by the Chinese Character Analysis Group for Information Application, the Council for Cultural Planning and Development, Executive Yuan, Taiwan. (Available at ftp://nctuccca.edu.tw/Chinese/CCDB.) I did this part of the work.
Frequency-weighted average number of strokes:
- For the most frequently used 2,965 characters: 9.10;
- For the most frequently used 1,253 characters: 8.91;
- For the most frequently used 733 characters: 8.65.
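The averages above can be reproduced along the following lines. This is only a minimal sketch. It assumes that "frequency-weighted" means each character's stroke count is weighted by its number of occurrences in the corpus. The function name, the record layout, and the sample frequencies are hypothetical and are not taken from the actual lists.

```python
from typing import Iterable, Tuple

# Hypothetical record layout: (character, frequency, strokes).
Record = Tuple[str, int, int]


def weighted_mean_strokes(records: Iterable[Record], top_n: int) -> float:
    """Frequency-weighted mean stroke count of the top_n most frequent characters.

    Assumption: the weight of each character is its raw occurrence count.
    """
    # Rank by frequency, highest first, and keep the top_n characters.
    ranked = sorted(records, key=lambda r: r[1], reverse=True)[:top_n]
    total_freq = sum(freq for _, freq, _ in ranked)
    if total_freq == 0:
        raise ValueError("no occurrences among the selected characters")
    # Sum of (frequency * strokes) divided by the sum of frequencies.
    return sum(freq * strokes for _, freq, strokes in ranked) / total_freq


# Made-up frequencies, for illustration only.
sample = [("的", 60, 8), ("一", 30, 1), ("是", 10, 9)]
print(weighted_mean_strokes(sample, 3))
```

Because frequent characters dominate the sum, the result can differ considerably from the plain (unweighted) mean of the same stroke counts.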
These lists contain pooled data (1993 and 1994 combined), which I calculated.