Frequency and Stroke Counts of Chinese Characters
Copyright © 1996-2006 Chih-Hao Tsai
Introduction
This is a list of the usage frequency and stroke count of each of the 13,060 Chinese characters defined in the BIG5 coding scheme.
About the Corpus
The corpus consists of all the BIG5 Chinese characters that appeared in Usenet newsgroups during 1993-1994, a total of 171,882,493 characters. It is perhaps by far the largest Chinese corpus in the world.
Character Frequency
These frequency statistics were first reported separately for 1993 (80KB zip archive; 407KB HTML [UTF-8]) and 1994 (101KB zip archive; 538KB HTML [UTF-8]) (note) by Shih-Kun Huang, who was then at the Department of Computer Science and Information Engineering, National Chiao-Tung University, Taiwan. (He is now at the Institute of Information Science, Academia Sinica, Taiwan.)
For detailed information about the nature of the corpus and the programs used, you should contact Shih-Kun Huang (Email: skhuang@iis.sinica.edu.tw).
Number of Strokes
The number of strokes was extracted from the Chinese Character Database (CCDB) developed by the Chinese Character Analysis Group for Information Application, the Council for Cultural Planning and Development, Executive Yuan, Taiwan. (Available at ftp://nctuccca.edu.tw/Chinese/CCDB.) I did this part of the work.
Some Statistics
Frequency-weighted average number of strokes:
- For the most frequently used 2,965 characters: 9.10;
- For the most frequently used 1,253 characters: 8.91;
- For the most frequently used 733 characters: 8.65.
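The frequency-weighted averages above can be reproduced along the following lines. This is only a minimal sketch: the function name, the (character, frequency, strokes) tuple layout, and the sample frequencies are assumptions made for illustration and are not taken from the distributed lists or from the programs actually used.

```python
def weighted_average_strokes(entries, top_n):
    """Frequency-weighted mean stroke count of the top_n most frequent characters.

    entries: iterable of (character, frequency, strokes) tuples.
    This layout is hypothetical; adapt it to the actual file format.
    """
    # Rank by frequency, highest first, and keep only the top_n entries.
    ranked = sorted(entries, key=lambda e: e[1], reverse=True)[:top_n]
    total_freq = sum(freq for _, freq, _ in ranked)
    if total_freq == 0:
        raise ValueError("no occurrences in the selected entries")
    # Each character contributes its stroke count once per occurrence.
    return sum(freq * strokes for _, freq, strokes in ranked) / total_freq


# Made-up frequencies, for illustration only.
sample = [("的", 60, 8), ("一", 30, 1), ("是", 10, 9)]
print(weighted_average_strokes(sample, 3))
```

Weighting by frequency means that a common character with few strokes pulls the average down more than a rare one does, which is consistent with the average falling as the list is restricted to the most frequent characters.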
View
- Sorted by character (269KB HTML [UTF-8])
- Sorted by frequency (269KB HTML [UTF-8])
Download
These lists consist of pooled (1993+1994) data calculated by me.
- characters.zip (58KB; sorted by characters).
- sorted.zip (53KB; sorted by frequency).
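Pooling the two yearly lists amounts to adding the per-character counts and then sorting the result in the two orders offered above. The sketch below assumes each year's data has already been loaded into a dictionary mapping a character to its count; the helper name and the sample counts are hypothetical, and this is not the program that produced the archives.

```python
from collections import Counter


def pool_counts(*yearly_counts):
    """Sum per-character counts across any number of yearly tables."""
    pooled = Counter()
    for counts in yearly_counts:
        # Counter.update adds counts rather than replacing them.
        pooled.update(counts)
    return pooled


# Made-up counts, for illustration only.
counts_1993 = {"的": 5, "一": 3}
counts_1994 = {"的": 7, "是": 2}

pooled = pool_counts(counts_1993, counts_1994)

# Highest frequency first.
by_frequency = pooled.most_common()

# Sorted by Unicode code point here (an assumption); the distributed
# list may instead follow BIG5 code order.
by_character = sorted(pooled.items())
```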