Frequency and Stroke Counts of Chinese Characters

Published: 1996-01-01

Updated: 2005-08-03

Introduction

This page lists the frequency of usage and the number of strokes for each of the 13,060 Chinese characters defined in the BIG5 coding scheme.

About the Corpus

The corpus consists of all the BIG5 Chinese characters that appeared on Usenet newsgroups during 1993-1994, a total of 171,882,493 characters. It is perhaps by far the largest Chinese corpus in the world.

Character Frequency

These frequency statistics were first reported separately for 1993 (80KB zip archive; 407KB HTML [UTF-8]) and 1994 (101KB zip archive; 538KB HTML [UTF-8]) (note) by Shih-Kun Huang at the Department of Computer Science and Information Engineering, National Chiao-Tung University, Taiwan. (He is now at the Institute of Information Science, Academia Sinica, Taiwan.)

For detailed information about the nature of the corpus and the programs used, you should contact Shih-Kun Huang (Email: skhuang@iis.sinica.edu.tw).

Number of Strokes

The number of strokes was extracted from the Chinese Character Database (CCDB) developed by the Chinese Character Analysis Group for Information Application, the Council for Cultural Planning and Development, Executive Yuan, Taiwan. (Available at ftp://nctuccca.edu.tw/Chinese/CCDB.) I did this part of the work.

Some Statistics

Frequency-weighted average number of strokes:

View

Download

These lists contain the pooled (1993+1994) data, which I calculated.
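As a rough illustration of how the pooled counts and the frequency-weighted average number of strokes might be derived, here is a minimal Python sketch. It is not the program actually used for this page. The function names, the dictionary layout, and the toy counts are all assumptions made for the example. The average is taken as the sum of (frequency x strokes) over all characters, divided by the sum of the frequencies.

```python
from collections import Counter


def pool_counts(*yearly_counts):
    """Add up per-character counts from several years (hypothetical helper)."""
    pooled = Counter()
    for counts in yearly_counts:
        pooled.update(counts)  # Counter.update adds counts rather than replacing them
    return pooled


def weighted_average_strokes(freq, strokes):
    """Frequency-weighted mean stroke count: sum(f * s) / sum(f).

    Characters without a stroke entry are left out of both sums.
    """
    total = sum(f for ch, f in freq.items() if ch in strokes)
    if total == 0:
        raise ValueError("no character has both a count and a stroke number")
    weighted = sum(f * strokes[ch] for ch, f in freq.items() if ch in strokes)
    return weighted / total


# Toy counts invented for illustration only; they are NOT taken from the corpus.
counts_1993 = {"的": 30, "是": 10, "一": 5}
counts_1994 = {"的": 50, "是": 10, "一": 5}
strokes = {"的": 8, "是": 9, "一": 1}

pooled = pool_counts(counts_1993, counts_1994)
avg = weighted_average_strokes(pooled, strokes)
print(f"{avg:.2f}")
```

Weighting by frequency means that very common characters dominate the result, so the figure describes the strokes in a typical character of running text rather than in a typical entry of the character set.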