CWFC - A Chinese Word Frequency Counter

Published: 2001-04-07

Version: 0.11

License: GPL

Introduction

The Chinese Word Frequency Counter (CWFC) is a very small C program (around 150 lines of code) that counts how many times each word appears in a given Chinese text encoded in Big5. It is presented not as a full-featured application, but as a technical demonstration of my Chinese lexical scanner, CScanner. CScanner identifies words in strings of Chinese text, where word boundaries are usually absent. To use CWFC, you must also supply a Chinese lexicon for CScanner to use. Visit the CScanner website to find out where you can download a Chinese word list.
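To make the idea of lexicon-driven word identification concrete, here is a minimal sketch of forward maximum matching over Big5 bytes. It is not CScanner's code or API: the function names big5_char_len and next_word_len are hypothetical, the lexicon is assumed to be a plain array of Big5 strings, and a real scanner would likely use a faster lookup structure than this linear search.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte length of the Big5 character that starts at s (s must be non-empty).
   Assumption: lead bytes 0x81-0xFE begin a two-byte character; any other
   byte is treated as a single-byte (ASCII) character. */
static size_t big5_char_len(const char *s)
{
    assert(s != NULL && s[0] != '\0');
    unsigned char lead = (unsigned char)s[0];
    if (lead >= 0x81 && lead <= 0xFE && s[1] != '\0')
        return 2;
    return 1;
}

/* Forward maximum matching (hypothetical helper, not CScanner's API):
   return the byte length of the longest lexicon entry that is a prefix
   of text. If no entry matches, fall back to a single character. */
static size_t next_word_len(const char *text,
                            const char *const *lexicon, size_t n_words)
{
    size_t best = big5_char_len(text);
    for (size_t i = 0; i < n_words; i++) {
        size_t len = strlen(lexicon[i]);
        /* strncmp stops at the end of text, so short input never matches. */
        if (len > best && strncmp(text, lexicon[i], len) == 0)
            best = len;
    }
    return best;
}
```

A caller would start at the beginning of the text, take next_word_len bytes as one word, advance by that many bytes, and repeat until the end of the input.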

Usage

cwfc < input_file > output_file

Sample Input

sample input

Sample Output

sample output

Scalability

I tested CWFC with large files that produced word lists of well over 100,000 unique words. CScanner appeared to handle them safely and quickly, so I think CWFC can easily be extended to build a word frequency list of a realistic size (i.e., more than 100,000 words).
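For readers wondering how counts for that many distinct words might be kept, the following is one possible sketch: a chained hash table keyed by the word's bytes. This is an assumption for illustration only, not a description of how CWFC or CScanner stores its data; the names freq_table, freq_add, and N_BUCKETS are hypothetical, and FNV-1a is simply a convenient hash function chosen here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Power of two, chosen to exceed an assumed vocabulary of ~100,000 words. */
#define N_BUCKETS 262144u

struct entry {
    struct entry *next;   /* next entry in the same bucket */
    unsigned long count;  /* occurrences seen so far */
    size_t len;           /* word length in bytes */
    char word[];          /* the word's bytes, NUL-terminated */
};

struct freq_table {
    struct entry *buckets[N_BUCKETS];
};

/* 32-bit FNV-1a hash over raw bytes, so Big5 words need no decoding. */
static uint32_t fnv1a(const char *s, size_t len)
{
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 16777619u;
    }
    return h;
}

/* Record one occurrence of the len-byte word and return its updated
   count, or 0 if memory could not be allocated for a new entry. */
static unsigned long freq_add(struct freq_table *t,
                              const char *word, size_t len)
{
    assert(t != NULL && word != NULL);
    struct entry **slot = &t->buckets[fnv1a(word, len) & (N_BUCKETS - 1)];

    for (struct entry *e = *slot; e != NULL; e = e->next)
        if (e->len == len && memcmp(e->word, word, len) == 0)
            return ++e->count;

    struct entry *e = malloc(sizeof *e + len + 1);
    if (e == NULL)
        return 0;
    memcpy(e->word, word, len);
    e->word[len] = '\0';
    e->len = len;
    e->count = 1;
    e->next = *slot;   /* push onto the front of the bucket's chain */
    *slot = e;
    return 1;
}
```

The table is large, so it should be allocated on the heap and zeroed, for example with calloc(1, sizeof(struct freq_table)), before the first call to freq_add.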

Copyright Information

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Download

cwfc011.zip