CWFC - A Chinese Word Frequency Counter
Copyright © 2001-2006 Chih-Hao Tsai
The Chinese Word Frequency Counter (CWFC) is a very small C program (around 150 lines of code) that counts how many times each word appears in a given Chinese text encoded in Big5. It is presented not as a full-featured application, but as a technical demonstration of my Chinese lexical scanner, CScanner. CScanner identifies words in strings of Chinese text, where word boundaries are usually absent. To use CWFC, you must also supply a Chinese lexicon for CScanner to use. Visit the CScanner website to find out where you can download a Chinese word list.
cwfc < input_file > output_file
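The counting step itself can be illustrated with a short C sketch. This is not CWFC's source code, and it does not call CScanner (whose interface is not described here). It assumes the text has already been segmented into whitespace-separated words, and it stores counts in a chained hash table. The names count_word, lookup_count, NBUCKETS, and CWFC_SKETCH_MAIN are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sizing: a power of two comfortably above 100,000 words. */
#define NBUCKETS 262144u

struct entry {
    char *word;           /* private copy of the word's bytes */
    unsigned long count;  /* occurrences seen so far */
    struct entry *next;   /* next entry in the same bucket */
};

static struct entry *buckets[NBUCKETS];

/* FNV-1a over raw bytes. Big5 words hash fine because the bytes are
   never decoded, only compared. */
static uint32_t hash_bytes(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

/* Record one occurrence of word.
   Returns the word's new count, or 0 if memory ran out. */
unsigned long count_word(const char *word)
{
    struct entry *e;
    uint32_t i;

    assert(word != NULL);
    i = hash_bytes(word) & (NBUCKETS - 1u);
    for (e = buckets[i]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return ++e->count;

    e = malloc(sizeof *e);
    if (e == NULL)
        return 0;
    e->word = malloc(strlen(word) + 1);
    if (e->word == NULL) {
        free(e);
        return 0;
    }
    strcpy(e->word, word);
    e->count = 1;
    e->next = buckets[i];
    buckets[i] = e;
    return 1;
}

/* Return how many times word has been counted (0 if never seen). */
unsigned long lookup_count(const char *word)
{
    const struct entry *e;

    assert(word != NULL);
    for (e = buckets[hash_bytes(word) & (NBUCKETS - 1u)]; e != NULL; e = e->next)
        if (strcmp(e->word, word) == 0)
            return e->count;
    return 0;
}

#ifdef CWFC_SKETCH_MAIN
/* Filter: pre-segmented words on stdin, "word<TAB>count" lines on stdout. */
int main(void)
{
    char token[256];
    size_t i;

    while (scanf("%255s", token) == 1) {
        if (count_word(token) == 0) {
            fputs("out of memory\n", stderr);
            return EXIT_FAILURE;
        }
    }
    for (i = 0; i < NBUCKETS; i++) {
        const struct entry *e;
        for (e = buckets[i]; e != NULL; e = e->next)
            printf("%s\t%lu\n", e->word, e->count);
    }
    return EXIT_SUCCESS;
}
#endif
```

Compiled with CWFC_SKETCH_MAIN defined, the sketch runs as a filter in the same way as the command line above. The output order follows the hash table layout, so it is not sorted.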
I tested CWFC with large files that produced word lists of well over 100,000 unique words, and CScanner handled them safely and quickly. I therefore think CWFC can easily be extended to build a word frequency list of realistic size (i.e., more than 100,000 words).
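One possible extension of this kind is to order the finished list by frequency. The following is a minimal sketch of that idea, assuming the counts have already been collected into an array. The struct word_freq type, the sort_by_count function, and the sample data are hypothetical and are not part of CWFC.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type: one word and its number of occurrences. */
struct word_freq {
    const char *word;
    unsigned long count;
};

/* qsort comparator: higher counts first; ties broken by byte order of
   the word so that the result is deterministic. */
static int by_count_desc(const void *a, const void *b)
{
    const struct word_freq *x = a;
    const struct word_freq *y = b;

    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}

/* Sort list[0..n-1] in place, most frequent word first. */
void sort_by_count(struct word_freq *list, size_t n)
{
    assert(list != NULL || n == 0);
    if (n > 1)
        qsort(list, n, sizeof *list, by_count_desc);
}

#ifdef CWFC_SORT_SKETCH_MAIN
/* Demo with made-up ASCII stand-ins for words. */
int main(void)
{
    struct word_freq list[] = {
        { "w1", 3 }, { "w2", 12 }, { "w3", 3 }, { "w4", 7 }
    };
    size_t n = sizeof list / sizeof list[0];
    size_t i;

    sort_by_count(list, n);
    for (i = 0; i < n; i++)
        printf("%s\t%lu\n", list[i].word, list[i].count);
    return EXIT_SUCCESS;
}
#endif
```

For a list of around 100,000 entries, a single qsort call of this kind is a small cost compared with scanning the input text.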
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.