A Review of Chinese Word Lists Accessible on the Internet
Copyright © 1995-2006 Chih-Hao Tsai
Introduction
It is quite easy to find English word lists, or even English corpora, on the Internet. Chinese word lists, however, are much harder to come by. In this article, I introduce a few that I have found on the Internet. The comments on each list reflect my own subjective evaluation.
Chinese word lists are useful in various ways, particularly to people working in natural language processing, text processing, or computational linguistics.
Among the lists reviewed, two are distributed as Usenet news articles: the phonetic phrase database and the ETen lc*.tab conversion program. I suggest downloading them and storing them locally as soon as possible, since they are very likely to be removed in the future.
Review
libtabe lexicon
- Authors: Pai-Hsiang Hsiao, Chih-Hao Tsai, Tung-Han Hsieh, William Yeh, and Koan-Sin Tan
- Website: Libtabe Project
- Alternative Website: Libtabe @ XCIN
- Description: A list of Chinese words.
- Comments:
- Libtabe lexicon reviewing project: Human inspection and editing of all words. Participate in the project to help make the libtabe lexicon the best open-source Chinese electronic lexicon. Please visit the above-mentioned website and the Usenet newsgroup tlug.xcin for details.
- tsi.src contains 138,614 Chinese words compiled from three sources: (1) Tsai's list of Chinese words, (2) IOME, and (3) Xcin.
- 138,614 entries.
- BIG5 code.
- Taiwanese/Mainland Mandarin vocabulary.
- Including zhuyin and word frequency information for most entries.
Tsai's list of Chinese words
- Author: Chih-Hao Tsai
- Download: tsaiword.zip (445KB)
- Description: A list of Chinese words.
- Comments:
- I spent a few days integrating the following five lists (pd-phrases, phrases, phonetic phrase database, cihui dictionary, and ETen lcphrase.tab & lcword.tab) into a unified, large Chinese word list. The file "tsaiword.txt" is a text file consisting of 137,450 unique entries. The entries are in BIG5 code and pre-sorted. Each entry ends with a CR-LF. For more information, see readme.txt.
- BIG5 code.
- Taiwanese/Mainland Mandarin vocabulary.
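The kind of integration described above (several lists merged into one set of unique, sorted, CR-LF-terminated BIG5 entries) can be sketched in Python. This is only a minimal illustration, not the procedure actually used to build tsaiword.txt. It assumes every input file has already been reduced to one word per line in BIG5, and that sorting by Unicode code point is acceptable; the sort order of the real file is not documented here. The function names are hypothetical.

```python
def read_word_list(path, encoding="big5"):
    """Return the non-empty entries of a one-word-per-line text file."""
    # Text mode with the default newline handling accepts CR-LF and LF alike.
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]


def merge_word_lists(lists):
    """Union several word lists, drop duplicates, and sort the result."""
    unique = set()
    for words in lists:
        unique.update(words)
    # Assumption: code point order; the original file may be sorted differently.
    return sorted(unique)


def write_word_list(words, path, encoding="big5"):
    """Write one entry per line, each terminated by CR-LF."""
    # newline="\r\n" makes Python translate every "\n" written into CR-LF.
    with open(path, "w", encoding=encoding, newline="\r\n") as f:
        for word in words:
            f.write(word + "\n")
```

To use it, one would read each source list with read_word_list, pass the results to merge_word_lists, and save the output with write_word_list. Files containing vendor-specific BIG5 extensions may fail to decode with Python's "big5" codec; "cp950" is worth trying in that case.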
ezinput big lexicon
- Author: Qingsong Information
- Download: Qingsong Information ("Qingsong Input Method Big Lexicon" Public License)
- Description: A list of Chinese words and their ezinput code.
- Comments:
- EZBIG.tit contains 120,000+ Chinese words and their Qingsong (easy) input code.
- 120,000+ entries.
- BIG5 code.
- Taiwanese Mandarin vocabulary.
- Including ezinput code for each entry.
pd-phrases
- Provider: Chinese Community Information Center
- Download: ftp://ftp.ccu.edu.tw/pub/chinese/data/pd-phrases.b5.gz
- Local copy: pd-phrases.zip (290KB)
- Description: A list of Chinese words.
- Comments:
- 89,068 entries.
- BIG5 code.
- Mainland Mandarin vocabulary.
phrases
- Provider: Chinese Community Information Center
- Download: ftp://ftp.ccu.edu.tw/pub/chinese/data/phrases.dat
- Local copy (GB): phrases.zip (148 KB)
- Local copy (BIG5): phrasesb5.zip (159 KB)
- Description: A list of Chinese words.
- Comments:
- 30,001 entries.
- GB code.
- Mainland Mandarin vocabulary.
- Including syntactic category and word frequency.
phonetic phrase database
- Provider: Kai-Hsu Tai
- Download: dphphdb.zip (612KB)
- Description: A list of Chinese words.
- Comments:
- 77,322 entries.
- BIG5 code.
- Taiwanese Mandarin vocabulary.
- Phonetically (hanyu pinyin) ordered.
cihui dictionary
- Author: John Delacour
- Download: http://bd8.com/cihui/
- Description: A list of Chinese words.
- Comments:
- 80,193 entries.
- BIG5 code.
- (Mostly) Taiwanese Mandarin vocabulary.
- Phonetically (hanyu pinyin) ordered.
- Including a string of tonal pinyin for each word.
ETen lcphrase.tab & lcword.tab conversion program
- Author: Yi-Ru Li
- Download (Usenet article): rux.html (4KB)
- Description: A program converting lcword.tab and lcphrase.tab of ETen Chinese system to a list of words.
- Comments:
- About 26,991 entries will be converted by the program.
- C program.
- BIG5 code.
- Taiwanese Mandarin vocabulary.
duoyuanpinyin ciku for richwin
- Author: Jimmy
- URL: ftp://137.132.134.150/C/vv/richwin97-duoyuanpinyin-ci-ku-120300words/Wordlist.txt.zip
- Local copy: duoyuanpinyin.zip (417KB)
- Description: A list of Chinese words.
- Comments:
- 120,300 entries.
- GB code.
- Mainland Mandarin vocabulary.
- Phonetically (hanyu pinyin) ordered.