A List of Chinese Names

Published: 2000-03-21

Updated: 2012-08-06


Chinese names are unique in a number of respects.

As a result, identifying Chinese personal names (and other proper names as well) is a difficult task when tokenizing Chinese character strings. Many successful word identification systems rely on specific algorithms or name databases to handle the name identification problem. Even with the algorithmic approach, a good database is still needed to discover potentially unique characteristics that may be used in identifying names.
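To illustrate the database approach in its simplest form, here is a minimal Python sketch that scans a character string for substrings found in a set of known names, preferring the longest match at each position. It is not taken from any particular word identification system; the function name `find_known_names`, the length limits, and the sample names and sentence are all hypothetical.

```python
def find_known_names(text, known_names, min_len=2, max_len=4):
    """Return (start_index, name) pairs for substrings of `text` found in `known_names`."""
    hits = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first so a full name wins over a shorter prefix.
        for length in range(max_len, min_len - 1, -1):
            candidate = text[i:i + length]
            if len(candidate) == length and candidate in known_names:
                match = candidate
                break
        if match:
            hits.append((i, match))
            i += len(match)  # skip past the matched name
        else:
            i += 1
    return hits


# Hypothetical sample data, for illustration only.
known = {"陳怡君", "林志明"}
sentence = "昨天陳怡君和林志明去看電影"
print(find_known_names(sentence, known))
# → [(2, '陳怡君'), (6, '林志明')]
```

A pure lookup like this only finds names already in the database, which is why real systems usually combine it with algorithmic cues.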

It is not easy to construct a good database of personal names. Governments hold huge lists of names, but for privacy reasons ordinary people are unlikely to have access to them. Research institutes studying natural language processing may have their own databases, but these may not be open to the general public.

Fortunately, I was still able to find some very good lists of Chinese names.

In Taiwan, until 2002, high school graduates had to take the Joint College Entrance Examination (JCEE) if they wanted to go to college. Prior to 2002, admission to institutions of higher education was based solely on JCEE results. More recently, under a newly implemented system, students have more alternative routes into higher education, but the JCEE is still the dominant route.

Each year, more than 100,000 high school graduates take the exam. In the early 1990s, about 60% of the examinees passed. In recent years, nearly 90% have passed.

Each year, the JCEE committee not only notifies the individuals who passed the exam, but also publishes a list of the names of all successful examinees along with the institutions they were admitted to. Since 1994, electronic versions of the lists have also been made available. I was able to find the lists for 1994-2004 at http://www.csie.nctu.edu.tw/service/jcee/. In addition, I manually compiled the 2005-2012 lists from the department-based lists found at http://www.uac.edu.tw/.

Since the lists were in their raw form, pre-processing was needed to remove unnecessary information, such as the names of colleges and departments as well as the examinees' ID numbers. The 16 tidied-up lists were then merged into a single list, which contains 1,418,338 name tokens. Duplicates were then deleted to form another list, which contains 726,544 unique names.
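The merge and de-duplication steps can be sketched in Python as follows. This is a minimal illustration rather than the script actually used: the function name `merge_name_lists` and the sample names are hypothetical, and it assumes each yearly list has already been cleaned down to a plain sequence of name strings.

```python
from collections import Counter


def merge_name_lists(yearly_lists):
    """Merge per-year name lists; return (all tokens, unique names, per-name counts)."""
    # Flatten the yearly lists into one token list, dropping blank entries.
    tokens = [
        name.strip()
        for names in yearly_lists
        for name in names
        if name.strip()
    ]
    counts = Counter(tokens)
    # Counter keeps first-seen order, so this is the token list minus duplicates.
    unique = list(counts)
    return tokens, unique, counts


# Hypothetical sample data standing in for two cleaned yearly lists.
sample = [["陳怡君", "林志明"], ["陳怡君", "王雅婷 "]]
tokens, unique, counts = merge_name_lists(sample)
print(len(tokens), len(unique))  # → 4 3
print(counts["陳怡君"])           # → 2
```

Keeping the counts alongside the unique list costs little and shows how often each full name recurs across years.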

Finally, please note that the character encoding of the list is BIG5.
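For readers who want to work with the list in Unicode, the following Python sketch shows one way to decode BIG5 text and re-save it as UTF-8. It assumes one name per line, which is a guess about the file layout. The file names are hypothetical, and the demo writes its own small BIG5 file rather than reading the real archive.

```python
import tempfile
from pathlib import Path


def read_big5_names(path):
    """Read a BIG5-encoded text file and return its non-empty lines as strings."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising.
    with open(path, encoding="big5", errors="replace") as f:
        return [line.strip() for line in f if line.strip()]


# Demonstration on a small temporary file (assumed layout: one name per line).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "names_demo.txt"  # hypothetical file name
    demo.write_text("陳怡君\n林志明\n", encoding="big5")

    names = read_big5_names(demo)

    # Re-save as UTF-8 so other tools can read the list without knowing BIG5.
    utf8_copy = Path(tmp) / "names_demo_utf8.txt"
    utf8_copy.write_text("\n".join(names) + "\n", encoding="utf-8")
    round_trip = utf8_copy.read_text(encoding="utf-8").split()

print(names)  # → ['陳怡君', '林志明']
```

If some characters come out as replacement marks, trying Python's `cp950` or `big5hkscs` codecs may help, since files labelled BIG5 sometimes use one of those extensions.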


unique_names_2012.zip (1.56 MB)