CScanner - A Chinese Lexical Scanner

Published: 2000-03-10

Updated: 2001-04-07

Version: 0.95

License: GPL

Introduction

Lexical scanner is a function that finds the next token in an input stream and identifies its type. For alphabetical programming/natural languages, the token boundaries are usually present in the text as spaces. Therefore, identifying tokens is not difficult. Once a token is identified or in the process of identification, a small set of rules is usually sufficient in unambiguously identifying its type. For the Chinese language, however, the scenario becomes quite different. In Chinese text, all characters are spaced evenly. Although the majority of Chinese words are multi-character words, word boundaries are nevertheless not marked by extra spaces. Consequently, finding essential linguistic tokens--the words--becomes a non-trivial task.

CScanner is a Chinese (traditional Chinese; BIG5) lexical scanner based on two variants of the maximum matching heuristic of word identification. The first one is the forward maximum matching algorithm, which is among the easiest word identification algorithms to implement, yet yields considerably high accuracy in identification. The second one is a set of four algorithms (three of which were originally published by Chen and Liu, 1992) which uses a larger context of three words (instead of the no context situation of forward maximum matching) in evaluating the most likely first word. If you need more information on word identification issues in Chinese or want to understand more about the algorithm used in CScanner, you may want to visit my MMSEG page.

CScanner is written in ANSI C.

Version 0.26 is still more like a prototype. It is probably the first of its kind that a lot of functions and features need to be experimenting with. At the initial stage, the focus is more on implementing the basic interface and functions of the scanner than on accuracy of identification, specific features, or efficiency of time and space.

Types of Tokens

/* Token types */
typedef enum
  {
    UNRECOGNIZED, ASCII_WORD, ASCII_SYMBOL, ASCII_8BIT, 
      BIG5_WORD, BIG5_SYMBOL, BIG5_UNDEFINED
  }
Tokentype;

ASCII_WORD (1): Any token consisting of only letters and digits. Examples: hello, world, 749, y2k, win98 ....

ASCII_SYMBOL (2): Any 7-bit ASCII printable character that is neither a letter nor digit.

ASCII_8BIT (3): Any 8-bit ASCII character that is not identified as part of a BIG5 character.

BIG5_WORD (4): Any Chinese word identified by the word identification algorithm. If a Chinese character is not listed in the lexicon, the character itself is treated as a word.

BIG5_SYMBOL (5): Any BIG5 code that represents digits, punctuation marks, or other symbols but does not represent a real Chinese character.

BIG5_UNDEFINED (6): Any legal BIG5 code with no corresponding defined character in the standard implementation of the BIG5 coding scheme.

The Lexicon Object

/* Definition of a lexicon */
typedef struct lexicon Lexicon;
struct lexicon
  {
    long int size;      /* total number of words */
    Skiplists Word;     /*  an ordered list of words (skip lists) */
    unsigned char *charfreq;    /* an array of character frequencies */
  };

Lexicon Public Functions

/* Lexicon functions */
Lexicon *c_scanner_load_lexicon (const char *filename);
char *c_scanner_search_lexicon (const char *key, const Lexicon *L);
void c_scanner_free_lexicon (Lexicon *L);

c_scanner_load_lexicon() loads the lexicon and sort it if necessary, returns a pointer to the lexicon.

c_scanner_search_lexicon() searches a word in the lexicon, returns NULL if not found, else returns a pointer to the lexical entry found.

c_scanner_free_lexicon() destroys the lexicon object.

The CScanner Object

/* Definition of a CScanner object */
typedef struct cscanner CScanner;
struct cscanner
  {
    const char *text;   /* text to be scanned */
    unsigned int length;    /* how far the scanner will go */
    int token_length_max;   /* Maximum length of a token */
    int chinese_word_length_max;    /* Maximum length of a Chinese word */
    const Lexicon *lexicon; /* pointer to the lexicon */
    char token[TOKEN_LENGTH_MAX + sizeof (unsigned char)]; /* current token */
    Tokentype type;     /* current token type */
    unsigned int token_position;    /* position of current token */
    unsigned int position;  /* starting position of next scan */
  };

CScanner Public Functions

/* Main CScanner functions */
CScanner *c_scanner_new (Lexicon *L);
int c_scanner_input_text (CScanner *scanner, const char *text, long int length);
unsigned char *c_scanner_get_token (CScanner *scanner, int complexity);
void c_scanner_free (CScanner *scanner);

c_scanner_new() creates a CScanner object, associates an existing lexicon created by c_scanner_load_lexicon() with the CScanner object, and returns a pointer to the CScanner object.

c_scanner_input_text() associates the input text with the existing CScanner object, and specifies the range of the text to be scanned.

c_scanner_get_token() scans the scanner object for the first token available. If found, returns a pointer to the token, else return NULL.

c_scanner_free() destroys an existing CScanner object created by c_scanner_new().

BIG5 Handling Helpers

/* BIG5 handling functions */
int is_big5_leading_char (unsigned char);
int is_big5_trailing_char (unsigned char);
int is_big5_chinese_char (unsigned char, unsigned char);
int is_big5_symbol (unsigned char, unsigned char);
int is_big5_undefined_char (unsigned char, unsigned char);

Tips

With multiple CScanner objects, you can scan multiple inputs concurrently.

With multiple Lexicon objects, you can use multiple lexicons concurrently.

A CScanner object can be reused to scan different inputs with the same lexicon.

The Lexicon

The distribution of CScanner 0.26 does not include a lexicon. You can use whatever lexicon or word list you have as long as its format meets the requirement described in Section 4.1. If you do not have one, here are my suggestions.

Go to A Review of Chinese Word Lists Accessible On the Internet website and download Tsai's List of Chinese Words (tsaiword.zip) or other BIG5 encoded lists.

Go to Libtabe's website and download the latest release of libtabe. When decompressed, go to "libtabe/src/tsi-src" directory. You should see a file named "tsi.src", and that is the list you want. Be sure to read "Readme" and "Copying" in the same directory for copyright information regarding using this lexicon.

An Example

#include <stdio.h>
#include <stdlib.h>
#include "cscanner.h"


int
main (void)
{
  Lexicon *L;
  CScanner *scanner;
  char *token;
  char text[4096];

  fprintf (stderr, "Please wait while CScanner loading the lexicon.\n");

  L = c_scanner_load_lexicon ("../../libtabe/src/tsi-src/tsi.src");

  if (L)
    fprintf (stderr, "%d entries loaded.\n\n", L->size);
  else
    {
      fprintf (stderr, "Failed to load lexicon.\n");
      exit (1);
    }

  scanner = c_scanner_new (L);

  while (fgets (text, 4096, stdin))
    {
      c_scanner_input_text (scanner, text, 0);
      printf ("\nScanning the following string:\n\n%s\n", text);
      while ((token = c_scanner_get_token (scanner, 1)))
        printf ("\tposition = %u\ttype = %d\ttoken = %s\n", 
          scanner->token_position, scanner->type, token);
      printf ("\n\n");
    }

  c_scanner_free (scanner);
  c_scanner_free_lexicon (L);

  exit (0);
}

Future Development

The capability of handling ASCII tokens needs to be enhanced.

References

Chen, K. J., & Liu, S. H. (1992). Word identification for Mandarin Chinese sentences. Proceedings of the Fifteenth International Conference on Computational Linguistics, Nantes: COLING-92.

Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668-676.

Tsai, C. H. (1996). MMSEG: A word identification system for Mandarin Chinese text based on two variants of the maximum matching algorithm [On-line]. Available: http://technology.chtsai.org/mmseg/.

Copyright Information

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Download

cscanner026.zip

Previous versions: cscanner025.zip

Change Log

2001-04-07: Version 0.26.

  1. The data structure for the Lexicon object's word list has been changed from a simple array to skip lists. Skip lists are a relatively new data structure introduced in 1990 by Bill Pugh in Communications of the ACM. It is a probabilistic alternative to the widely used balanced binary trees.

    Pugh, W. (1990). Skip lists: A probabilistic alternative to balanced trees. Communications of the ACM, 33(6), 668-676.

    A PDF version of Pugh (1990) is available at:
    ftp://ftp.cs.umd.edu/pub/skipLists/skiplists.pdf

    Sample code is available at:
    ftp://ftp.cs.umd.edu/pub/skipLists/

    My implementation is basically based on the C sample code skipLists.c provided by Professor Pugh at the above-mentioned site. What I have done was mainly to augment the original code to make it capable of handling arbitrary data types of key and value. In addition, I have also eliminated global variables and made quite a few stylish changes to improve the usability and readability.

2000-03-14: Version 0.25.

  1. Incorporated changed made by William Yeh (pjyeh@cis.nctu.edu.tw).
  2. Minor stylist changes to clean up the code a little bit.

2000-03-14: William Yeh (pjyeh@cis.nctu.edu.tw)

  1. Added const qualifiers to the variables in the definition of the CScanner object and to the variables in the declarations and definitions of c_scanner_new() and c_scanner_input_text(). Stylist changes to make the source confirm more strictly to the ANSI C standard.
  2. Modified the definition of the CScanner object and the c_scanner_new() code to remove redundant code. Stylist changes to make the code more readable.
  3. Added include guard in the header file to avoid the nested include "cscanner.h" problem.
  4. Added "extern C" device into cscanner.h to make C and C++ share the same object code.
  5. Moved the four "#include" lines from cscanner.h to cscanner.c to reduce the external dependency of cscanner.h and thus make the interface cleaner.

2000-03-14: Version 0.24.

  1. Incorporated changes made by William Yeh (pjyeh@cis.nctu.edu.tw).
  2. Replaced ASCII_ALPHANUM with ASCII_WORD and BIG5_CHINESE with BIG5_WORD. Stylist changes to make token types for the two character sets more symmetrical.

2000-03-13: William Yeh (pjyeh@cis.nctu.edu.tw)

  1. Added const qualifiers to the variables in the declarations and definitions of c_scanner_load_lexicon() and c_scanner_search_lexicon(). Stylist change to make the source confirm more strictly to the ANSI C standard.
  2. Adjusted the orders of if-then judgments in c_scanner_load_lexicon(). Stylist change to make the code more readable.
  3. Replaced the CScanner built-in heapsort and binary search functions with qsort() and bsearch() functions in the ANSI C standard library. A major change to improve the sorting speed and a stylist change to make the source more readable.
  4. Modified c_scanner_free_lexicon() to make the memory allocated for the lexicon correctly freed. A major change fixed a serious bug in the old code.

2000-03-12: Version 0.22.

  1. Used recursion, instead of three loops in version 0.20, in implementing the set of complex word identification algorithms.
  2. The format of the character frequency table charfreq.c changed from 13,060 comma-delimited numbers to a single string of 13,060 characters long.

2000-03-12: Version 0.20.

  1. Added the set of four "complex" algorithms used in MMSEG.
  2. Since this set of algorithms require character frequency information, a character frequency table is now supplied with CScanner.
  3. Although this is a significant change in the program, the interface has not changed too much. The only change is the c_scanner_get_token() function; a second argument was added for specifying the algorithm to be used.

2000-03-10: Version 0.10. First public release of CScanner. The forward maximum matching algorithm was implemented.