B5TOUNI - A BIG5 to Unicode SGML Character Reference Converter

Published: 2001-04-13

Document updated: 2001-12-10

Version: 0.10

License: GPL

Introduction

B5TOUNI is a small tool that converts Big5 characters in a text document into their corresponding Unicode characters in the form of SGML character references. I use it when I need to include a few Chinese characters in a web page whose main language is English. Since an HTML document can be labeled with only one character encoding, an English web page containing even a single Big5 character must be labeled as a Big5 document rather than a pure ASCII document. Such labeling is unsatisfying, however, and would confuse many web surfers.

Most of the time I simply use graphics for displaying Chinese characters, thus keeping the encoding of web pages pure ASCII. However, there are some situations where actual codes are preferred to graphics. My Zhuyin, Hanyu Pinyin, Tongyong Pinyin Cross-reference Table is such an example. The zhuyin syllables in the table are not just examples or demonstrations; they convey important information themselves. As a result, the best way is to use SGML character references to represent those zhuyin symbols. I first created a Big5 version of that page, and then used B5TOUNI to convert it into a document in pure ASCII encoding. (Note: As of September 2005, the encoding of the pinyin pages has been converted to UTF-8.)

The "BIG5 to Unicode Mapping Table" from the Unicode Consortium is required for B5TOUNI to work.

Please note that characters not listed in the mapping table are not converted by B5TOUNI and are output as their raw codes. As a result, if your documents contain such unmapped characters, you may want to use HTML-Tidy from the World Wide Web Consortium (W3C) to clean up those non-ASCII characters by converting them into Latin-1 SGML character references.
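The conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual B5TOUNI source. It assumes that each line of the mapping table has the form "0xA374<TAB>0x3105<TAB># comment", that references are written in decimal (for example &#12549;), and that a Big5 character is a lead byte in 0x81..0xFE followed by a trail byte in 0x40..0x7E or 0xA1..0xFE. The function names load_mapping and convert are hypothetical.

```python
def load_mapping(lines):
    """Build a {big5_code: unicode_code_point} dict from mapping-table lines.

    Assumed line format: two hexadecimal fields, with '#' starting a comment.
    """
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, comment-only line, or incomplete entry
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


def is_trail_byte(byte):
    """Assumed Big5 trail-byte ranges."""
    return 0x40 <= byte <= 0x7E or 0xA1 <= byte <= 0xFE


def convert(data, table):
    """Replace mapped Big5 byte pairs in `data` with decimal references.

    Byte pairs that are not in the table are copied through unchanged.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if 0x81 <= byte <= 0xFE and i + 1 < len(data) and is_trail_byte(data[i + 1]):
            code = (byte << 8) | data[i + 1]
            if code in table:
                out += b"&#%d;" % table[code]
            else:
                out += data[i:i + 2]  # unmapped: keep the raw bytes
            i += 2
        else:
            out.append(byte)  # ASCII or stray byte: copy as-is
            i += 1
    return bytes(out)
```

With a table loaded from the single illustrative line above, convert(b"a\xa3\x74b", table) yields b"a&#12549;b", while a byte pair that is missing from the table comes out unchanged.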

Usage

If you are running B5TOUNI in the same directory as the mapping table big5.txt, simply type:

b5touni < input_file > output_file

However, if the mapping table is in a different directory, say, "c:\unicode", you need to type:

b5touni c:\unicode\big5.txt < input_file > output_file

Copyright Information

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Download

b5touni.zip