UnicodeConverter is a Java program that converts text and HTML files in ISC, TCVN3 (ABC), VISCII, VNI, and VPS format to Unicode UTF-8. Conversion support for Unicode Composite, Numeric Character References (NCR), and VIQR (Vietnet) is also included. In all cases, the output is in Unicode Normalization Form C (NFC), better known as Unicode precomposed format.
UnicodeConverter, executable in both graphical user interface (GUI) and command-line modes, can convert multiple files in a directory, or an entire directory including its subdirectories. In effect, this capability enables conversion of an entire website to Unicode UTF-8 format with a single command or a few mouse clicks. Drag-and-drop support is also included.
Support for conversion of Word documents and Excel workbooks on the Windows platform is included. This feature is implemented using JACOB, a Java-COM Bridge that allows clients to call COM Automation components from Java. JACOB uses the Java Native Interface (JNI) to make native calls into the COM and Win32 libraries; consequently, the added functionality is neither portable nor available on other platforms. Conversion support for Rich Text Format files is also provided.
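As an illustration of what "precomposed" output means, the sketch below normalizes Unicode composite text to NFC and decodes numeric character references. It is not UnicodeConverter's own code: the class and method names are hypothetical, and it assumes Java 7 or later, since java.text.Normalizer is not available in the 1.4 runtime this program targets.

```java
import java.text.Normalizer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of UnicodeConverter.
class NfcSketch {
    // Matches decimal (&#7879;) and hexadecimal (&#x1EC7;) references.
    private static final Pattern NCR =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));");

    // Replaces each numeric character reference with the character it names.
    // A sketch only: out-of-range values are not handled gracefully.
    static String decodeNcr(String text) {
        Matcher m = NCR.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            String ch = new String(Character.toChars(codePoint));
            m.appendReplacement(out, Matcher.quoteReplacement(ch));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Combines base letters and combining marks into precomposed characters.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "e" + combining dot below + combining circumflex (composite form)
        String composite = "Vie\u0323\u0302t";
        String precomposed = toNfc(composite);
        System.out.println(composite.length() + " code units -> "
                + precomposed.length() + " code units");
        System.out.println(toNfc(decodeNcr("Vi&#7879;t")).equals(precomposed));
    }
}
```

In this example the three code units for the composite vowel collapse into the single precomposed character U+1EC7.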
UnicodeConverter is released and distributed under the GNU General Public License. Its homepage is at http://unicodeconvert.sourceforge.net.
You will need the Java 2 Runtime Environment, Standard Edition (JRE) 1.4 or later installed on your machine to run UnicodeConverter. The JRE can be downloaded free of charge from http://java.sun.com/j2se/. It consists of the Java virtual machine, the Java platform core classes, and supporting files that allow you to run applications written in the Java programming language.
On Mac OS X Tiger or Panther, UnicodeConverter runs without additional requirements. For Jaguar 10.2.6 or later, Java 1.4.1 Update 1 can be installed.
To be able to convert Word or Excel documents, you'll need to be on a Windows system with Microsoft Word or Excel installed. Put the file jacob.dll in your path, for example, into the system32 or jre/bin folder.
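If Word or Excel conversion fails to start, one quick check is whether jacob.dll can be found in the directories the JVM searches for native libraries. The sketch below is a hypothetical diagnostic, not part of UnicodeConverter; it assumes the DLL is located through the java.library.path system property, which on Windows normally includes the PATH directories.

```java
import java.io.File;

// Hypothetical diagnostic, not part of UnicodeConverter.
class JacobPathCheck {
    // Returns the first file named fileName found in searchPath, or null.
    static File findLibrary(String fileName, String searchPath) {
        if (searchPath == null) {
            return null;
        }
        String[] dirs = searchPath.split(File.pathSeparator);
        for (String dir : dirs) {
            if (dir.length() == 0) {
                continue; // skip empty path entries
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        File found = findLibrary("jacob.dll",
                System.getProperty("java.library.path"));
        System.out.println(found != null
                ? "Found " + found.getAbsolutePath()
                : "jacob.dll not found on java.library.path");
    }
}
```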
HOW TO RUN UnicodeConverter
UnicodeConverter is written in the Java language and packaged as an executable Java archive. Download and unzip UnicodeConverter-1.3.zip. UnicodeConverter.jar is the executable Java archive to run. You can launch the program in GUI mode either by double-clicking the UnicodeConverter.jar file or by executing the command uni at the command line. Alternatively, the longer commands
java -jar UnicodeConverter.jar
or (on Windows)
javaw -jar UnicodeConverter.jar
will work, too. The filename is case-sensitive on some operating systems. Be sure the directory that contains the UnicodeConverter.jar file is the current directory.
Note: It is recommended that Microsoft Word and Excel have no files open while you convert Word/Excel documents. Open files may cause errors or slow down the conversion process.
Tip: Minimize the number of text boxes within Word documents to a few; having too many will slow down conversion significantly.
You can select a single file, multiple files, or a directory d for conversion. The resulting Unicode output files will be placed in a d_Unicode directory located at the same tree level as the source directory that contains the original files, which remain unchanged. You can also drag files or a directory from the native file manager and drop them onto the application window to start the conversion.
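The naming rule for the output directory can be sketched as follows. This is an illustration of the rule as described above, not the program's actual code; the class and method names are hypothetical, and it covers only the case where a directory is selected.

```java
import java.io.File;

// Hypothetical illustration of the d -> d_Unicode naming rule.
class OutputDirSketch {
    // Returns a sibling of sourceDir whose name has "_Unicode" appended.
    static File unicodeOutputDir(File sourceDir) {
        File parent = sourceDir.getAbsoluteFile().getParentFile();
        return new File(parent, sourceDir.getName() + "_Unicode");
    }

    public static void main(String[] args) {
        // "website" is a made-up directory name used for demonstration.
        File source = new File("website");
        System.out.println(source.getAbsolutePath()
                + " -> " + unicodeOutputDir(source).getPath());
    }
}
```

Because the output lands beside the source directory rather than inside it, the original files are never overwritten.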
The program can also function as a command-line program, which is frequently used in batch file processing:
java -jar UnicodeConverter.jar <SourceEncoding> <SourceFile/Dir> <TargetFile/Dir>
where possible options for source encoding are VNI, VISCII, VPS, VIQR, TCVN3, and UNI-COMP. This functionality works for text-based files only, not Word/Excel documents.
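For batch processing driven from another Java program, the command line above could be assembled and launched roughly as shown below. This is a sketch under stated assumptions: the class name and the "site" and "site_Unicode" paths are hypothetical, the encoding names are taken from the list above and assumed to be passed exactly as written, and Java 7 or later is assumed for ProcessBuilder.inheritIO.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the documented command line.
class BatchCommandSketch {
    // Source encodings listed in this README for command-line mode.
    static final List<String> ENCODINGS = Arrays.asList(
            "VNI", "VISCII", "VPS", "VIQR", "TCVN3", "UNI-COMP");

    // Builds: java -jar UnicodeConverter.jar <encoding> <source> <target>
    static List<String> buildCommand(String encoding, String source, String target) {
        if (!ENCODINGS.contains(encoding)) {
            throw new IllegalArgumentException("Unsupported source encoding: " + encoding);
        }
        List<String> command = new ArrayList<String>();
        command.add("java");
        command.add("-jar");
        command.add("UnicodeConverter.jar");
        command.add(encoding);
        command.add(source);
        command.add(target);
        return command;
    }

    public static void main(String[] args) throws Exception {
        // Must be run from the directory that contains UnicodeConverter.jar.
        List<String> command = buildCommand("VNI", "site", "site_Unicode");
        Process process = new ProcessBuilder(command).inheritIO().start();
        System.exit(process.waitFor());
    }
}
```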
Unicode composite (UNI-COMP) source text files should be saved in UTF-8 format for correct conversion to Unicode precomposed.
The default fonts for the output UTF-8 HTML files are Times New Roman and Arial. Users can change to other Unicode-compliant fonts using Unicode-compatible HTML editors such as FrontPage or Composer. Do not use Unicode-incompatible editors (such as Notepad on Windows 9x/Me) to edit UTF-8 files. Doing so would corrupt the UTF-8 byte sequences, rendering the characters or the file unreadable.
Use Firefox, Netscape, Internet Explorer (Windows), Opera, Mozilla, Safari, OmniWeb, or Chimera web browsers to view UTF-8 HTML files. You will not need to change their default settings; the character-set meta tag in the converted files tells the browsers to use Unicode UTF-8 character encoding in displaying the page.
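The kind of damage a Unicode-incompatible editor can do is illustrated below. The sketch simulates two failure modes using ISO-8859-1 as a stand-in for a legacy single-byte code page; the class and method names are hypothetical, Java 7 or later is assumed, and a real editor may behave differently in detail.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration of UTF-8 corruption by a legacy editor.
class Utf8CorruptionSketch {
    // The editor reads UTF-8 bytes as if each byte were one Latin-1 character.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The editor saves in Latin-1; characters outside that set become '?'.
    static String saveAsLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "Vi\u1EC7t"; // contains the precomposed letter U+1EC7
        System.out.println(misreadAsLatin1(original).length()); // 6 characters instead of 4
        System.out.println(saveAsLatin1(original));             // Vi?t
    }
}
```

In the first case the three UTF-8 bytes of one Vietnamese letter are shown as three unrelated characters; in the second the letter is lost entirely and cannot be recovered from the saved file.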
FILE PREPARATIONS FOR CONVERSION
To ensure successful conversion of HTML files in legacy formats and to minimize post-conversion editing, some pre-conversion conditioning of the source files may be needed. If a document uses uncommon fonts, change them to the more common fonts for its original encoding (see table below). Removing obsolete dynamic font links (.pfr or .eot) and associated ActiveX control scripts (e.g., tdserver.js) is also recommended, because leaving them in needlessly slows down page download.
These basic editing tasks should be done before the actual conversion and can be performed quickly with MDI (multiple document interface) text editors, which allow opening multiple files and performing global find/replace actions on all open files at once. CuteHTML, TextPad, UltraEdit, EditPlus, and EditPad are some text editors with such features. They can be found and downloaded from http://www.download.com.
Source Encoding    Fonts for original HTML documents
VNI                VNI-Times, VNI Times, VNI-Aptima, VNI Aptima, VNI-Helve, VNI Helve
VPS                VPS Times, VPS Helv
VISCII             VI Times, VI Arial, HoangYen, MinhQu, PhuongThao, ThaHuong, UHo
TCVN3              .VnTime, .VnTimeH, .VnArial, .VnArialH
VIQR               No font formatting
Note: Due to the nature of TCVN3 encoding, some Vietnamese capital vowels will be converted incorrectly to lower case. Some post-conversion editing may be necessary.
Windows 95/98/Me have only limited Unicode support, but they can still display all Vietnamese characters using appropriate Unicode fonts. Full Unicode support is built into Windows NT/2000/XP. Linux and Mac OS 8.5 or later have begun to provide support for Unicode. Mac OS X and Palm OS provide full Unicode support.
The following TrueType fonts, which come supplied with Windows 98SE/Me/2000/XP, contain many Unicode characters, including Vietnamese:
Times New Roman, Courier New, Arial, Tahoma, Verdana, Palatino Linotype
This list of Unicode fonts is by no means comprehensive, as more and more fonts are being commercially developed or expanded to include Unicode characters.
Requirements:
· Java 1.4.2 or later
What's New in This Release:
· Refactored using Design Patterns to improve code reusability, program extensibility and maintainability
· Updated JACOB library to version 1.9.1