uni2ascii

4.18 GPL v3    

uni2ascii and ascii2uni convert between UTF-8 Unicode and any of a variety of 7-bit ASCII equivalents, including hexadecimal and decimal HTML numeric character references, \u-escapes, standard hexadecimal, and raw hexadecimal.

Such ASCII equivalents are useful when including Unicode text in program source, when entering text into Web programs that can handle the Unicode character set but are not 8-bit safe, and when debugging.

The Unicode escapes available are:

- HTML hexadecimal numeric character references (e.g. &#x00E9;)
- HTML decimal numeric character references (e.g. &#0233;)
- \u-escapes, as used in Python (e.g. \u00E9)
- \u-escapes within the BMP and \U-escapes beyond the BMP (e.g. \u00E9 but \U00010024)
- U+-escapes (e.g. U+00E9)
- U-escapes (e.g. U00E9)
- u-escapes (e.g. u00E9)
- U-escapes within angle brackets (e.g. <U00E9>)
- \x-escapes (e.g. \x00E9)
- \x-escapes with braces (e.g. \x{00E9})
- Standard hexadecimal (e.g. 0x00E9)
- Raw hexadecimal (e.g. 00E9)
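
To make the differences between these spellings concrete, here is a minimal Python sketch that derives a few of them from a single character. It is an illustration only, not the code uni2ascii uses; the function name `escape_forms`, its `upper` parameter, and the dictionary keys are all hypothetical.

```python
def escape_forms(ch: str, upper: bool = True) -> dict:
    """Return several 7-bit ASCII spellings of one Unicode character.

    Illustrative only: the names and the choice of formats are assumptions,
    not part of the uni2ascii package.
    """
    cp = ord(ch)
    # At least four hex digits, in the requested letter case.
    hex4 = format(cp, "04X" if upper else "04x")
    # Eight hex digits, used for the \U form beyond the BMP.
    hex8 = format(cp, "08X" if upper else "08x")
    return {
        "html_hex": f"&#x{hex4};",
        "html_dec": f"&#{cp:04d};",
        # \u inside the Basic Multilingual Plane, \U beyond it.
        "backslash_u": f"\\u{hex4}" if cp <= 0xFFFF else f"\\U{hex8}",
        "backslash_x_braces": f"\\x{{{hex4}}}",
        "standard_hex": f"0x{hex4}",
        "raw_hex": hex4,
    }


if __name__ == "__main__":
    for name, spelling in escape_forms("\u00e9").items():
        print(f"{name:20} {spelling}")
```

The `upper` switch mirrors the idea that some consumers accept only one letter case for hexadecimal digits.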

uni2ascii accepts a command-line flag determining whether to generate upper-case A-F or lower-case a-f as hexadecimal digits, since some programs accept only one or the other. ascii2uni accepts either.

By default, uni2ascii converts only characters outside the ASCII range. Even if ASCII characters are also converted, newlines are preserved unless their conversion is explicitly requested. Space characters are likewise preserved unless conversion is explicitly requested. If space characters are not converted, the three non-ASCII space characters (Ethiopic word space, Ogham space, and ideographic space) are replaced with ASCII space (0x20) so as to keep the output within the 7-bit ASCII range.
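
The default policy described above can be sketched in a few lines of Python. This is an approximation written for illustration, not the program's actual logic: the function name `to_ascii`, the `upper` parameter, and the choice of backslash-u output are assumptions.

```python
# The three non-ASCII space characters named in the text:
# Ethiopic word space, Ogham space mark, ideographic space.
NON_ASCII_SPACES = {"\u1361", "\u1680", "\u3000"}


def to_ascii(text: str, upper: bool = True) -> str:
    """Escape only non-ASCII characters; keep ASCII, newlines and spaces.

    Hypothetical helper that mimics the default behaviour described above.
    The backslash-u output format is an arbitrary choice for this sketch.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in NON_ASCII_SPACES:
            # Spaces are not converted, so fold these to ASCII space (0x20).
            out.append(" ")
        elif cp < 0x80:
            # ASCII, including newline and space, passes through unchanged.
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u" + format(cp, "04X" if upper else "04x"))
        else:
            out.append("\\U" + format(cp, "08X" if upper else "08x"))
    return "".join(out)


if __name__ == "__main__":
    print(to_ascii("caf\u00e9 au\u3000lait\n"), end="")
```

Every character of the result is below 0x80, which is the property the replacement of non-ASCII spaces is meant to preserve.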

This package contains four programs. The main program is uni2ascii. It is written in C and must be compiled. The second, uni2html.py, is the predecessor to uni2ascii. As it is written in Python, it does not need to be compiled and should run on just about any current computer. uni2ascii is otherwise superior in that:

- It generates a wider range of output formats.
- It is approximately 20 times faster.
- It handles input in the full 32-bit Unicode range. In contrast, uni2html.py handles only the Basic Multilingual Plane (Plane 0) because at present Python represents Unicode-encoded text internally using 16-bit integers. If you've got text in, say, Linear B or Ugaritic, you need uni2ascii.
- It does a better job of reporting errors. If it encounters an error in its input, such as malformed UTF-8, it reports the location of the error both in terms of the character count from the beginning of the file (starting at 0) and in terms of the byte count from the beginning of the file (also starting at 0). (Character counts and byte counts are generally not the same, since a UTF-8 encoded character occupies from one to four bytes.) The Python version reports only the character count. uni2ascii also provides information about the nature of the error.
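
The distinction between character offsets and byte offsets can be shown with a short Python sketch. It relies on the standard `UnicodeDecodeError` attributes `start` and `reason`; the function name `find_utf8_error` is hypothetical, and this is not how uni2ascii itself is implemented.

```python
def find_utf8_error(data: bytes):
    """Locate the first malformed UTF-8 sequence in data.

    Returns None if the input is valid, otherwise a tuple
    (char_offset, byte_offset, reason), both offsets counted from 0.
    Illustrative helper only; not part of the uni2ascii package.
    """
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        byte_offset = err.start
        # Everything before the first error is valid UTF-8, so decoding
        # that prefix gives the number of complete characters before it.
        char_offset = len(data[:byte_offset].decode("utf-8"))
        return char_offset, byte_offset, err.reason


if __name__ == "__main__":
    # "é" occupies two bytes, so the stray 0xFF byte is the third
    # character position (offset 2) but the fourth byte (offset 3).
    sample = "\u00e9a".encode("utf-8") + b"\xff" + b"b"
    print(find_utf8_error(sample))
```

Because the sample begins with a two-byte character, the two offsets differ by one, which is the situation the parenthetical remark above describes.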

The third program, ascii2uni, is the inverse of uni2ascii. It accepts text containing a variety of ASCII representations of Unicode characters and generates UTF-8 Unicode.

The fourth program, ascii2uni.py, reads 7-bit ASCII containing \u-escaped Unicode, as used in Python and Tcl, and converts it to UTF-8 Unicode. It is the original program of which ascii2uni is a generalization.
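
The reverse direction, for backslash-u escapes of the kind Python uses, might look roughly like the following. This is a sketch under stated assumptions rather than the code of ascii2uni.py: the function name `unescape_to_utf8` and the regular expression are hypothetical, and only the four-digit \u and eight-digit \U forms are handled.

```python
import re

# Four hex digits after \u, or eight after \U; either letter case is accepted.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_to_utf8(text: str) -> bytes:
    """Replace backslash-u escapes in ASCII text and encode as UTF-8.

    Hypothetical helper for illustration. Values above 0x10FFFF make
    chr() raise ValueError, and lone surrogates fail at encode time.
    """
    def replace(match):
        digits = match.group(1) or match.group(2)
        return chr(int(digits, 16))

    return _ESCAPE.sub(replace, text).encode("utf-8")


if __name__ == "__main__":
    print(unescape_to_utf8("caf\\u00E9"))
```

Text that contains no escapes passes through unchanged, so plain ASCII input yields the same bytes on output.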
Last updated on May 16th, 2011
