Unicode


Unicode is an extension of ASCII to a much larger character set. ASCII is a seven-bit encoding and hence has 128 character codes, not all of which correspond to printable graphics. Unicode exists in two versions, UCS-2 and UCS-4. UCS-2 is a 16-bit encoding and hence has 65,536 (64K) available character codes. UCS-4 is a 32-bit encoding and hence has roughly four billion available character codes. The assignment of codes to characters is an ongoing international standards activity.

Unicode has character codes not only to represent graphics, but also for other uses, such as reversing the presentation direction for languages from the Middle East. The concept of an identifier is broadened to include a rich set of printable characters plus a set of ignorable characters.

The most current Unicode tables are available (at a price) from the standards organization. Information is available at the URL:

http://www.cm.spyglass.com/unicode/uni2book/u2.html

UTF-8 is yet another Unicode standard. It can represent any UCS-2 character, encoding 16 bits as one, two, or three bytes. A given value must be encoded with the shortest possible UTF-8 sequence, which leaves some two- and three-byte sequences unused.

The one-byte codes are distinguished by a zero in the leftmost bit, leaving only seven bits for information. The first byte of a two-byte sequence is distinguished by the leading bit pattern 110. The second byte starts with the leading bit pattern 10. Combining the five remaining significant bits of the first byte with the six significant bits of the second byte gives a total of eleven bits of information for a two-byte UTF-8 sequence. The first byte of a three-byte encoding is distinguished by the leading bit pattern 1110. The subsequent two bytes each start with the leading bit pattern 10. Combining the four significant bits of the first byte with the six significant bits of each subsequent byte gives a total of 16 significant bits, as required to represent UCS-2.

The one-byte patterns are the ASCII codes, so ASCII is a subset of UTF-8. Presumably the ASCII codes are also the most frequent, making UTF-8 more space-efficient than UCS-2.
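The byte layouts described above can be sketched in Java. This is a minimal illustration, not a production encoder: the class name Utf8Sketch and the method encode are made up for this example, and the sketch does not treat any range of 16-bit values specially.

```java
final class Utf8Sketch {
    // Hypothetical helper: encode one 16-bit code (0x0000..0xFFFF)
    // as the shortest UTF-8 sequence of one, two, or three bytes.
    static byte[] encode(int code) {
        if (code < 0 || code > 0xFFFF) {
            throw new IllegalArgumentException("not a 16-bit code: " + code);
        }
        if (code < 0x80) {
            // 0xxxxxxx: seven bits of information (the ASCII codes)
            return new byte[] { (byte) code };
        } else if (code < 0x800) {
            // 110xxxxx 10xxxxxx: five plus six = eleven bits
            return new byte[] {
                (byte) (0xC0 | (code >> 6)),
                (byte) (0x80 | (code & 0x3F))
            };
        } else {
            // 1110xxxx 10xxxxxx 10xxxxxx: four plus six plus six = 16 bits
            return new byte[] {
                (byte) (0xE0 | (code >> 12)),
                (byte) (0x80 | ((code >> 6) & 0x3F)),
                (byte) (0x80 | (code & 0x3F))
            };
        }
    }

    public static void main(String[] args) {
        // One sample code for each sequence length.
        for (int code : new int[] { 0x7B, 0xE9, 0x20AC }) {
            StringBuilder hex = new StringBuilder();
            for (byte b : encode(code)) {
                hex.append(String.format("%02X ", b & 0xFF));
            }
            System.out.println(String.format("U+%04X -> %s", code, hex.toString().trim()));
        }
    }
}
```

Run as written, the three sample codes come out as 7B, C3 A9, and E2 82 AC. Choosing the branch by magnitude is what enforces the shortest-sequence rule: a value below 0x80 never reaches the two-byte branch.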

UTF-8 seems arbitrary, but it is a modification of a regular scheme. Let each byte be either the first of a sequence or in the body of a sequence. Let the number of leading 1 bits in the first byte signify the length of the encoding. Since every sequence has length at least one, a leading 0 would not make sense in the first byte, so use it to identify the bytes in the body. That is UTF-8, except that the codes for the "body" bytes and the "first byte of a sequence of length one" have been swapped to accommodate ASCII.
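The role of the leading 1 bits can be illustrated with a small Java sketch that classifies a single byte. The names Utf8Lead, leadingOnes, and sequenceLength are hypothetical, and returning 0 for a body byte is a convention chosen for this example only.

```java
final class Utf8Lead {
    // Count the leading 1 bits in the low eight bits of b.
    // Shifting left by 24 discards any sign-extension bits.
    static int leadingOnes(int b) {
        return Integer.numberOfLeadingZeros(~(b << 24));
    }

    // Classify a byte under UTF-8 as described in the text:
    // returns 1, 2, or 3 for the first byte of a sequence of that
    // length, and 0 (an assumption of this sketch) for a body byte.
    static int sequenceLength(int b) {
        switch (leadingOnes(b)) {
            case 0:  return 1; // 0xxxxxxx: one-byte (ASCII) code
            case 1:  return 0; // 10xxxxxx: body byte
            case 2:  return 2; // 110xxxxx: first of two
            case 3:  return 3; // 1110xxxx: first of three
            default:
                throw new IllegalArgumentException("not used for 16-bit codes");
        }
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength(0x7B)); // one-byte code
        System.out.println(sequenceLength(0xC3)); // first of two
        System.out.println(sequenceLength(0xA9)); // body byte
    }
}
```

Cases 0 and 1 are where the swap shows up. In the regular scheme a count of one would mean "sequence of length one" and a count of zero would mean "body"; UTF-8 exchanges the two so that the ASCII codes keep their seven-bit form.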

Java also provides an ASCII-based escape encoding (\uxxxx, where the xxxx are hex digits), which allows any UCS-2 character to be inserted in input, including Java program text. For example, the sequence \u007b in Java source text is equivalent to the ASCII character {. Not all input devices have the \ character. The consequence is that these devices cannot use the \uxxxx style escape to prepare arbitrary Unicode, which may be a problem for the current Java Language definition. I have seen a proposal from Europe to allow the form ??xxxx in addition to \uxxxx.
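The equivalence between \u007b and { can be seen in a short Java sketch. The class name EscapeSketch and the method brace are invented for this illustration. Because the escape is translated before the compiler looks at the program text, it can stand in for a brace in the code itself, not only inside a character or string literal.

```java
final class EscapeSketch {
    // The escape inside the literal is translated to the ASCII
    // left brace before the compiler tokenizes this line.
    static char brace() {
        return '\u007b';
    }

    // On the next two marked lines, escapes stand in for the
    // opening and closing braces of the body of main.
    public static void main(String[] args) \u007b
        System.out.println(brace() == '{'); // prints true
    \u007d
}
```

The same early translation means an ill-formed escape is rejected even inside a comment, so comments in Java source need the same care as code.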