Monday, January 28, 2008

A tutorial on character code issues

The basics
In computers and in data transmission between them, i.e. in digital data processing and transfer, data is internally presented as octets, as a rule. An octet is a small unit of data with a numerical value between 0 and 255, inclusively.

The numerical values are presented in the normal (decimal) notation here, but notice that other presentations are used too, especially octal (base 8) or hexadecimal (base 16) notation. Octets are often called bytes, but in principle, octet is a more definite concept than byte. Internally, octets consist of eight bits (hence the name, from Latin octo 'eight'), but we need not go into bit level here. However, you might need to know what the phrase "first bit set" or "sign bit set" means, since it is often used. In terms of numerical values of octets, it means that the value is greater than 127. In various contexts, such octets are sometimes interpreted as negative numbers, and this may cause various problems.
Different conventions can be established as regards to how an octet or a sequence of octets presents some data. For instance, four consecutive octets often form a unit that presents a real number according to a specific standard. We are here interested in the presentation of character data (or string data; a string is a sequence of characters) only.
In the simplest case, which is still widely used, one octet corresponds to one character according to some mapping table (encoding). Naturally, this allows at most 256 different characters being represented. There are several different encodings, such as the well-known ASCII encoding and the ISO Latin family of encodings. The correct interpretation and processing of character data of course requires knowledge about the encoding used. For HTML documents, such information should be sent by the Web server along with the document itself, using so-called HTTP headers (cf. to MIME headers).
Previously the ASCII encoding was usually assumed by default (and it is still very common). Nowadays ISO Latin 1, which can be regarded as an extension of ASCII, is often the default. The current trend is to avoid giving such a special position to ISO Latin 1 among the variety of encodings.
Definitions
The following definitions are not universally accepted and used. In fact, one of the greatest causes of confusion around character set issues is that terminology varies and is sometimes misleading.
character repertoire
A set of distinct characters. No specific internal presentation in computers or data transfer is assumed. The repertoire per se does not even define an ordering for the characters; ordering for sorting and other purposes is to be specified separately. A character repertoire is usually defined by specifying names of characters and a sample (or reference) presentation of characters in visible form. Notice that a character repertoire may contain characters which look the same in some presentations but are regarded as logically distinct, such as Latin uppercase A, Cyrillic uppercase A, and Greek uppercase alpha. For more about this, see a discussion of the character concept later in this document.
character code
A mapping, often presented in tabular form, which defines a one-to-one correspondence between characters in a character repertoire and a set of nonnegative integers. That is, it assigns a unique numerical code, a code position, to each character in the repertoire. In addition to being often presented as one or more tables, the code as a whole can be regarded as a single table and the code positions as indexes. As synonyms for "code position", the following terms are also in use: code number, code value, code element, code point, code set value - and just code. Note: The set of nonnegative integers corresponding to characters need not consist of consecutive numbers; in fact, most character codes have "holes", such as code positions reserved for control functions or for eventual future use to be defined later.
character encoding
A method (algorithm) for presenting characters in digital form by mapping sequences of code numbers of characters into sequences of octets. In the simplest case, each character is mapped to an integer in the range 0 - 255 according to a character code and these are used as such as octets. Naturally, this only works for character repertoires with at most 256 characters. For larger sets, more complicated encodings are needed. Encodings have names, which can be registered.
Notice that a character code assumes or implicitly defines a character repertoire. A character encoding could, in principle, be viewed purely as a method of mapping a sequence of integers to a sequence of octets. However, quite often an encoding is specified in terms of a character code (and the implied character repertoire). The logical structure is still the following:
A character repertoire specifies a collection of characters, such as "a", "!", and "ä".
A character code defines numeric codes for characters in a repertoire. For example, in the ISO 10646 character code the numeric codes for "a", "!", "ä", and "‰" (per mille sign) are 97, 33, 228, and 8240. (Note: Especially the per mille sign, presenting 0/00 as a single character, can be shown incorrectly on display or on paper. That would be an illustration of the symptoms of the problems we are discussing.)
A character encoding defines how sequences of numeric codes are presented as (i.e., mapped to) sequences of octets. In one possible encoding for ISO 10646, the string a!ä‰ is presented as the following sequence of octets (using two octets for each character): 0, 97, 0, 33, 0, 228, 32, 48.
For a more rigorous explanation of these basic concepts, see Unicode Technical Report #17: Character Encoding Model.
The phrase character set is used in a variety of meanings. It might denotes just a character repertoire but it may also refer to a character code, and quite often a particular character encoding is implied too.
Unfortunately the word charset is used to refer to an encoding, causing much confusion. It is even the official term to be used in several contexts by Internet protocols, in MIME headers.
Quite often the choice of a character repertoire, code, or encoding is presented as the choice of a language. For example, Web browsers typically confuse things quite a lot in this area. A pulldown menu in a program might be labeled "Languages", yet consist of character encoding choices (only). A language setting is quite distinct from character issues, although naturally each language has its own requirements on character repertoire. Even more seriously, programs and their documentation very often confuse the above-mentioned issues with the selection of a font.
Examples of character codes
Good old ASCII
The basics of ASCII
The name ASCII, originally an abbreviation for "American Standard Code for Information Interchange", denotes an old character repertoire, code, and encoding.
Most character codes currently in use contain ASCII as their subset in some sense. ASCII is the safest character repertoire to be used in data transfer. However, not even all ASCII characters are "safe"!
ASCII has been used and is used so widely that often the word ASCII refers to "text" or "plain text" in general, even if the character code is something else! The words "ASCII file" quite often mean any text file as opposite to a binary file.
The definition of ASCII also specifies a set of control codes ("control characters") such as linefeed (LF) and escape (ESC). But the character repertoire proper, consisting of the printable characters of ASCII, is the following (where the first item is the blank, or space, character):
! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { } ~
The appearance of characters varies, of course, especially for some special characters. Some of the variation and other details are explained in The ISO Latin 1 character repertoire - a description with usage notes.
A formal view on ASCII
The character code defined by the ASCII standard is the following: code values are assigned to characters consecutively in the order in which the characters are listed above (rowwise), starting from 32 (assigned to the blank) and ending up with 126 (assigned to the tilde character ~). Positions 0 through 31 and 127 are reserved for control codes. They have standardized names and descriptions, but in fact their usage varies a lot.
The character encoding specified by the ASCII standard is very simple, and the most obvious one for any character code where the code numbers do not exceed 255: each code number is presented as an octet with the same value.
Octets 128 - 255 are not used in ASCII. (This allows programs to use the first, most significant bit of an octet as a parity bit, for example.)
National variants of ASCII
There are several national variants of ASCII. In such variants, some special characters have been replaced by national letters (and other symbols). There is great variation here, and even within one country and for one language there might be different variants. The original ASCII is therefore often referred to as US-ASCII; the formal standard (by ANSI) is ANSI X3.4-1986.
The phrase "original ASCII" is perhaps not quite adequate, since the creation of ASCII started in late 1950s, and several additions and modifications were made in the 1960s. The 1963 version had several unassigned code positions. The ANSI standard, where those positions were assigned, mainly to accommodate lower case letters, was approved in 1967/1968, later modified slightly. For the early history, including pre-ASCII character codes, see Steven J. Searle's A Brief History of Character Codes in North America, Europe, and East Asia and Tom Jennings' ASCII: American Standard Code for Information Infiltration. See also Jim Price's ASCII Chart, Mary Brandel's 1963: ASCII Debuts, and the computer history documents, including the background and creation of ASCII, written by Bob Bemer, "father of ASCII".
The international standard ISO 646 defines a character set similar to US-ASCII but with code positions corresponding to US-ASCII characters @[\]{} as "national use positions". It also gives some liberties with characters #$^`~. The standard also defines "international reference version (IRV)", which is (in the 1991 edition of ISO 646) identical to US-ASCII. Ecma International has issued the ECMA-6 standard, which is equivalent to ISO 646 and is freely available on the Web.
Within the framework of ISO 646, and partly otherwise too, several "national variants of ASCII" have been defined, assigning different letters and symbols to the "national use" positions. Thus, the characters that appear in those positions - including those in US-ASCII - are somewhat "unsafe" in international data transfer, although this problem is losing significance. The trend is towards using the corresponding codes strictly for US-ASCII meanings; national characters are handled otherwise, giving them their own, unique and universal code positions in character codes larger than ASCII. But old software and devices may still reflect various "national variants of ASCII".
The following table lists ASCII characters which might be replaced by other characters in national variants of ASCII. (That is, the code positions of these US-ASCII characters might be occupied by other characters needed for national use.) The lists of characters appearing in national variants are not intended to be exhaustive, just typical examples.
dec oct hex glyph official Unicode name National variants
35 43 23 # number sign £ Ù
36 44 24 $ dollar sign ¤
64 100 40 @ commercial at É § Ä à ³
91 133 5B [ left square bracket Ä Æ ° â ¡ ÿ é
92 134 5C \ reverse solidus Ö Ø ç Ñ ½ ¥
93 135 5D ] right square bracket Å Ü § ê é ¿
94 136 5E ^ circumflex accent Ü î
95 137 5F _ low line è
96 140 60 ` grave accent é ä µ ô ù
123 173 7B { left curly bracket ä æ é à ° ¨
124 174 7C vertical line ö ø ù ò ñ f
125 175 7D } right curly bracket å ü è ç ¼
126 176 7E ~ tilde ü ¯ ß ¨ û ì ´ _
Almost all of the characters used in the national variants have been incorporated into ISO Latin 1. Systems that support ISO Latin 1 in principle may still reflect the use of national variants of ASCII in some details; for example, an ASCII character might get printed or displayed according to some national variant. Thus, even "plain ASCII text" is thereby not always portable from one system or application to another.
Control characters (control codes)
The rôle of the so-called control characters in character codes is somewhat obscure. Character codes often contain code positions which are not assigned to any visible character but reserved for control purposes. For example, in communication between a terminal and a computer using the ASCII code, the computer could regard octet 3 as a request for terminating the currently running process. Some older character code standards contain explicit descriptions of such conventions whereas newer standards just reserve some positions for such usage, to be defined in separate standards or agreements such as "C0 controls" (tabulated in my document on ASCII control codes) and "C1 controls", or specifically ISO 6429. And although the definition quoted above suggests that "control characters" might be regarded as characters in the Unicode terminology, perhaps it is more natural to regard them as control codes.
Control codes can be used for device control such as cursor movement, page eject, or changing colors. Quite often they are used in combination with codes for graphic characters, so that a device driver is expected to interpret the combination as a specific command and not display the graphic character(s) contained in it. For example, in the classical VT100 controls, ESC followed by the code corresponding to the letter "A" or something more complicated (depending on mode settings) moves the cursor up. To take a different example, the Emacs editor treats ESC A as a request to move to the beginning of a sentence. Note that the ESC control code is logically distinct from the ESC key in a keyboard, and many other things than pressing ESC might cause the ESC control code to be sent. Also note that phrases like "escape sequences" are often used to refer to things that don't involve ESC at all and operate at a quite different level. Bob Bemer, the inventor of ESC, has written a "vignette" about it: That Powerful ESCAPE Character -- Key and Sequences.
One possible form of device control is changing the way a device interprets the data (octets) that it receives. For example, a control code followed by some data in a specific format might be interpreted so that any subsequent octets to be interpreted according to a table identified in some specific way. This is often called "code page switching", and it means that control codes could be used change the character encoding. And it is then more logical to consider the control codes and associated data at the level of fundamental interpretation of data rather than direct device control. The international standard ISO 2022 defines powerful facilities for using different 8-bit character codes in a document.
Widely used formatting control codes include carriage return (CR), linefeed (LF), and horizontal tab (HT), which in ASCII occupy code positions 13, 10, and 9. The names (or abbreviations) suggest generic meanings, but the actual meanings are defined partly in each character code definition, partly - and more importantly - by various other conventions "above" the character level. The "formatting" codes might be seen as a special case of device control, in a sense, but more naturally, a CR or a LF or a CR LF pair (to mention the most common conventions) when used in a text file simply indicates a new line. As regards to control codes used for line structuring, see Unicode technical report #13 Unicode Newline Guidelines. See also my Unicode line breaking rules: explanations and criticism. The HT (TAB) character is often used for real "tabbing" to some predefined writing position. But it is also used e.g. for indicating data boundaries, without any particular presentational effect, for example in the widely used "tab separated values" (TSV) data format.
A control code, or a "control character" cannot have a graphic presentation (a glyph) in the same way as normal characters have. However, in Unicode there is a separate block Control Pictures which contains characters that can be used to indicate the presence of a control code. They are of course quite distinct from the control codes they symbolize - U+241B symbol for escape is not the same as U+001B escape! On the other hand, a control code might occasionally be displayed, by some programs, in a visible form, perhaps describing the control action rather than the code. For example, upon receiving octet 3 in the example situation above, a program might echo back (onto the terminal) *** or INTERRUPT or ^C. All such notations are program-specific conventions. Some control codes are sometimes named in a manner which seems to bind them to characters. In particular, control codes 1, 2, 3, ... are often called control-A, control-B, control-C, etc. (or CTRL-A or C-A or whatever). This is associated with the fact that on many keyboards, control codes can be produced (for sending to a computer) using a special key labeled "Control" or "Ctrl" or "CTR" or something like that together with letter keys A, B, C, ... This in turn is related to the fact that the code numbers of characters and control codes have been assigned so that the code of "Control-X" is obtained from the code of the upper case letter X by a simple operation (subtracting 64 decimal). But such things imply no real relationships between letters and control codes. The control code 3, or "Control-C", is not a variant of letter C at all, and its meaning is not associated with the meaning of C.

Artikel yang Berkaitan

0 komentar:

Post a Comment