About character encodings, UTF, UCS, et cetera in 21 lines of text …

July 22, 2011

Filed under: Web Application Technologies — Andrew @ 8:43 pm

Almost every month I see discussions about character encodings on various software forums. Most of what is said there consists of pure urban myths, legends, hoaxes, and rumors. Frankly, I am surprised how many software developers simply have no idea about the subject.

Here is my attempt to define all these terms in one place and in compact form:

  1. Windows-1251, UTF-8, ASCII, KOI8, etc. are all “transport” encodings of UNICODE code points. An encoding defines the format of transmission (or storage) of a text – a meaningful sequence of characters of human language(s).
  2. A UNICODE code point is a 21-bit number – the index of a character in the UNICODE database (table).
  3. Each encoding is characterized by its code unit.
  4. A code unit is the smallest indivisible element of the sequence. In most encodings the code unit is a byte – an 8-bit number. But there are exceptions, for example: ASCII-7 – 7 bits, UTF-16 – 16 bits, UTF-32 – a 32-bit integer.
  5. An encoding can be full – covering the whole UNICODE range (e.g. UTF-8) – or partial (for example ASCII), mapping only a subset of UNICODE code points to its code units. Strictly speaking, any official encoding like ASCII, Windows-1251, etc. is a UNICODE encoding if there is an official and/or well-known definition of how its code units map to UNICODE.
  6. An encoding may use a variable number of code units per single code point: UTF-8, UTF-16, GB18030, etc. And there are “fixed” encodings with a 1:1 mapping of code units to UNICODE code points: ASCII, ISO/IEC 8859-1, Windows-1252, and so on.
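Points 4 and 6 are easy to check for yourself. Here is a minimal sketch in modern JavaScript – assuming a Node.js environment and its Buffer API, neither of which appears in the text above – showing that the same code point takes a different number of code units depending on the encoding:

```javascript
// Encode the same text in different encodings and count code units.
// "Ж" is U+0416 (CYRILLIC CAPITAL LETTER ZHE) – a single code point.
const zhe = "Ж";

// UTF-8 is variable-length: U+0416 needs two 8-bit code units.
console.log(Buffer.from(zhe, "utf8").length);        // 2 bytes: 0xD0 0x96

// UTF-16 (little-endian here): U+0416 fits in one 16-bit code unit.
console.log(Buffer.from(zhe, "utf16le").length / 2); // 1 code unit

// "😀" is U+1F600 – outside the BMP, so even UTF-16 needs
// two code units (a surrogate pair), and UTF-8 needs four bytes.
const emoji = "😀";
console.log(Buffer.from(emoji, "utf16le").length / 2); // 2 code units
console.log(Buffer.from(emoji, "utf8").length);        // 4 bytes
```

One code point, anywhere from one to four code units – which is exactly why “number of bytes” and “number of characters” must never be confused.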

About UCS-2 and UCS-4.

  • UCS-2 is a 16-bit subset of the big UNICODE table. It is sometimes used as a synonym for the BMP range – the Basic Multilingual Plane.
  • UCS-4 is the full range of the UNICODE table (a 32-bit number of which 21 bits are used).
  • UCS-2 and UCS-4 are not encodings. They are just names of historic ranges of character codes (UNICODE code points).
  • Some examples:
    • The JavaScript standard (ECMA-262) defines String instances as sequences of UCS-2 (!) codes. So a character code in JS is limited to 0xFFFF.
    • In my TIScript a string is a UTF-16 sequence, so it can operate on the full UNICODE range. Thus str.length can be larger than the number of characters in the string (e.g. for some Far East texts). str[i] gives you a number from 0 to 0xFFFF – the value of a UTF-16 code unit. But if you write:
        for (var codePoint in "...str...")
            stdout.printf("%d ", codePoint);
      

      you will get the sequence of real UNICODE code points of the string.
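As an aside – the text above is about TIScript, not browser JS, but later editions of ECMA-262 (ES2015 and onward) grew the same distinction: `length` and indexing count UTF-16 code units, while `for...of` and `codePointAt` walk real code points. A small sketch, runnable in Node.js:

```javascript
// "𝕊" (U+1D54A) lies outside the BMP, so UTF-16 stores it
// as a surrogate pair – two 16-bit code units.
const s = "a𝕊b";

console.log(s.length);                      // 4 – code units, not characters
console.log(s.charCodeAt(1).toString(16));  // d835 – the high surrogate

// for...of iterates real code points, like the TIScript loop above.
for (const ch of s) {
  console.log(ch.codePointAt(0).toString(16));
}
// prints 61, 1d54a, 62 – three code points
```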

And that is pretty much it. Not rocket science, is it?