July 22, 2011
Almost every month I see discussions about character encodings on various software forums. Most of the time they are full of urban myths, legends, hoaxes and rumors. I am honestly surprised by how many software developers simply have no idea about the subject.
Here is my attempt to define all these terms in a single place and in compact form:
- Windows-1251, UTF-8, ASCII, KOI8, etc. are all “transport” encodings of UNICODE code points. An encoding defines the format for transmission (or storage) of text – a meaningful sequence of characters of human language(s).
- A UNICODE code point is a 21-bit number (0 to 0x10FFFF) – the index of a character in the UNICODE database (table).
- Each encoding is characterized by its code unit.
- Code unit – the smallest indivisible element of the encoded sequence. In most encodings the code unit is a byte – an 8-bit number. But there are exceptions. For example: 7-bit ASCII – 7 bits, UTF-16 – 16 bits, UTF-32 – 32 bits.
- An encoding can be full – it covers the whole UNICODE range (e.g. UTF-8) – or partial (for example ASCII), mapping only a subset of UNICODE code points to the code units of that particular encoding. Strictly speaking, any official encoding like ASCII, Windows-1251, etc. is a UNICODE encoding if it has an official and/or well-known mapping of its code units to UNICODE.
- An encoding may use a variable number of code units per single UNICODE code point: UTF-8, UTF-16, GB18030, etc. And there are “fixed” encodings with a 1:1 mapping of code units to UNICODE code points: ASCII, ISO/IEC 8859-1, Windows-1252, and so on.
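To make the points above concrete, here is a minimal Python sketch (Python rather than TIScript, purely for illustration) that counts how many code units a few encodings need for a single code point. The helper name code_units and the UNIT_SIZE table are my own assumptions, not part of any library; ASCII is counted as one byte per code unit because that is how Python stores it, even though only 7 bits are significant.

```python
# Size of one code unit, in bytes, for a few encodings known to Python.
# UNIT_SIZE and code_units() are hypothetical names used only in this sketch.
UNIT_SIZE = {"ascii": 1, "cp1251": 1, "utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}


def code_units(ch, encoding):
    """Return how many code units `encoding` uses for the character `ch`,
    or None when a partial encoding has no mapping for it."""
    try:
        data = ch.encode(encoding)
    except UnicodeEncodeError:
        return None
    return len(data) // UNIT_SIZE[encoding]


# U+0041 (Latin A), U+042F (Cyrillic Ya), U+1F600 (an emoji outside the BMP)
samples = ["A", "\u042f", "\U0001F600"]

for ch in samples:
    row = {enc: code_units(ch, enc) for enc in UNIT_SIZE}
    print(f"U+{ord(ch):04X}", row)
```

The partial encodings return None for characters they cannot represent, the variable-width ones (UTF-8, UTF-16) report more than one code unit for some code points, and UTF-32 always reports exactly one.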
About UCS-2 and UCS-4.
- UCS-2 is a 16-bit subset of the big UNICODE table. It is sometimes used as a synonym for the BMP range – the Basic Multilingual Plane.
- UCS-4 is the full range of the UNICODE table (a 32-bit number of which 21 bits are used).
- UCS-2 and UCS-4 are not encodings. These are just names of historic ranges of character codes (UNICODE code points).
- As an example:
- In my TIScript a string is a UTF-16 sequence, so it can handle the full UNICODE range. Thus
str.length can be larger than the number of characters in the string (e.g. for some Far East texts).
str[i] will give you a number from 0 to 0xFFFF – the value of a UTF-16 code unit. But if you write:
for(var codePoint in "...str..." ) stdout.printf("%d ", codePoint);
you will get the sequence of real UNICODE code points from the string.
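The same distinction between UTF-16 code units and code points can be sketched outside TIScript. The Python below is only an illustration under my own assumptions: utf16_units and code_points are hypothetical helper names, and this is not how TIScript implements its strings. It encodes a string to UTF-16, lists the 16-bit code units, and then recombines surrogate pairs into code points.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of 16-bit integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def code_points(units):
    """Combine surrogate pairs in a UTF-16 code unit sequence into code points."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Two code units form one code point above U+FFFF.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


text = "a\U00020000"  # 'a' followed by U+20000, a CJK ideograph outside the BMP
units = utf16_units(text)

print(len(units))                              # number of UTF-16 code units
print([hex(u) for u in units])                 # individual code units
print([hex(cp) for cp in code_points(units)])  # real code points
```

Here the string holds two characters but three UTF-16 code units, which is the situation where a length counted in code units exceeds the number of characters.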
And that is pretty much it. Not rocket science, is it?