8. Character encoding
8.1. Good reads
8.2. Linux terminal and encoding
8.2.1. POSIX locale environment variables
On POSIX systems, a set of environment variables describes the locale and encoding configured for the system [1].
The LANG, LC_ALL and LC_CTYPE environment variables are the ones relevant to encoding. Each of their values follows the pattern $language[_$territory][.$codeset][@$modifier], for example en_US.UTF-8.
The $codeset part names the configured encoding. For terminal I/O, it should be read from the first of these variables that is set and non-empty, in this order of precedence [3]: LC_ALL, then LC_CTYPE, then LANG. Each variable is described below:
- LC_ALL: When set, the value of this variable overrides the values of all other LC_* variables.
- LC_CTYPE: Controls character classification and case conversion (for example, upper to lowercase), and with them the encoding used to interpret text.
- LANG: If LC_ALL is not set, locale parameters whose corresponding LC_* variables are not set default to the value of LANG.
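The lookup can be sketched in Python as follows. This is a minimal illustration, not the code of any particular library: the function names parse_codeset and resolve_codeset are hypothetical, and the sketch assumes that an empty variable counts as unset and that a value without a codeset part (such as C) yields no encoding.

```python
import os


def parse_codeset(locale_value):
    """Return the codeset part of a locale string, or None if absent."""
    # Drop the optional @modifier first, then look for the .codeset part.
    base = locale_value.split("@", 1)[0]
    if "." not in base:
        return None
    return base.split(".", 1)[1] or None


def resolve_codeset(environ=None):
    """Return the codeset of the first set, non-empty locale variable."""
    if environ is None:
        environ = os.environ
    # Precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(name, "")
        if value:  # assumption: an empty value is treated as unset
            return parse_codeset(value)
    return None
```

Passing a plain dictionary instead of os.environ makes the precedence easy to check: with LANG=en_US.UTF-8 and LC_ALL=fr_FR.ISO-8859-1, the sketch returns ISO-8859-1, because LC_ALL wins.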
8.2.2. Linux terminal
If the console is in UTF-8 mode (see unicode_start(1)), the kernel expects user program output to be encoded as UTF-8 (see utf-8(7)) and converts it to Unicode (UCS-2). Otherwise, a translation table maps the 8-bit program output to 16-bit Unicode values. Such a translation table is called a Unicode console map. There are four of them: three are built into the kernel, and the fourth can be set with the -m option of setfont. An escape sequence selects one of these four tables; after loading a map, setfont outputs the escape sequence Esc ( K, which makes it the active translation.
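The two paths described above can be modelled with a short Python sketch. It only illustrates the idea and does not reproduce kernel code: CONSOLE_MAP is a hypothetical table with two sample entries (it is not one of the kernel's built-in maps), to_unicode is a made-up name, and mapping unknown bytes to U+FFFD is an assumption made for the example.

```python
# Hypothetical 8-bit console map: ASCII maps to itself, plus two sample entries.
CONSOLE_MAP = {byte: byte for byte in range(0x80)}
CONSOLE_MAP.update({
    0xE9: 0x00E9,  # sample entry: byte 0xE9 -> U+00E9 (e with acute accent)
    0xB0: 0x2591,  # sample entry: byte 0xB0 -> U+2591 (light shade block)
})


def to_unicode(output: bytes, utf8_mode: bool) -> str:
    """Convert program output to Unicode text, mimicking the two console modes."""
    if utf8_mode:
        # UTF-8 mode: the output is expected to be valid UTF-8.
        return output.decode("utf-8")
    # Otherwise: each 8-bit value is looked up in the translation table.
    # Assumption: bytes missing from the table become U+FFFD.
    return "".join(chr(CONSOLE_MAP.get(byte, 0xFFFD)) for byte in output)
```

The same character therefore needs different bytes depending on the mode: in UTF-8 mode "é" is sent as the two bytes 0xC3 0xA9, while with this sample map it is the single byte 0xE9.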
8.3. Program encoding strategy
[1] | POSIX.1-2008, sec. 7.1 |
[2] | Introduction to Unicode — Using Unicode in Linux |
[3] | This is how ncurses, tcell and other popular terminal libraries proceed. |