8. Character encoding

8.1. Good reads

8.2. Linux terminal and encoding

8.2.1. POSIX locale environment variables

On POSIX systems, a set of environment variables describes the locale and encoding configured for the system [1]. For encoding, the relevant ones are LANG, LC_ALL and LC_CTYPE. Each of their values follows this pattern:

$lang[.$codeset][@$variant]
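
As an illustration, a value in this pattern can be split into its three parts with a small regular expression. This is a minimal sketch: the names `LocaleValue` and `parse_locale_value` are hypothetical, and the expression only assumes that `.` introduces the codeset and `@` introduces the variant.

```python
import re
from typing import NamedTuple, Optional


class LocaleValue(NamedTuple):
    """Hypothetical container for the parts of a locale variable value."""
    lang: str
    codeset: Optional[str]
    variant: Optional[str]


# lang is mandatory; ".codeset" and "@variant" are both optional.
_LOCALE_RE = re.compile(
    r"^(?P<lang>[^.@]+)(?:\.(?P<codeset>[^@]+))?(?:@(?P<variant>.+))?$"
)


def parse_locale_value(value: str) -> LocaleValue:
    """Split a value such as 'fr_FR.UTF-8@euro' into its parts."""
    match = _LOCALE_RE.match(value)
    if match is None:
        raise ValueError(f"not a locale value: {value!r}")
    return LocaleValue(
        match.group("lang"), match.group("codeset"), match.group("variant")
    )
```

For example, `parse_locale_value("en_US.UTF-8")` yields a `lang` of `en_US`, a `codeset` of `UTF-8` and no variant.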

The $codeset part names the configured encoding. For terminal I/O, it should be read from the first of the following variables that is set and non-empty [3]:

LC_ALL
LC_CTYPE
LANG

LANG: If LC_ALL is not set, locale parameters whose corresponding LC_* variables are not set default to the value of LANG.

LC_ALL: When set, the value of this variable overrides the values of all other LC_* variables.

LC_CTYPE: Controls character classification and case conversion (upper to lowercase and back), and determines how byte sequences are interpreted as characters.
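
The lookup order LC_ALL, LC_CTYPE, LANG can be sketched as follows. The function name `resolve_codeset` is hypothetical, and the sketch assumes that the first variable that is set and non-empty decides the result, even when its value carries no codeset part.

```python
import os
from typing import Mapping, Optional


def resolve_codeset(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the codeset named by LC_ALL, LC_CTYPE or LANG, in that order."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if not value:
            continue  # unset or empty: fall through to the next variable
        # Drop an optional "@variant" suffix, then look for ".codeset".
        without_variant = value.split("@", 1)[0]
        if "." in without_variant:
            return without_variant.split(".", 1)[1]
        return None  # e.g. "C" or "POSIX": no explicit codeset
    return None
```

Passing a plain dictionary instead of `os.environ` makes the precedence easy to check: `resolve_codeset({"LC_ALL": "fr_FR.ISO-8859-1", "LANG": "en_US.UTF-8"})` returns `"ISO-8859-1"`.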

8.2.2. Linux terminal

If the console is in utf8 mode (see unicode_start(1)) then the kernel expects that user program output is coded as UTF-8 (see utf-8(7)), and converts that to Unicode (ucs2). Otherwise, a translation table is used from the 8-bit program output to 16-bit Unicode values. Such a translation table is called a Unicode console map. There are four of them: three built into the kernel, the fourth settable using the -m option of setfont. An escape sequence chooses between these four tables; after loading a cmap, setfont will output the escape sequence Esc ( K that makes it the active translation.

—setfont(8)
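
The two paths described in the quotation can be modelled with a toy function. This is only an illustration of the idea, not the kernel's implementation: `console_to_unicode` and its parameters are hypothetical, and falling back to the byte value for unmapped bytes is an assumption made for the sketch.

```python
from typing import List, Mapping


def console_to_unicode(
    data: bytes, utf8_mode: bool, console_map: Mapping[int, int]
) -> List[int]:
    """Toy model: turn program output bytes into Unicode code points."""
    if utf8_mode:
        # UTF-8 mode: the byte stream is decoded as UTF-8.
        return [ord(ch) for ch in data.decode("utf-8")]
    # Otherwise each 8-bit value is looked up in a translation table.
    # Assumption: unmapped bytes keep their own value.
    return [console_map.get(byte, byte) for byte in data]
```

With an illustrative one-entry table that maps byte 0x82 to U+00E9, both `console_to_unicode("é".encode("utf-8"), True, {})` and `console_to_unicode(b"\x82", False, {0x82: 0xE9})` produce the same code point.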

8.2.3. Linux commands

Some command-line utilities have problems with multibyte characters. For example, tr always assumes that one character is represented as one byte, regardless of the locale.

—Introduction to Unicode — Using Unicode in Linux[2]
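
The effect of treating one byte as one character can be reproduced in a few lines. The sketch below contrasts a byte-wise translation with a character-wise one. Both function names are hypothetical, and padding the destination set by repeating its last byte is an assumption made to keep the two sets the same length, not a statement about how any particular tool is implemented.

```python
def bytewise_tr(data: bytes, src: bytes, dst: bytes) -> bytes:
    """Translate byte by byte, as a tool unaware of multibyte characters would."""
    if not dst:
        raise ValueError("dst must not be empty")
    if len(dst) < len(src):
        # Assumption: pad dst by repeating its last byte.
        dst = dst + dst[-1:] * (len(src) - len(dst))
    table = bytes.maketrans(src, dst[: len(src)])
    return data.translate(table)


def charwise_tr(text: str, src: str, dst: str) -> str:
    """Translate character by character on decoded text."""
    return text.translate(str.maketrans(src, dst))
```

In UTF-8, "é" occupies two bytes, so `bytewise_tr("café".encode("utf-8"), "é".encode("utf-8"), b"e")` replaces each of them and returns `b"cafee"`, whereas `charwise_tr("café", "é", "e")` returns `"cafe"`.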

8.3. Program encoding strategy

[1]	POSIX.1-2008, sec. 7.1

[2]	Introduction to Unicode — Using Unicode in Linux

[3]	This is how ncurses, tcell and other popular terminal libraries proceed.