Locale encodings (GNU libunistring)

1.3 Locale encodings

A locale is a set of cultural conventions. According to POSIX, a program has, at any moment, one locale designated as the “current locale”. (Actually, POSIX also supports one locale per thread, but this feature is not yet universally implemented and not widely used.) The locale is partitioned into several aspects, called the “categories” of the locale. The main categories are:

The character encoding and the character properties. This is the LC_CTYPE category.
The sorting rules for text. This is the LC_COLLATE category.
The language-specific translations of messages. This is the LC_MESSAGES category.
The formatting rules for numbers, such as the decimal separator. This is the LC_NUMERIC category.
The formatting rules for amounts of money. This is the LC_MONETARY category.
The formatting of date and time. This is the LC_TIME category.
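
Because the categories are independent, a program can set or query each one separately with the standard setlocale function. The following is a minimal sketch, not part of libunistring; the helper names category_is_c_locale and use_environment_locale_with_c_numbers are hypothetical and chosen only for illustration.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the given category is currently
   set to the "C" (also reported as "POSIX") locale, 0 otherwise.  */
int
category_is_c_locale (int category)
{
  /* A NULL locale argument queries the category without changing it.  */
  const char *name = setlocale (category, NULL);
  return name != NULL
         && (strcmp (name, "C") == 0 || strcmp (name, "POSIX") == 0);
}

/* Hypothetical helper: adopts the locale from the environment for all
   categories, then overrides only LC_NUMERIC with the "C" locale.
   Returns 0 on success, -1 on failure.  */
int
use_environment_locale_with_c_numbers (void)
{
  /* "" selects the locale named by the environment variables.
     If that locale is not available, fall back to "C".  */
  if (setlocale (LC_ALL, "") == NULL)
    setlocale (LC_ALL, "C");

  /* Changing one category leaves the other categories untouched.  */
  return setlocale (LC_NUMERIC, "C") != NULL ? 0 : -1;
}
```

A program that must parse numbers with ‘.’ as the decimal separator (for example through strtod) while still honoring the user's other conventions might call use_environment_locale_with_c_numbers () once at startup.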

In particular, the LC_CTYPE category of the current locale determines the character encoding. This is the encoding of ‘char *’ strings. We also call it the “locale encoding”. GNU libunistring has a function, locale_charset, that returns a standardized (platform independent) name for this encoding.
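
To illustrate how the locale encoding follows from LC_CTYPE, here is a small sketch that uses the POSIX function nl_langinfo (CODESET) as a stand-in, so that it compiles without libunistring. This is a substitute technique: unlike locale_charset, nl_langinfo returns a platform-dependent name (glibc, for instance, reports "ANSI_X3.4-1968" for the "C" locale). The helper names current_codeset and codeset_from_environment are hypothetical.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: returns the platform's name for the encoding
   selected by the LC_CTYPE category of the current locale.  */
const char *
current_codeset (void)
{
  return nl_langinfo (CODESET);
}

/* Hypothetical helper: sets LC_CTYPE from the environment (falling
   back to "C" if that locale is unavailable) and returns the
   resulting encoding name.  */
const char *
codeset_from_environment (void)
{
  if (setlocale (LC_CTYPE, "") == NULL)
    setlocale (LC_CTYPE, "C");
  return current_codeset ();
}
```

In a program linked with GNU libunistring, one would call locale_charset () in place of current_codeset () to obtain the standardized name.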

All locale encodings used on glibc systems are essentially ASCII compatible: Most graphic ASCII characters have the same representation, as a single byte, in that encoding as in ASCII.

Among the possible locale encodings are UTF-8 and GB18030. Both can represent any Unicode character as a sequence of bytes. UTF-8 is used in most of the world, whereas GB18030 is used in the People’s Republic of China, because it is backward compatible with the GB2312 encoding that was used there earlier.

The legacy locale encodings, ISO-8859-15 (which supplanted ISO-8859-1 in most of Europe), ISO-8859-2, KOI8-R, EUC-JP, etc., are still in use in some places, though.

UTF-16 and UTF-32 are not used as locale encodings, because they are not ASCII compatible.
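
The difference in ASCII compatibility can be seen by encoding the same character both ways. The sketch below hand-codes UTF-8 and big-endian UTF-16 encoders for illustration only; the function names utf8_encode and utf16be_encode are hypothetical and are not libunistring APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if CP is a Unicode scalar value (not a surrogate, not
   beyond U+10FFFF), 0 otherwise.  */
static int
is_scalar_value (uint32_t cp)
{
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

/* Hypothetical helper: stores the UTF-8 encoding of CP in OUT and
   returns its length in bytes, or 0 if CP is not encodable.  */
size_t
utf8_encode (uint32_t cp, unsigned char out[4])
{
  if (!is_scalar_value (cp))
    return 0;
  if (cp < 0x80)
    {
      /* ASCII: the same single byte as in ASCII.  */
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Hypothetical helper: stores the big-endian UTF-16 encoding of CP in
   OUT and returns its length in bytes, or 0 if CP is not encodable.  */
size_t
utf16be_encode (uint32_t cp, unsigned char out[4])
{
  if (!is_scalar_value (cp))
    return 0;
  if (cp < 0x10000)
    {
      /* Even an ASCII character takes two bytes, one of them zero.  */
      out[0] = (unsigned char) (cp >> 8);
      out[1] = (unsigned char) (cp & 0xFF);
      return 2;
    }
  /* Characters beyond U+FFFF are split into a surrogate pair.  */
  uint32_t v = cp - 0x10000;
  uint32_t high = 0xD800 | (v >> 10);
  uint32_t low = 0xDC00 | (v & 0x3FF);
  out[0] = (unsigned char) (high >> 8);
  out[1] = (unsigned char) (high & 0xFF);
  out[2] = (unsigned char) (low >> 8);
  out[3] = (unsigned char) (low & 0xFF);
  return 4;
}
```

For the letter ‘A’ (U+0041), utf8_encode produces the single byte 0x41, exactly as in ASCII, whereas utf16be_encode produces the two bytes 0x00 0x41. The embedded zero byte would terminate a ‘char *’ string prematurely, which is one practical reason such encodings are unsuitable as locale encodings.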