C strings (GNU Gnulib)

16.1.1 The C string representation

The classical representation of a string in C is a sequence of characters, where each character takes up one or more bytes, followed by a terminating NUL byte. This representation is used for strings that are passed by the operating system (in the argv argument of main, for example) and for strings that are passed to the operating system (in system calls such as open). The C type to hold such strings is ‘char *’ or, in places where the string shall not be modified, ‘const char *’. There are many C library functions, standardized by ISO C and POSIX, that assume this representation of strings.

A character encoding, or encoding for short, describes how the elements of a character set are represented as a sequence of bytes. For example, in the ASCII encoding, the UNDERSCORE character is represented by a single byte, with value 0x5F. As another example, the COPYRIGHT SIGN character is represented:

in the ISO-8859-1 encoding, by the single byte 0xA9,
in the UTF-8 encoding, by the two bytes 0xC2 0xA9,
in the GB18030 encoding, by the four bytes 0x81 0x30 0x84 0x38.
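
The byte sequences listed above can be written down directly in C. The following is a minimal sketch; the array names and the print_bytes helper are made up for this illustration and are not part of any library. Hexadecimal escapes are used so that the result does not depend on the encoding of the source file.

```c
#include <stdio.h>
#include <string.h>

/* UNDERSCORE in ASCII.  */
static const char underscore_ascii[] = "\x5F";

/* COPYRIGHT SIGN in three different encodings.  */
static const char copyright_latin1[] = "\xA9";
static const char copyright_utf8[] = "\xC2\xA9";
static const char copyright_gb18030[] = "\x81\x30\x84\x38";

/* Print the bytes of the NUL-terminated string S in hexadecimal,
   preceded by LABEL and the number of bytes.  */
static void
print_bytes (const char *label, const char *s)
{
  size_t n = strlen (s);
  printf ("%s: %zu byte(s):", label, n);
  for (size_t i = 0; i < n; i++)
    printf (" 0x%02X", (unsigned char) s[i]);
  putchar ('\n');
}
```

For instance, print_bytes ("UTF-8", copyright_utf8) prints "UTF-8: 2 byte(s): 0xC2 0xA9". The same character occupies 1, 2, or 4 bytes depending on the encoding.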

Note: The ‘char’ type may be signed or unsigned, depending on the platform. When we talk about the "byte 0xA9" we actually mean the char object whose value is (char) 0xA9; we omit the cast to char in this documentation, for brevity.
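
The signedness issue matters as soon as code compares a char with a value of 0x80 or above. The sketch below shows one portable way to do such a comparison; the function name byte_equals is hypothetical, chosen only for this example.

```c
#include <stdbool.h>

/* Return true if the char C holds the byte VALUE.
   Writing 'c == 0xA9' directly would always be false on platforms where
   'char' is signed, because C is then promoted to a negative int.
   Converting C to 'unsigned char' first (or, equivalently, comparing
   against '(char) 0xA9') avoids the problem.  */
static bool
byte_equals (char c, unsigned char value)
{
  return (unsigned char) c == value;
}
```

With this helper, byte_equals (s[0], 0xA9) gives the same answer on platforms with signed and with unsigned char.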

In POSIX, the character encoding is determined by the locale. The locale is a set of environment settings that the user can choose, typically through environment variables such as LANG and LC_ALL.
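
A program can find out which encoding the locale prescribes. The following is a minimal sketch using the POSIX functions setlocale and nl_langinfo; the two wrapper function names are assumptions made for this example.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>

/* Adopt the locale chosen by the user through the environment.
   Return false if that locale is not available on the system.  */
static bool
use_environment_locale (void)
{
  return setlocale (LC_ALL, "") != NULL;
}

/* Return the name of the encoding of the current locale, for example
   "UTF-8".  The exact spelling of the name is system-dependent.  */
static const char *
current_locale_encoding (void)
{
  return nl_langinfo (CODESET);
}
```

A program would typically call use_environment_locale () once at startup; until setlocale has been called, the program runs in the "C" locale.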

Depending on the encoding, a character is represented by one or more bytes (up to 4 bytes in practice, but code should use MB_LEN_MAX rather than the literal number 4). When every character is represented by exactly 1 byte, we speak of a “unibyte locale”; otherwise, of a “multibyte locale”.
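
The distinction can be tested at run time through the ISO C macro MB_CUR_MAX, which gives the maximum number of bytes per character in the current locale. The sketch below also shows a buffer sized with MB_LEN_MAX; the function names are hypothetical.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return true if the current locale is a multibyte locale.  */
static bool
locale_is_multibyte (void)
{
  return MB_CUR_MAX > 1;
}

/* Return the number of bytes that the wide character WC occupies in the
   encoding of the current locale, or (size_t) -1 if WC cannot be
   represented in that encoding.  */
static size_t
encoded_length (wchar_t wc)
{
  char buf[MB_LEN_MAX];   /* large enough for one character in any locale */
  mbstate_t state;
  memset (&state, 0, sizeof state);
  return wcrtomb (buf, wc, &state);
}
```

Both functions reflect the locale that is in effect at the time of the call, so their results can change after a call to setlocale.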

It is important to realize that the majority of Unix installations nowadays use UTF-8 as the locale encoding; therefore, the majority of users are using multibyte locales.

Four important facts to remember are:

A ‘char’ is a byte, not a character.

As a consequence:

The <ctype.h> API, which was designed only with unibyte encodings in mind, is useless nowadays for general text processing; it does not work in multibyte locales.
The strlen function does not return the number of characters in a string. Nor does it return the number of screen columns occupied by a string after it is output. It merely returns the number of bytes occupied by a string.
Truncating a string, for example with strncpy, can cut it in the middle of a multibyte character. Such a string will, when output, have a garbled character at its end, often represented by a hollow box.
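
The difference between bytes and characters can be made concrete with the ISO C function mbrlen, which reports how many bytes the next character occupies in the current locale's encoding. The following is a minimal sketch, not the Gnulib implementation; all three function names are made up for this example, and the choice to treat each invalid byte as one unit is an assumption.

```c
#include <locale.h>   /* setlocale, which the caller is expected to invoke */
#include <string.h>
#include <wchar.h>

/* Return the byte length of the character at the start of S, which has
   N > 0 bytes remaining.  An invalid or incomplete sequence is treated
   as a unit of 1 byte, and *STATE is reset so that scanning can go on.  */
static size_t
next_unit_length (const char *s, size_t n, mbstate_t *state)
{
  size_t len = mbrlen (s, n, state);
  if (len == (size_t) -1 || len == (size_t) -2 || len == 0)
    {
      memset (state, 0, sizeof *state);
      len = 1;
    }
  return len;
}

/* Return the number of characters (not bytes) in S.  */
static size_t
count_characters (const char *s)
{
  mbstate_t state;
  size_t n = strlen (s);
  size_t pos = 0;
  size_t count = 0;
  memset (&state, 0, sizeof state);
  while (pos < n)
    {
      pos += next_unit_length (s + pos, n - pos, &state);
      count++;
    }
  return count;
}

/* Return the largest byte length <= MAX_BYTES at which S can be cut
   without splitting a multibyte character.  */
static size_t
truncation_length (const char *s, size_t max_bytes)
{
  mbstate_t state;
  size_t n = strlen (s);
  size_t pos = 0;
  memset (&state, 0, sizeof state);
  while (pos < n)
    {
      size_t len = next_unit_length (s + pos, n - pos, &state);
      if (pos + len > max_bytes)
        break;
      pos += len;
    }
  return pos;
}
```

In a UTF-8 locale, the string "\xC2\xA9 2024" has a strlen of 7 but count_characters reports 6. Neither value is the number of screen columns, which requires a separate width computation.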

Multibyte does not imply UTF-8 encoding.

While UTF-8 is the most common multibyte encoding, GB18030 is also a supported locale encoding on GNU systems (mostly because it is a Chinese government standard, last revised in 2022).

Searching for a character in a string is not the same as searching for a byte in the string.

Take the above example of COPYRIGHT SIGN in the GB18030 encoding: A byte search will find the bytes '0' and '8' in this string. But a search for the character "0" or "8" in the string "©" must, of course, report “not found”.
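
This false match is easy to reproduce, because strchr looks only at bytes and does not consult the locale. In the sketch below, the array and function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* COPYRIGHT SIGN in the GB18030 encoding: one character, four bytes.
   The second and fourth bytes, 0x30 and 0x38, are also the ASCII codes
   of the digits '0' and '8'.  */
static const char gb18030_copyright_sign[] = "\x81\x30\x84\x38";

/* Return true if the byte BYTE occurs anywhere in S.  */
static bool
contains_byte (const char *s, char byte)
{
  return strchr (s, byte) != NULL;
}
```

Here contains_byte (gb18030_copyright_sign, '0') returns true, although the string consists of a single character that is not a digit.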

As a consequence:

strchr and strrchr do not work with multibyte strings if the locale encoding is GB18030 and the character to be searched is a digit.
strstr does not work with multibyte strings if the locale encoding is different from UTF-8.
strcspn, strpbrk, strspn cannot work correctly in multibyte locales: they assume the second argument is a list of single-byte characters. Even in this simple case, they do not work with multibyte strings if the locale encoding is GB18030 and one of the characters to be searched is a digit.
strsep and strtok_r do not work with multibyte strings unless all of the delimiter characters are ASCII characters < 0x30.
The strcasecmp, strncasecmp, and strcasestr functions do not work with multibyte strings.
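
The strstr problem arises when the bytes of the needle match across a character boundary of the haystack. The following sketch illustrates this with two-byte GB18030 sequences; the byte values were picked for this example, and the names are hypothetical. UTF-8 is not affected because its lead bytes and continuation bytes come from disjoint ranges.

```c
#include <stddef.h>
#include <string.h>

/* Two 2-byte GB18030 characters: 0xB0 0xA1 followed by 0xA1 0xB0.  */
static const char gb18030_haystack[] = "\xB0\xA1\xA1\xB0";

/* A third, different 2-byte GB18030 character: 0xA1 0xA1.  */
static const char gb18030_needle[] = "\xA1\xA1";

/* Return the byte offset at which strstr finds NEEDLE in HAYSTACK,
   or -1 if strstr does not find it.  */
static ptrdiff_t
byte_match_offset (const char *haystack, const char *needle)
{
  const char *p = strstr (haystack, needle);
  return p != NULL ? p - haystack : -1;
}
```

Here byte_match_offset (gb18030_haystack, gb18030_needle) returns 1: the match starts at the second byte of the first character and ends at the first byte of the second one. A character-wise search would report “not found”.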

Workarounds can be found in Gnulib, in the form of mbs* API functions:

Gnulib has functions mbslen and mbswidth that can be used instead of strlen when the number of characters or the number of screen columns of a string is requested.
Gnulib has functions mbschr and mbsrrchr that are like strchr and strrchr, but work in multibyte locales.
Gnulib has a function mbsstr that is like strstr, but works in multibyte locales.
Gnulib has functions mbscspn, mbspbrk, mbsspn that are like strcspn, strpbrk, strspn, but work in multibyte locales.
Gnulib has functions mbssep and mbstok_r that are like strsep and strtok_r but work in multibyte locales.
Gnulib has functions mbscasecmp, mbsncasecmp, mbspcasecmp, and mbscasestr that are like strcasecmp, strncasecmp, and strcasestr, but work in multibyte locales. Still, the function ulc_casecmp is preferable to these functions.
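
To show what such a character-aware function has to do, here is a minimal sketch of a character search built only on the ISO C function mbrtowc. It is not Gnulib's implementation of mbschr and has a different signature; the name find_character and the decision to skip invalid bytes are assumptions made for this example. Real code should prefer the Gnulib functions listed above.

```c
#include <locale.h>   /* setlocale, which the caller is expected to invoke */
#include <string.h>
#include <wchar.h>

/* Return a pointer to the first occurrence of the character TARGET in S,
   decoding S according to the current locale's encoding, or NULL if
   TARGET does not occur.  Bytes that do not form a valid character are
   skipped one at a time and never match.  */
static const char *
find_character (const char *s, wchar_t target)
{
  mbstate_t state;
  size_t n = strlen (s);
  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      wchar_t wc;
      size_t len = mbrtowc (&wc, s, n, &state);
      if (len == (size_t) -1 || len == (size_t) -2)
        {
          /* Encoding error: resynchronize after one byte.  */
          memset (&state, 0, sizeof state);
          len = 1;
        }
      else
        {
          if (len == 0)
            len = 1;
          if (wc == target)
            return s;
        }
      s += len;
      n -= len;
    }
  return NULL;
}
```

Because the string is decoded character by character, a byte that merely forms part of a multibyte character can never be reported as a match.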

A C string can contain encoding errors.

Not every NUL-terminated byte sequence represents a valid multibyte string. Byte sequences can contain encoding errors, that is, bytes or byte sequences that are invalid and do not represent characters.
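
Whether a given string contains encoding errors can be checked by decoding it character by character. The following is a minimal sketch based on the ISO C function mbrtowc; the function name has_encoding_error is hypothetical, and treating an incomplete character at the end of the string as an error is a choice made for this example.

```c
#include <locale.h>   /* setlocale, which the caller is expected to invoke */
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Return true if S contains a byte sequence that is invalid or
   incomplete in the encoding of the current locale.  */
static bool
has_encoding_error (const char *s)
{
  mbstate_t state;
  size_t n = strlen (s);
  memset (&state, 0, sizeof state);
  while (n > 0)
    {
      size_t len = mbrtowc (NULL, s, n, &state);
      if (len == (size_t) -1      /* invalid byte sequence */
          || len == (size_t) -2)  /* incomplete character at the end */
        return true;
      if (len == 0)
        len = 1;
      s += len;
      n -= len;
    }
  return false;
}
```

In a UTF-8 locale, for example, the one-byte string "\xC2" is reported as erroneous, because 0xC2 only starts a two-byte character.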

String functions such as mbscasecmp and strcoll, whose behavior depends on the encoding, have unspecified behavior on strings containing encoding errors, unless the behavior is specifically documented. If an application needs a particular behavior on these strings, it can iterate through them itself, as described in the next subsection.