The classical representation of a string in C is a sequence of
characters, where each character takes up one or more bytes, followed by
a terminating NUL byte. This representation is used for strings that
are passed by the operating system (in the ‘argv’ argument of ‘main’,
for example) and for strings that are passed to the operating system
(in system calls such as ‘open’). The C type to
hold such strings is ‘char *’ or, in places where the string shall
not be modified, ‘const char *’. There are many C library
functions, standardized by ISO C and POSIX, that assume this
representation of strings.
A character encoding, or encoding for short, describes
how the elements of a character set are represented as a sequence of
bytes. For example, in the ASCII encoding, the UNDERSCORE character is
represented by a single byte, with value 0x5F. As another example, the
COPYRIGHT SIGN character is represented:
- in the ISO-8859-1 encoding, by the single byte 0xA9,
- in the UTF-8 encoding, by the two bytes 0xC2 0xA9,
- in the GB18030 encoding, by the four bytes 0x81 0x30 0x84 0x38.
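The byte sequences listed above can be written out directly in C. The
following is a minimal sketch; the array names and the helper
‘encoded_length’ are made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* COPYRIGHT SIGN, spelled out byte by byte in the three encodings
   listed above.  The array names are hypothetical.  */
static const char copyright_iso8859_1[] = "\xA9";
static const char copyright_utf8[] = "\xC2\xA9";
static const char copyright_gb18030[] = "\x81\x30\x84\x38";

/* Returns the number of bytes that the encoded string occupies,
   not counting the terminating NUL byte.  */
size_t
encoded_length (const char *s)
{
  return strlen (s);
}
```

The same character thus occupies 1, 2, or 4 bytes depending on the
encoding, and in GB18030 two of those bytes coincide with the ASCII
digits '0' and '8'.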
Note: The ‘char’ type may be signed or unsigned, depending on the
platform. When we talk about the "byte 0xA9" we actually mean the
‘char’ object whose value is ‘(char) 0xA9’; we omit the cast to ‘char’
in this documentation, for brevity.
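In real code the cast cannot be omitted: a byte has to be converted
before it is compared with a constant such as 0xA9. A minimal sketch,
with a hypothetical helper name:

```c
#include <stdbool.h>

/* Returns true if the byte c has the given value in the range 0..255.
   Converting through unsigned char gives the same answer whether the
   platform's char is signed or unsigned.  */
bool
byte_has_value (char c, unsigned int value)
{
  return (unsigned char) c == value;
}
```

With this helper, ‘byte_has_value ("\xA9"[0], 0xA9)’ is true on every
platform, whereas the uncast comparison ‘"\xA9"[0] == 0xA9’ is false
where ‘char’ is signed.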
In POSIX, the character encoding is determined by the locale, an attribute of the environment that the user can choose.
Depending on the encoding, in general, every character is represented by
one or more bytes (up to 4 bytes in practice, but use ‘MB_LEN_MAX’
instead of the number 4 in the code).
When every character is represented by only 1 byte, we speak of an
“unibyte locale”, otherwise of a “multibyte locale”.
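Both points can be sketched with ISO C facilities alone. The helper
names below are hypothetical; ‘wcrtomb’, ‘MB_LEN_MAX’ and ‘MB_CUR_MAX’
are standard. Using ‘MB_CUR_MAX > 1’ as the test for a multibyte locale
is this sketch's reading of the definition above.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Stores the multibyte representation of wc, in the current locale's
   encoding, into buf.  Returns the number of bytes stored, or
   (size_t) -1 if wc cannot be represented.  The buffer is sized with
   MB_LEN_MAX rather than with a hard-coded 4.  */
size_t
encode_character (wchar_t wc, char buf[MB_LEN_MAX])
{
  mbstate_t state;
  memset (&state, 0, sizeof state);
  return wcrtomb (buf, wc, &state);
}

/* Returns true if some character in the current locale's encoding
   needs more than one byte.  */
bool
in_multibyte_locale (void)
{
  return MB_CUR_MAX > 1;
}
```

Both functions consult the locale that is current at the time of the
call, so a program would normally call ‘setlocale (LC_ALL, "")’ first.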
It is important to realize that the majority of Unix installations nowadays use UTF-8 as locale encoding; therefore, the majority of users are using multibyte locales.
Three important facts to remember are:
A ‘char’ is a byte, not a character.
As a consequence:
- The ‘<ctype.h>’ API, which was designed only with unibyte encodings
  in mind, is useless nowadays for general text processing; it does
  not work in multibyte locales.
- The ‘strlen’ function does not return the number of characters in a
  string. Nor does it return the number of screen columns occupied by
  a string after it is output. It merely returns the number of bytes
  occupied by a string.
- Truncating a string, for example with ‘strncpy’, can have the effect
  of truncating it in the middle of a multibyte character. Such a
  string will, when output, have a garbled character at its end, often
  represented by a hollow box.
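The last two points can be observed by looking at the bytes alone,
without any locale support. The sketch below assumes a string encoded
in UTF-8; the array name and the helper function are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* The word "café", assumed here to be encoded in UTF-8:
   4 characters, but 5 bytes, because é is 0xC3 0xA9.  */
static const char cafe_utf8[] = "caf\xC3\xA9";

/* Copies at most size - 1 bytes of src into dst and NUL-terminates
   the result, the way a careful strncpy call would.  Returns the
   number of bytes now in dst.  */
size_t
copy_truncated (char *dst, size_t size, const char *src)
{
  if (size == 0)
    return 0;
  strncpy (dst, src, size - 1);
  dst[size - 1] = '\0';
  return strlen (dst);
}
```

Here ‘strlen (cafe_utf8)’ is 5 although the string has four characters,
and copying it into a 5-byte buffer leaves "caf" followed by the lone
byte 0xC3, which is only the first half of a character.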
Multibyte does not imply UTF-8 encoding.
While UTF-8 is the most common multibyte encoding, GB18030 is also a supported locale encoding on GNU systems (mostly because it is a Chinese government standard, last revised in 2022).
Searching for a character in a string is not the same as searching for a byte in the string.
Take the above example of COPYRIGHT SIGN in the GB18030 encoding, that
is, the four bytes 0x81 0x30 0x84 0x38. A byte search will find the
bytes '0' (0x30) and '8' (0x38) in this string. But a search for the
character "0" or "8" in the string "©" must, of course, report
“not found”.
As a consequence:
- The functions ‘strchr’ and ‘strrchr’ do not work with multibyte
  strings if the locale encoding is GB18030 and the character to be
  searched is a digit.
- The function ‘strstr’ does not work with multibyte strings if the
  locale encoding is different from UTF-8.
- The functions ‘strcspn’, ‘strpbrk’, and ‘strspn’ cannot work
  correctly in multibyte locales: they assume the second argument is a
  list of single-byte characters. Even in this simple case, they do
  not work with multibyte strings if the locale encoding is GB18030
  and one of the characters to be searched is a digit.
- The functions ‘strsep’ and ‘strtok_r’ do not work with multibyte
  strings unless all of the delimiter characters are ASCII characters
  < 0x30.
- The ‘strcasecmp’, ‘strncasecmp’, and ‘strcasestr’ functions do not
  work with multibyte strings.
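The first two failures can be reproduced on raw bytes. In the sketch
below, the helper names are hypothetical, and the sample data for
‘strstr’ are byte values assumed to lie in the two-byte range of
GB18030; they were chosen only so that the bytes of one character
straddle the boundary between two others.

```c
#include <stddef.h>
#include <string.h>

/* COPYRIGHT SIGN in GB18030, as given earlier.  */
static const char gb18030_copyright[] = "\x81\x30\x84\x38";

/* Two consecutive two-byte characters (hypothetical sample data).  */
static const char two_characters[] = "\xB0\xA1\xB0\xA2";
/* A third two-byte character that occurs in neither of them.  */
static const char other_character[] = "\xA1\xB0";

/* Returns the byte offset at which strchr finds c in s, or -1.  */
ptrdiff_t
byte_offset_of_char (const char *s, char c)
{
  const char *p = strchr (s, c);
  return p != NULL ? p - s : -1;
}

/* Returns the byte offset at which strstr finds needle, or -1.  */
ptrdiff_t
byte_offset_of_substring (const char *haystack, const char *needle)
{
  const char *p = strstr (haystack, needle);
  return p != NULL ? p - haystack : -1;
}
```

‘strchr’ reports the digit '0' at offset 1 of a string that contains no
digit, and ‘strstr’ reports a match at offset 1 of ‘two_characters’,
in the middle of its first character.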
Workarounds can be found in Gnulib, in the form of ‘mbs*’ API
functions:
- ‘mbslen’ and ‘mbswidth’, which can be used instead of ‘strlen’ when
  the number of characters or the number of screen columns of a string
  is requested.
- ‘mbschr’ and ‘mbsrrchr’, which are like ‘strchr’ and ‘strrchr’, but
  work in multibyte locales.
- ‘mbsstr’, which is like ‘strstr’, but works in multibyte locales.
- ‘mbscspn’, ‘mbspbrk’, and ‘mbsspn’, which are like ‘strcspn’,
  ‘strpbrk’, and ‘strspn’, but work in multibyte locales.
- ‘mbssep’ and ‘mbstok_r’, which are like ‘strsep’ and ‘strtok_r’, but
  work in multibyte locales.
- ‘mbscasecmp’, ‘mbsncasecmp’, ‘mbspcasecmp’, and ‘mbscasestr’, which
  are like ‘strcasecmp’, ‘strncasecmp’, and ‘strcasestr’, but work in
  multibyte locales. Still, the function ‘ulc_casecmp’ is preferable
  to these functions.
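To illustrate what a multibyte-aware replacement has to do, the
following sketch counts characters using only the ISO C function
‘mbrlen’. It is not Gnulib's implementation of ‘mbslen’; the function
name and the policy of counting each invalid byte as one character are
assumptions made for this example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Returns the number of characters in s, according to the encoding of
   the current locale.  A byte that does not start a valid character
   is counted as one character (a choice made for this sketch).  */
size_t
count_characters (const char *s)
{
  size_t remaining = strlen (s);
  size_t count = 0;
  mbstate_t state;

  memset (&state, 0, sizeof state);
  while (remaining > 0)
    {
      size_t n = mbrlen (s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        {
          /* Invalid or incomplete sequence: skip one byte and start
             over from the initial conversion state.  */
          n = 1;
          memset (&state, 0, sizeof state);
        }
      else if (n == 0)
        n = 1;  /* defensive: always advance by at least one byte */
      s += n;
      remaining -= n;
      count++;
    }
  return count;
}
```

The result depends on the current locale, so the program is expected to
have called ‘setlocale’ beforehand. For a pure ASCII string the count
equals the ‘strlen’ result.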
A C string can contain encoding errors.
Not every NUL-terminated byte sequence represents a valid multibyte string. Byte sequences can contain encoding errors, that is, bytes or byte sequences that are invalid and do not represent characters.
String functions like ‘mbscasecmp’ and ‘strcoll’ whose
behavior depends on encoding have unspecified behavior on strings
containing encoding errors, unless the behavior is specifically
documented. If an application needs a particular behavior on these
strings it can iterate through them itself, as described in the next
subsection.
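As a rough illustration of such an iteration, the sketch below locates
the first encoding error in a string, using the ISO C function
‘mbrtowc’ rather than any Gnulib API. The function name is
hypothetical, and the result depends on the current locale.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Returns a pointer to the first byte of s that does not start a
   valid multibyte character in the current locale's encoding, or
   NULL if s contains no encoding error.  */
const char *
first_encoding_error (const char *s)
{
  size_t remaining = strlen (s);
  mbstate_t state;

  memset (&state, 0, sizeof state);
  while (remaining > 0)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, s, remaining, &state);
      if (n == (size_t) -1 || n == (size_t) -2)
        return s;  /* invalid, or incomplete at the end of the string */
      if (n == 0)
        n = 1;     /* defensive: always advance by at least one byte */
      s += n;
      remaining -= n;
    }
  return NULL;
}
```

In a UTF-8 locale, for instance, a string that ends in the lone byte
0xC3 would make this function return a pointer to that byte. What the
application does next (skip the byte, replace it, reject the string) is
the "particular behavior" it has to choose for itself.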