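The wchar_t type

The ISO C and POSIX standard creators made an attempt to overcome the
dead end regarding the char type. They introduced the wchar_t type,
meant to hold an entire character, wide strings of type wchar_t * with
functions declared in <wchar.h>, and character classification functions
declared in <wctype.h> that were meant to supplant the ones in <ctype.h>.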
Unfortunately, this API and its implementations have numerous problems:

On some platforms, notably Windows and AIX in 32-bit mode, wchar_t is a
16-bit type. This means that it can never accommodate an entire Unicode
character. Either the wchar_t * strings are limited to characters in
UCS-2 (the “Basic Multilingual Plane” of Unicode), or, if the wchar_t *
strings are encoded in UTF-16, a wchar_t represents only half of a
character in the worst case, making the <wctype.h> functions pointless.
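
The UTF-16 case can be illustrated with a minimal sketch. It uses uint16_t instead of wchar_t so that it behaves the same on every platform; the helper names utf16_encode_supplementary and is_surrogate are hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: splits a code point outside the Basic Multilingual
   Plane into its two UTF-16 code units (a surrogate pair). */
static void utf16_encode_supplementary(uint32_t cp, uint16_t out[2])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
}

/* Hypothetical helper: true if a single 16-bit unit is only half of a
   character, so that classifying it on its own is meaningless. */
static int is_surrogate(uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}
```

For U+1F600, the two units are 0xD83D and 0xDE00. Neither is a character by itself, which is why a per-unit classification function cannot give a useful answer for it.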

On some platforms, such as Solaris and FreeBSD, the wchar_t encoding is
locale dependent and undocumented. This means that if you want to know
any property of a wchar_t character other than the properties defined by
<wctype.h> (such as whether it’s a dash, currency symbol, paragraph
separator, or similar), you have to convert it to char * encoding first,
by use of the function wctomb.
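
A minimal sketch of that conversion step follows. The wrapper name wide_to_multibyte is hypothetical; wctomb and MB_LEN_MAX are standard C. The result is in the encoding of the current locale, so what the bytes mean still depends on the locale in effect.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts one wide character to the multibyte
   encoding of the current locale.  buf must have room for MB_LEN_MAX
   bytes.  Returns the number of bytes written, or -1 if wc has no
   representation in the current locale. */
static int wide_to_multibyte(wchar_t wc, char buf[MB_LEN_MAX])
{
    assert(buf != NULL);
    (void) wctomb(NULL, 0);   /* reset the conversion's shift state */
    return wctomb(buf, wc);
}
```

In the default "C" locale, converting L'A' writes the single byte 'A'. Only after this step can the character be looked up with functions that work on multibyte strings.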

When you read a stream of wide characters through the functions fgetwc
and fgetws, and the input stream/file is not in the expected encoding,
you have no way to determine the invalid byte sequence and take some
corrective action. If you use these functions, your program becomes
“garbage in - more garbage out” or “garbage in - abort”.
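
The limitation shows up in how little a caller learns from a failed read. The sketch below uses a hypothetical helper, count_wide_chars, built only on the standard fgetwc, feof and errno. When reading stops before end of file, the caller gets an error code (typically EILSEQ for an encoding error), but not the offending bytes.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: counts the wide characters read from fp.
   Stores 0 in *err if reading stopped at end of file, otherwise the
   errno value left by the failing fgetwc call. */
static long count_wide_chars(FILE *fp, int *err)
{
    assert(fp != NULL && err != NULL);
    long count = 0;
    errno = 0;
    while (fgetwc(fp) != WEOF)
        count++;
    /* WEOF is all we get: no access to the bytes that failed to convert. */
    *err = feof(fp) ? 0 : errno;
    return count;
}
```

A caller can report the failure or give up, but it cannot inspect or pass along the bytes that could not be converted.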

As a consequence, it is better to use multibyte strings. Such multibyte
strings can bypass the limitations of the wchar_t type, if you use the
functions defined in Gnulib and GNU libunistring for text processing.
They can also faithfully transport malformed characters that were
present in the input, without requiring the program to produce garbage
or abort.
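
The idea of carrying malformed input along can be sketched in plain C. This sketch assumes the multibyte encoding is UTF-8, and it does not use the Gnulib or libunistring APIs; utf8_seq_len and scan_utf8 are hypothetical names introduced only for illustration. The string is walked in place, a malformed byte is counted as a one-byte unit, and no byte is dropped or altered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the length in bytes of the well-formed
   UTF-8 sequence at s (n bytes available), or 0 if it is malformed. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len;
    uint32_t cp, min;
    if (n == 0) return 0;
    if (s[0] < 0x80) return 1;
    else if (s[0] >= 0xC2 && s[0] <= 0xDF) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF4) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead */
    if (n < len) return 0;              /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return len;
}

/* Hypothetical helper: counts well-formed characters and malformed bytes
   in str without modifying it; each malformed byte is one unit. */
static void scan_utf8(const char *str, size_t n, size_t *chars, size_t *bad)
{
    const unsigned char *s = (const unsigned char *) str;
    assert(chars != NULL && bad != NULL);
    *chars = 0;
    *bad = 0;
    for (size_t i = 0; i < n; ) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) { (*bad)++; i++; }
        else { (*chars)++; i += len; }
    }
}
```

Because the input bytes are left untouched, the program can write the string out exactly as it came in, including the malformed part, while still processing the well-formed characters around it.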