1.6 Unicode strings
libunistring supports Unicode strings in three representations:
- UTF-8 strings, through the type ‘uint8_t *’. The units are bytes
(uint8_t).
- UTF-16 strings, through the type ‘uint16_t *’. The units are 16-bit
memory words (uint16_t).
- UTF-32 strings, through the type ‘uint32_t *’. The units are 32-bit
memory words (uint32_t).
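As an illustration of the three representations, the following sketch
stores the same three characters (U+0041, U+20AC, U+1F600) in each
encoding. It uses only standard C headers; the array names and the
UNIT_COUNT macro are hypothetical and are not part of libunistring.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: number of units in a fixed-size array. */
#define UNIT_COUNT(a) (sizeof (a) / sizeof ((a)[0]))

/* The characters U+0041, U+20AC, U+1F600 in each representation.
   No terminating zero unit is stored in these arrays. */

/* UTF-8: 1 + 3 + 4 = 8 units of one byte each. */
static const uint8_t sample_u8[] =
  { 0x41, 0xE2, 0x82, 0xAC, 0xF0, 0x9F, 0x98, 0x80 };

/* UTF-16: 1 + 1 + 2 = 4 units; U+1F600 needs a surrogate pair. */
static const uint16_t sample_u16[] =
  { 0x0041, 0x20AC, 0xD83D, 0xDE00 };

/* UTF-32: always one unit per character, so 3 units. */
static const uint32_t sample_u32[] =
  { 0x00000041, 0x000020AC, 0x0001F600 };
```

The same text therefore occupies 8, 4, and 3 units respectively, which
is why lengths are always expressed in units of the chosen type rather
than in characters.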
As with C strings, there are two variants:
- Unicode strings with a terminating NUL character are represented as
a pointer to the first unit of the string. There is a unit containing
a 0 value at the end. It is considered part of the string for all
memory allocation purposes, but is not considered part of the string
for all other logical purposes.
- Unicode strings where embedded NUL characters are allowed. These
are represented by a pointer to the first unit and the number of units
(not bytes!) of the string. In this setting, there is no trailing
zero-valued unit used as “end marker”.
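The difference between the two variants can be sketched in plain C as
follows. The helper functions and array names below are hypothetical,
written only for this illustration; they are not libunistring
functions, and UTF-16 is chosen arbitrarily.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: count the units that precede the terminating
   zero unit of a NUL-terminated UTF-16 string. */
static size_t
example_u16_unit_count (const uint16_t *s)
{
  size_t n = 0;
  assert (s != NULL);
  while (s[n] != 0)
    n++;
  return n;
}

/* Hypothetical helper: bytes needed to store a NUL-terminated UTF-16
   string.  The terminating unit counts for allocation purposes. */
static size_t
example_u16_storage_bytes (const uint16_t *s)
{
  return (example_u16_unit_count (s) + 1) * sizeof (uint16_t);
}

/* Variant 1: "hi" with a terminating zero unit. */
static const uint16_t terminated[] = { 0x0068, 0x0069, 0x0000 };

/* Variant 2: "a", an embedded NUL, "b", described by pointer plus
   unit count.  There is no end marker after the last unit. */
static const uint16_t counted[] = { 0x0061, 0x0000, 0x0062 };
static const size_t counted_n = sizeof counted / sizeof counted[0];
```

Here the terminated string has a logical length of 2 units but needs
storage for 3. The counted string has 3 units, that is, 6 bytes;
scanning it for a zero unit would stop after the first unit, which is
why the unit count must be passed alongside the pointer.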