17.11 Handling strings with NUL characters

Strings in C are usually represented by a character sequence with a terminating NUL character. A ‘char *’, a pointer to the first byte of this character sequence, is what gets passed around as a function argument or return value.

The major restriction of this string representation is that it cannot handle strings that contain NUL characters: such strings will appear shorter than they were meant to be. In most application areas, this is not a problem, and the char * type works well.
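The truncation effect can be seen in a minimal sketch. The names payload, payload_size, and payload_apparent_length are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Seven bytes of data with a NUL in the middle; the compiler appends
   the usual terminating NUL, so the array occupies eight bytes.  */
static const char payload[] = "abc\0def";

/* Number of data bytes actually stored, excluding the terminating NUL.  */
static size_t
payload_size (void)
{
  return sizeof payload - 1;
}

/* Length as seen by any code that only receives a char *.  */
static size_t
payload_apparent_length (void)
{
  return strlen (payload);
}
```

Here payload_size returns 7, while payload_apparent_length returns 3: everything after the embedded NUL is invisible to functions that rely on the terminator.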

A second problem of this string representation is that taking a substring is not cheap: it either requires a memory allocation or a destructive modification of the string. The former has a runtime cost; the latter complicates the logic of the program. This matters for application areas that analyze text, such as parsers.
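The two options can be sketched as follows. This is an illustration only; the function names substring_copy and substring_in_place are hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Option 1: allocate a fresh NUL-terminated copy of the LEN bytes
   starting at S + START.  The caller must free the result.
   Returns NULL if the allocation fails.  */
static char *
substring_copy (const char *s, size_t start, size_t len)
{
  assert (start + len <= strlen (s));
  char *result = malloc (len + 1);
  if (result != NULL)
    {
      memcpy (result, s + start, len);
      result[len] = '\0';
    }
  return result;
}

/* Option 2: overwrite the byte after the substring with NUL.
   No allocation is needed, but the remainder of the original string
   is cut off unless the caller saves and restores that byte.  */
static char *
substring_in_place (char *s, size_t start, size_t len)
{
  assert (start + len <= strlen (s));
  s[start + len] = '\0';
  return s + start;
}
```

With a buffer holding "key=value", substring_copy (buf, 0, 3) yields a separate "key" and leaves the buffer intact, whereas substring_in_place (buf, 0, 3) returns "key" but overwrites the '=' with a NUL, so the buffer no longer reads as "key=value".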

In areas where strings with embedded NUL characters need to be handled or where taking substrings is a frequent operation, the common approach is to use a char *ptr pointer variable together with a size_t nbytes variable (or an idx_t nbytes variable, if you want to avoid problems due to integer overflow). This works fine in code that constructs or manipulates strings with embedded NUL characters. But when it comes to storing them, for example in an array or as a key or value in a hash table, one needs a type that combines these two fields.

The Gnulib modules string-desc, xstring-desc, and string-desc-quotearg provide such a type. We call it a “string descriptor” and name it string_desc_t.

The type string_desc_t is a struct that contains a pointer to the first byte and the number of bytes of the memory region that makes up the string. A terminating NUL byte that may follow the string in memory is not included in this byte count. This type implements the same concept as std::string_view in C++, or the String type in Java.
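To make the concept concrete, here is a minimal sketch of a pointer-plus-length type. It is a hypothetical stand-in written for this illustration: struct my_desc and the my_desc_* functions are assumptions, not the definition or API that Gnulib provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string descriptor; not Gnulib's definition.  */
struct my_desc
{
  const char *data;   /* first byte of the string */
  size_t nbytes;      /* byte count, excluding any terminating NUL */
};

static struct my_desc
my_desc_new (const char *data, size_t nbytes)
{
  struct my_desc d = { data, nbytes };
  return d;
}

/* Returns the bytes [START, END) of D.  The result shares memory
   with D: no allocation and no modification of the original.  */
static struct my_desc
my_desc_sub (struct my_desc d, size_t start, size_t end)
{
  assert (start <= end && end <= d.nbytes);
  return my_desc_new (d.data + start, end - start);
}

/* Compares by length and content, so embedded NULs are significant.  */
static bool
my_desc_equal (struct my_desc a, struct my_desc b)
{
  return a.nbytes == b.nbytes
         && (a.nbytes == 0 || memcmp (a.data, b.data, a.nbytes) == 0);
}
```

With whole = my_desc_new ("abc\0def", 7), the call my_desc_sub (whole, 4, 7) describes the three bytes "def" without copying them, and whole does not compare equal to a descriptor of "abc" even though strlen would report the same length for both.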

A string_desc_t can be passed to a function as an argument, or can be the return value of a function. This is type-safe: if, by mistake, a programmer passes a string_desc_t to a function that expects a char * argument (or vice versa), or assigns a string_desc_t value to a variable of type char * (or vice versa), the compiler will report an error.

Functions related to string descriptors are provided:

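For outputting a string descriptor, the *printf family of functions cannot be used directly. A format string directive such as "%.*s" would not work: it stops the output at the first NUL character, and its precision argument has type int, so the byte count would need a cast and strings longer than INT_MAX bytes could not be printed.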
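The behavior of "%.*s" can be checked with standard C alone. The sketch below uses two hypothetical helper names, format_with_precision and write_region, and contrasts the directive with fwrite, which copies a byte count verbatim. It does not show the Gnulib output functions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Formats the region with "%.*s" into BUF and returns the number of
   bytes the directive produced.  The precision argument must be an
   int, hence the cast and the precondition.  */
static int
format_with_precision (char *buf, size_t bufsize,
                       const char *ptr, size_t nbytes)
{
  assert (nbytes <= INT_MAX);
  return snprintf (buf, bufsize, "%.*s", (int) nbytes, ptr);
}

/* Writes all NBYTES bytes to STREAM, embedded NULs included.
   Returns the number of bytes written.  */
static size_t
write_region (FILE *stream, const char *ptr, size_t nbytes)
{
  return fwrite (ptr, 1, nbytes, stream);
}
```

For the 7-byte region "abc\0def", format_with_precision reports 3 bytes, because the directive stops at the embedded NUL even though the precision allows 7, while write_region emits all 7 bytes.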

Therefore Gnulib offers

The functionality is thus split across three modules as follows: