Strings in C are usually represented by a character sequence with a terminating NUL character. A ‘char *’, a pointer to the first byte of this character sequence, is what gets passed around as a function argument or return value.
The major restriction of this string representation is that it cannot
handle strings that contain NUL characters: such strings will appear
shorter than they were meant to be. In most application areas, this is
not a problem, and the ‘char *’ type is perfectly usable.
A second problem of this string representation is that taking a substring is not cheap: it either requires a memory allocation or a destructive modification of the string. The former has a runtime cost; the latter complicates the logic of the program. This matters for application areas that analyze text, such as parsers.
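The two restrictions described above can be illustrated with a minimal sketch in standard C. The helper names visible_length, substring_copy, and substring_in_place are hypothetical and exist only for this illustration; they are not part of Gnulib.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Length as seen by code that relies on NUL termination:
   counting stops at the first NUL byte.  */
size_t
visible_length (const char *s)
{
  return strlen (s);
}

/* Substring by allocation: returns a fresh NUL-terminated copy of the
   LEN bytes starting at S + START, or NULL if malloc fails.
   The caller must free the result.  */
char *
substring_copy (const char *s, size_t start, size_t len)
{
  assert (start + len <= strlen (s));
  char *copy = malloc (len + 1);
  if (copy == NULL)
    return NULL;
  memcpy (copy, s + start, len);
  copy[len] = '\0';
  return copy;
}

/* Substring by destructive modification: overwrites the byte after the
   substring with NUL and returns a pointer into the original buffer.
   The rest of the original string is no longer reachable through S.  */
char *
substring_in_place (char *s, size_t start, size_t len)
{
  assert (start + len <= strlen (s));
  s[start + len] = '\0';
  return s + start;
}
```

With these helpers, visible_length ("ab\0cd") reports 2 although the literal holds 5 bytes before its terminator. The copying variant costs a malloc per substring. The in-place variant truncates the buffer it was given, which the surrounding code must then account for.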
In areas where strings with embedded NUL characters need to be handled or where taking substrings is a recurrent operation, the common approach is to use a ‘char *ptr’ pointer variable together with a ‘size_t nbytes’ variable (or an ‘idx_t nbytes’ variable, if you want to avoid problems due to integer overflow). This works fine in code that constructs or manipulates strings with embedded NUL characters. But when it comes to storing them, for example in an array or as key or value of a hash table, one needs a type that combines these two fields.
The Gnulib modules ‘string-desc’, ‘xstring-desc’, and ‘string-desc-quotearg’ provide such a type. We call it a “string descriptor” and name it ‘string_desc_t’.
The type ‘string_desc_t’ is a struct that contains a pointer to the first byte and the number of bytes of the memory region that makes up the string. An additional terminating NUL byte, which may be present in memory, is not included in this byte count. This type implements the same concept as ‘std::string_view’ in C++, or the ‘String’ type in Java.
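To make the pointer-plus-length idea concrete, here is a minimal sketch of such a type in plain C. The struct span and its functions are hypothetical stand-ins written for this illustration; they are not Gnulib's definition of the string descriptor type, and the function names do not correspond to the Gnulib API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a string descriptor; NOT Gnulib's type.  */
struct span
{
  const char *ptr;   /* pointer to the first byte */
  size_t nbytes;     /* number of bytes, excluding any trailing NUL */
};

/* Describes the N bytes starting at P.  Embedded NUL bytes are allowed,
   because the length is stored explicitly.  */
struct span
span_from_bytes (const char *p, size_t n)
{
  struct span s = { p, n };
  return s;
}

/* Returns the substring [START, END) of S.  No memory is allocated and
   the underlying bytes are not modified.  */
struct span
span_substring (struct span s, size_t start, size_t end)
{
  assert (start <= end && end <= s.nbytes);
  return span_from_bytes (s.ptr + start, end - start);
}

/* Returns nonzero if A and B contain the same bytes.  */
int
span_equals (struct span a, struct span b)
{
  return a.nbytes == b.nbytes
         && (a.nbytes == 0 || memcmp (a.ptr, b.ptr, a.nbytes) == 0);
}
```

Taking a substring here only adjusts a pointer and a count, so it needs neither an allocation nor a write to the original bytes. The result stays valid only as long as the underlying memory region does.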
A ‘string_desc_t’ can be passed to a function as an argument, or can be the return value of a function. This is type-safe: If, by mistake, a programmer passes a ‘string_desc_t’ to a function that expects a ‘char *’ argument, or vice versa, or assigns a ‘string_desc_t’ value to a variable of type ‘char *’, or vice versa, the compiler will report an error.
Functions related to string descriptors are provided in the header files "string-desc.h" and "xstring-desc.h".
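For outputting a string descriptor, the ‘*printf’ family of functions cannot be used directly. A format string directive such as "%.*s" would not work: the output would stop at the first NUL character, and the number of bytes would have to be converted to ‘int’, and thus it would not work for strings longer than ‘INT_MAX’ bytes.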
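The behavior of "%.*s" on data with an embedded NUL byte can be observed with a small sketch in standard C. The function format_with_precision is a hypothetical helper for this illustration only. Note that its byte count parameter has type int, because that is the type the precision argument of the directive requires.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats up to NBYTES bytes at PTR into OUT using "%.*s" and returns
   the number of bytes the directive actually produced.  The precision
   argument of "%.*s" must be an int, hence the parameter type.  */
size_t
format_with_precision (char *out, size_t outsize, const char *ptr, int nbytes)
{
  /* A negative precision would be treated as if it were omitted.  */
  assert (nbytes >= 0);
  int n = snprintf (out, outsize, "%.*s", nbytes, ptr);
  return n < 0 ? 0 : (size_t) n;
}
```

When called on the 5 bytes of "ab\0cd" with a precision of 5, the directive produces only the 2 bytes before the NUL.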
Therefore Gnulib offers ‘string_desc_fwrite’, which outputs a string descriptor to a ‘FILE’ stream; ‘string_desc_write’, which outputs a string descriptor to a file descriptor; and ‘quotearg’ based functions, which allow the escaping rules to be specified in detail.
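As a rough sketch of length-based output to a stream, the following uses the standard fwrite function, which takes an explicit byte count of type size_t and does not interpret NUL bytes. The helper write_bytes is hypothetical; it only illustrates the idea and is not how Gnulib implements its output functions.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes the NBYTES bytes at PTR to FP.
   Returns 0 on success, -1 if fewer bytes than requested were written.  */
int
write_bytes (FILE *fp, const char *ptr, size_t nbytes)
{
  assert (fp != NULL);
  if (nbytes > 0 && fwrite (ptr, 1, nbytes, fp) != nbytes)
    return -1;
  return 0;
}
```

For example, write_bytes (stdout, "ab\0cd", 5) emits all 5 bytes, including the embedded NUL.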
The functionality is thus split across three modules as follows:
‘string-desc’, under LGPL, defines the type and elementary functions.
‘xstring-desc’, under GPL, defines the memory-allocating functions with out-of-memory checking.
‘string-desc-quotearg’, under GPL, defines the ‘quotearg’ based functions.