Note: this writeup represents state as of 2002.
cpplib has largely been completed, and is stable at this point. For GCC versions 3.0 and later, it is linked into the C, C++ and Objective C front ends. Most future work will relate to character set issues, performance enhancements and improving cpplib as a stand-alone library.
The integrated preprocessor would benefit from greater integration with the front ends. It still feels like it has been tacked on as an afterthought, which is not entirely coincidental.
Universal character names (\uxxxx, \Uxxxxxxxx) are not recognized in identifiers. Proper support has to be coordinated with the front ends.
[...] diagnostic.c, which is better than writing out and processing linemarker commands, but still suboptimal.
[...] c_lex in c-lex.c. Then the front ends would not have to jump through hoops to remember to concatenate strings, and we could simplify the parsers a little too.
Proper non-ASCII character handling is a hard problem. Users want to be able to write comments and strings in their native language. They want the strings to come out in their native language and not gibberish after translation to object code. Some users also want to use their own alphabet for identifiers in their code. There is no one-to-one or many-to-one map between languages and character set encodings. The subset of ASCII that is included in most modern-day character sets does not include all the punctuation C uses; some of the missing punctuation may be present but at a different place than where it is in ASCII. The subset described in ISO 646 may not be the smallest subset out there.
At the present time, GCC supports the use of any encoding for source code, as long as it is a strict superset of 7-bit ASCII. By this I mean that all printable (including whitespace) ASCII characters, when they appear as single bytes in a file, stand only for themselves, no matter what the context is. This is true of ISO8859.x, KOI8-R, and UTF8. It is not true of Shift JIS and some other popular Asian character sets. If they are used, GCC may silently mangle the input file. The only known specific example is that a Shift JIS multibyte character ending with 0x5C will be mistaken for a line continuation if it occurs at the end of a line. 0x5C is "\" in ASCII.
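The Shift JIS hazard can be illustrated with a small sketch. It is not cpplib code: the function names are hypothetical, and the lead-byte ranges (0x81-0x9F and 0xE0-0xFC) are the usual Shift JIS ones, assumed here rather than taken from this writeup. The first function looks only at the last byte of a line, which is how a byte-oriented scanner would spot a line continuation; the second walks the line character by character so that a 0x5C trail byte is not taken for a backslash.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte-oriented test for a line continuation,
   i.e. a backslash (0x5C) as the last byte of the line.  */
int
ends_with_backslash (const unsigned char *line, size_t len)
{
  assert (line != NULL);
  return len > 0 && line[len - 1] == 0x5C;
}

/* Assumed Shift JIS lead-byte ranges: 0x81-0x9F and 0xE0-0xFC.  */
static int
sjis_is_lead_byte (unsigned char c)
{
  return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Hypothetical encoding-aware variant: scan from the start of the
   line so that the trail byte of a two-byte character is skipped
   and never examined on its own.  */
int
sjis_ends_with_backslash (const unsigned char *line, size_t len)
{
  size_t i = 0;
  int last_is_backslash = 0;

  assert (line != NULL);
  while (i < len)
    {
      if (sjis_is_lead_byte (line[i]) && i + 1 < len)
        {
          /* Two-byte character: its trail byte may be 0x5C.  */
          last_is_backslash = 0;
          i += 2;
        }
      else
        {
          last_is_backslash = (line[i] == 0x5C);
          i += 1;
        }
    }
  return last_is_backslash;
}
```

For the two-byte sequence 0x95 0x5C (one Shift JIS character), the byte-oriented test reports a continuation and the encoding-aware one does not. For a line that really ends in a single-byte backslash, both agree.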
Assuming a safe encoding, characters not in the base set listed in the standard (C99 5.2.1) are syntax errors if they appear outside strings, character constants, or comments. In strings and character constants, they are taken literally: converted blindly to numeric codes, or copied to the assembly output verbatim, depending on the context. If you use the C99 \u and \U escapes, you get UTF8, no exceptions. These too are only supported in string and character constants.
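To make "you get UTF8" concrete, here is a minimal sketch of how a code point named by a \u or \U escape maps to UTF-8 bytes. The function name `ucn_to_utf8` is hypothetical and this is not the cpplib implementation. It also does not enforce the C99 restrictions on which code points an escape may name; it only rejects surrogates and values above U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point CP as UTF-8 into BUF, which
   must have room for 4 bytes.  Returns the number of bytes written,
   or 0 if CP is a surrogate or lies beyond U+10FFFF.  */
size_t
ucn_to_utf8 (unsigned long cp, unsigned char *buf)
{
  assert (buf != NULL);

  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* Surrogates are not characters.  */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Under this sketch, the escape \u00E9 in a string constant would contribute the two bytes 0xC3 0xA9, whatever the encoding of the surrounding source file.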
We intend to improve this as follows:
[...] U+0024 will be permitted in identifiers if and only if $ is permitted.
[...] #pragma, or rely on the default established by the user with locale or a command-line option. The #pragma, if used, must be the first line in the file. This will not prevent the multiple include optimization from working. GCC will also recognize MULE (Multilingual Emacs) magic comments, byte order marks, and any other reasonable in-band method of specifying a file's character set.
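Of the in-band methods mentioned, the byte order mark is the simplest to show. The following is only a sketch of what such recognition could look like, not a description of what GCC does. The function name `detect_bom` and the returned encoding names are assumptions. The UTF-32 patterns are tested before the UTF-16 ones because the UTF-32 little-endian mark begins with the same two bytes as the UTF-16 little-endian mark.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if BUF starts with a byte order mark, return
   an encoding name and store the mark's length in *BOM_LEN.
   Otherwise return NULL and store 0.  */
const char *
detect_bom (const unsigned char *buf, size_t len, size_t *bom_len)
{
  static const unsigned char utf32le[] = { 0xFF, 0xFE, 0x00, 0x00 };
  static const unsigned char utf32be[] = { 0x00, 0x00, 0xFE, 0xFF };
  static const unsigned char utf8[] = { 0xEF, 0xBB, 0xBF };
  static const unsigned char utf16le[] = { 0xFF, 0xFE };
  static const unsigned char utf16be[] = { 0xFE, 0xFF };

  assert (buf != NULL && bom_len != NULL);

  /* Longest patterns first, so UTF-32LE is not reported as UTF-16LE.  */
  if (len >= 4 && memcmp (buf, utf32le, 4) == 0)
    { *bom_len = 4; return "UTF-32LE"; }
  if (len >= 4 && memcmp (buf, utf32be, 4) == 0)
    { *bom_len = 4; return "UTF-32BE"; }
  if (len >= 3 && memcmp (buf, utf8, 3) == 0)
    { *bom_len = 3; return "UTF-8"; }
  if (len >= 2 && memcmp (buf, utf16le, 2) == 0)
    { *bom_len = 2; return "UTF-16LE"; }
  if (len >= 2 && memcmp (buf, utf16be, 2) == 0)
    { *bom_len = 2; return "UTF-16BE"; }

  *bom_len = 0;
  return NULL;
}
```

A caller would skip `*bom_len` bytes and decode the rest of the file in the reported encoding, falling back to the #pragma or the user's default when the function returns NULL.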
It's worth noting that the standard C library facilities for "multibyte character sets" are not adequate to implement the above. The basic problem is that neither C89 nor C99 gives you any way to specify the character set of a file directly. You can manipulate the "locale," which indirectly specifies the character set, but that's a global change. Further, locale names are not defined by the C standard nor is there any consistent map between them and character sets.
The Single Unix specification, and possibly also POSIX, provide the nl_langinfo and iconv interfaces, which mostly circumvent these limitations. We may require these interfaces to be present for complete non-ASCII support to be functional.
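As a rough illustration of those two interfaces, the sketch below asks the locale for its character set name with nl_langinfo and converts a buffer from a named character set to UTF-8 with iconv. The wrapper names `locale_charset_name` and `convert_to_utf8` are hypothetical, and the encoding names iconv_open accepts ("ISO-8859-1", "UTF-8") vary between systems, so treat them as assumptions. Some systems also need an extra library at link time for iconv.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical wrapper: name of the character set implied by the
   user's locale.  Note that setlocale changes global state.  */
const char *
locale_charset_name (void)
{
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}

/* Hypothetical wrapper: convert INLEN bytes at IN, encoded in the
   character set named FROM, to UTF-8 in OUT (capacity OUTLEN).
   Returns the number of bytes written, or (size_t) -1 on failure.  */
size_t
convert_to_utf8 (const char *from, const char *in, size_t inlen,
                 char *out, size_t outlen)
{
  iconv_t cd;
  char *inp = (char *) in;      /* iconv takes a non-const pointer.  */
  char *outp = out;
  size_t inleft = inlen;
  size_t outleft = outlen;
  size_t rc;

  assert (from != NULL && in != NULL && out != NULL);

  cd = iconv_open ("UTF-8", from);
  if (cd == (iconv_t) -1)
    return (size_t) -1;         /* Conversion not supported.  */

  rc = iconv (cd, &inp, &inleft, &outp, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1)
    return (size_t) -1;         /* Bad input or OUT too small.  */

  return outlen - outleft;
}
```

Because the source character set is passed to iconv_open by name for each conversion, the file's encoding can be chosen per file without touching the global locale, which is the limitation of the standard C facilities described above.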
One final note: EBCDIC is, and will be, supported as a source character set if and only if GCC is compiled for a host (not a target) which uses EBCDIC natively.
Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
These pages are maintained by the GCC team. Last modified 2022-10-26.