Deficiencies of GCC's optimizer

This page lists places where GCC's code generation is suboptimal. Although the examples are small, the problems are usually quite deep.

Note: unless otherwise specified, all examples have been compiled with the current CVS tree as of the date of the example, on x86, with -O2 -fomit-frame-pointer -fschedule-insns. (The x86 back end disables -fschedule-insns, which is something that should be revisited, because it almost always gives better code when I turn it back on.)


Putting constants in special sections.

If a function has been placed in a special section via attributes, we may want to put its static data and string constants in a special section too. But which one? (Being able to specify a section for string constants would be useful for the Linux kernel.)
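
As a hypothetical illustration (the section name here is made up, not a real convention), the function below is placed in a special text section, but its string literal and static data still land in the default .rodata and .bss sections, with no way to request matching sections for them:

    #include <stdio.h>

    /* Hypothetical example; the section name ".text.init" is illustrative.
       The function body goes where the attribute says, but its constants
       and static data do not follow it.  */
    __attribute__ ((section (".text.init")))
    void init_device (void)
    {
      static int calls;                /* placed in .bss as usual */
      puts ("initializing device");    /* literal placed in .rodata as usual */
      calls++;
    }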


Un-cse.

Perhaps we should have an un-cse step right after cse, which tries to replace a reg with its value when the value can be substituted for the reg everywhere and doing so looks like an improvement, which it usually is if the reg is used only a few times. Use rtx_cost to determine whether the change is really an improvement.
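
A minimal, purely hypothetical illustration of the idea: whether cse actually makes the choice below depends on the target and its costs, but cheap values are sometimes better rematerialized at each use than kept in a register.

    /* Hypothetical illustration.  cse might decide to keep the repeated
       constant in a register; on x86 an immediate store is just as cheap,
       so an un-cse pass could substitute the constant back at both uses
       and free the register.  rtx_cost is what should decide whether the
       substituted form is really no more expensive.  */
    void init_flags (unsigned *p, unsigned *q)
    {
      *p = 0x12345678;
      *q = 0x12345678;
    }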


Clean up how cse works.

The scheme is that each value has just one hash entry. The first_same_value and next_same_value chains are no longer needed.

For arithmetic, each hash table elt has the following slots: the operation (an rtx code), the machine mode, and operands 0, 1 and 2, each of which points to another hash table elt.

So, if we want to enter (plus:SI (reg:SI 30) (const_int 104)), we first enter (const_int 104) and find the entry that (reg:SI 30) now points to. Then we put these elts into operands 0 and 1 of a new elt. We put PLUS and SI into the new elt.
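
A rough sketch of such a hash element in C; the names and layout here are invented for illustration and are not GCC's actual types.

    /* Illustrative sketch only.  Each distinct value gets exactly one
       element like this.  */
    struct value_elt
    {
      int code;                  /* rtx code: PLUS, CONST_INT, UNKNOWN, ... */
      int mode;                  /* machine mode, e.g. SImode */
      struct value_elt *op[3];   /* operand values, themselves hash elements */
      long const_val;            /* the constant, when code is CONST_INT */
    };

    /* Entering (plus:SI (reg:SI 30) (const_int 104)) would fill in a new
       element with code = PLUS, mode = SImode, op[0] = the element the
       reg currently points to, and op[1] = the element for 104.  */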

Registers and mem refs would never be entered into the table as such. However, the values they contain would be entered. There would be a table indexed by regno which points at the hash entry for the value in that reg.

The hash entry index now plays the role of a qty number. We still need qty_table and reg_eqv_table to record which regs share a particular qty.

When a reg is used whose contents are unknown, we need to create a hash table entry whose contents say "unknown", as a place holder for whatever the reg contains. If that reg is added to something, then the hash entry for the sum will refer to the "unknown" entry. Use UNKNOWN for the rtx code in this entry. This replaces make_new_qty.

For a constant, a unique hash entry would be made based on the value of the constant.

What about MEM? Each time a memory address is referenced, we need a qty (a hash table elt) to represent what is in it. (Just as for a register.) If this isn't known, create one, just as for a reg whose contents are unknown.

We need a way to find all mem refs that still contain a certain value. Do this with a chain of hash elts (for memory addresses) that point to locations that hold the value. The hash elt for the value itself should point to the start of the chain. It would be good for the hash elt for an address to point to the hash elt for the contents of that address (but this ptr can be null if the contents have never been entered).
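
A sketch of the bookkeeping this would need, again with invented names rather than anything from GCC:

    /* Illustrative sketch.  Every reg or mem address that currently holds
       a given value is linked into a chain headed at that value's hash
       element; invalidation only ever has to touch these chains.  */
    struct value_elt;

    struct location                       /* a register or a memory address */
    {
      struct value_elt *contents;         /* value held here; NULL if unknown */
      struct location *next_same_value;   /* next location holding that value */
    };

    struct value_elt
    {
      /* operation, mode and operand slots as sketched above, plus: */
      struct location *held_in;           /* head of the chain of holders */
    };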

With this data structure, nothing need ever be invalidated except the lists of which regs or mems hold a particular value. It is easy to see whether there is a reg or mem that is equivalent to a particular value, and if a value is constant, its hash entry says so explicitly.


Loop optimization

Strength reduction and iteration variable elimination could be smarter. They should know how to decide which iteration variables are not worth making explicit because they can be computed as part of an address calculation. Based on this information, they should decide when it is desirable to eliminate one iteration variable and create another in its place.
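For instance (an illustrative example, assuming n >= 0, not something taken from GCC's test suite), the induction variable below is only used to form an address; whether eliminating it is a win depends on the target's addressing modes:

    /* i is used only to compute the address &a[i]; strength reduction
       could keep a pointer instead and drop i entirely.  */
    int sum (const int *a, int n)
    {
      int s = 0;
      for (int i = 0; i < n; i++)
        s += a[i];
      return s;
    }

    /* Roughly the shape the loop optimizer would aim for on machines
       without a scaled-index addressing mode: */
    int sum_reduced (const int *a, int n)
    {
      const int *p = a;
      const int *end = a + n;
      int s = 0;
      for (; p < end; p++)
        s += *p;
      return s;
    }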

It should be possible to compute what the value of an iteration variable will be at the end of the loop, and eliminate the variable within the loop by computing that value at the loop end.
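
A hypothetical example of the kind of final-value computation meant here:

    /* The only use of i after the loop is its final value, which can be
       computed at the loop end (n when n > 0, otherwise 0), so i could
       be eliminated from the loop body entirely.  */
    int clear_and_count (int *a, int n)
    {
      int i;
      for (i = 0; i < n; i++)
        a[i] = 0;
      return i;
    }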


Using constraints on values

Many operations could be simplified based on knowledge of the minimum and maximum possible values of a register at any particular time. These limits could come from the data types in the tree, via rtl generation, or they can be deduced from operations that are performed. For example, the result of an and operation one of whose operands is 7 must be in the range 0 to 7. Compare instructions also tell something about the possible values of the operand, in the code beyond the test.
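
A small illustration of both sources of constraints (hypothetical; no claim is made about what GCC does or does not catch today):

    /* (x & 7) is known to lie in the range 0..7, so the first test can
       never be true; and inside the branch where y < 3 has been tested,
       the comparison y < 4 is already decided.  */
    unsigned f (unsigned x)
    {
      unsigned y = x & 7;
      if (y > 7)           /* redundant given the AND */
        return 0;
      if (y < 3)
        return y < 4;      /* always 1 here, known from the earlier test */
      return y;
    }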

Value constraints can be used to determine the results of a later comparison. They can also show that certain and operations are redundant. Constraints might even permit a decrement-and-branch instruction that tests for zero to be used in a loop whose source code exits when the counter becomes negative.


Change the type of a variable

Sometimes a variable is declared as int, it is assigned only once from a value of type char, and then it is used only by comparison against constants. On many machines, better code would result if the variable had type char. If the compiler could detect this case, it could change the declaration of the variable and change all the places that use it.
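
A hypothetical instance of the pattern described:

    /* c is declared int, assigned once from a char value, and then only
       compared against constants; narrowing it to char (or keeping it in
       a byte register) would give better code on many machines.  */
    int classify (const char *s)
    {
      int c = *s;
      if (c == 'a')
        return 1;
      if (c == 'z')
        return 2;
      return 0;
    }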


Better handling for very sparse switches

There may be cases where it would be better to compile a switch statement to use a fixed hash table rather than the current combination of jump tables and binary search.
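
For example (an illustration of sparseness, not a case known to be compiled badly today):

    /* The case values are far too sparse for a jump table, so GCC falls
       back to a tree of compares; a small hash of the case values might
       beat both.  */
    int category (int code)
    {
      switch (code)
        {
        case 7:       return 1;
        case 1003:    return 2;
        case 52711:   return 3;
        case 900001:  return 4;
        default:      return 0;
        }
    }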


Order of subexpressions

It might be possible to generate better code by paying attention to the order in which code is generated for the subexpressions of an expression.
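
For example (a classic textbook illustration rather than a reported GCC deficiency):

    /* For a + (b + (c + d)), generating code for the deeper right-hand
       operand first needs fewer registers to hold intermediate results
       than a strict left-to-right order (the Sethi-Ullman observation).  */
    int chain4 (int a, int b, int c, int d)
    {
      return a + (b + (c + d));
    }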


Distributive law

The C expression *(X + 4 * (Y + C)) compiles better on certain machines if rewritten as *(X + 4*C + 4*Y), because the constant term 4*C can then be folded into the displacement of an addressing mode. It may be tricky to determine when, and for which machines, to prefer each form.
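
An illustration at the C level, assuming a 4-byte int:

    /* x[y + 3] is *(X + 4*(Y + 3)) in byte terms.  Distributing gives
       *(X + 4*Y + 12): on x86 the whole address fits a single
       base + index*4 + displacement addressing mode, so no separate
       add of y + 3 is needed.  */
    int element (int *x, long y)
    {
      return x[y + 3];
    }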

Some work has been done on this, in combine.c.


Better builtin string functions

Although GCC implements numerous optimizations of the standard C library's string, math and I/O functions, there are still plenty more transformations that could be implemented.
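
Two folds GCC already performs give the flavor of what is meant; the remaining transformations are of the same kind.

    #include <string.h>

    /* strlen of a string literal folds to a constant, and strcpy from a
       literal of known length can become a fixed-size memcpy.  Many
       similar folds are still done only by glibc's macros.  */
    size_t greet (char *buf)
    {
      strcpy (buf, "hello");      /* can become memcpy (buf, "hello", 6) */
      return strlen ("hello");    /* folds to the constant 5 */
    }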

The GNU libc also currently contains macros to optimize calls to some string functions with constant arguments, and calls that can be implemented by processor-specific instructions. These transformations are better performed in GCC, both to reduce the overhead of macro expansion and to take advantage of the functions' attributes, for example to avoid a second call to a pure function altogether. The use of these macros tends to cause a huge increase in the size of preprocessed source if the calls are nested; for example, each nested call to strcpy expands the source 20-fold, with four nested calls having an expansion ten megabytes in size. GCC then consumes a huge amount of memory compiling such expressions. The remaining optimizations need to be implemented in GCC and then disabled in glibc, with benefits to other systems as well, and with the potential to use the information GCC has about alignment.

All the string functions act as if they access individual characters, so care may need to be taken that no -fstrict-aliasing problems occur when internal uses of other types are generated. Also, the arguments to the string function must be evaluated exactly once each (if they have any side effects), even though the call to the string function might be optimized away.

Care must be taken that any optimizations in GCC are standards-conforming in terms of not possibly accessing beyond the arrays involved (including within any function call created by the optimization): whereas the glibc macros know the glibc implementation and how much memory it may access, GCC's optimizations cannot rely on such knowledge.

There are some further optimizations in glibc not covered here, that either optimize calls to non-ISO C functions or call glibc internal functions in the expansion of the macro.

Many of these optimizations should not be applied if -Os is specified.


Data prefetch support

Loads from memory can take many cycles if the loaded data is not in a cache line. A cache miss can stall a CPU for several hundred cycles, so minimizing cache misses is one of the most important jobs of a modern compiler. Data prefetch instructions are one tool the compiler can use for this.

Data prefetch instructions allow a compiler to minimize cache-miss latency by loading data into a cache before it is accessed. A separate page describes data prefetch support and optimizations that are in development or already supported by GCC.
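
For explicit prefetching, GCC already provides __builtin_prefetch; the loop below is only an illustration, and the prefetch distance of 16 iterations is an arbitrary value, not a tuned one.

    /* The second and third arguments (read/write hint and temporal
       locality) of __builtin_prefetch are optional; prefetching a few
       cache lines ahead of the current iteration hides miss latency.  */
    void scale (double *dst, const double *src, long n)
    {
      for (long i = 0; i < n; i++)
        {
          __builtin_prefetch (&src[i + 16], 0, 1);
          dst[i] = 2.0 * src[i];
        }
    }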


Target specific optimizer deficiencies

Almost all code transformations implemented in GCC are target independent by design, but how well they work depends on how accurately an architecture is modelled. For example, combine may fail to combine some instructions if a machine description does not provide a pattern, or if the pattern exists but its predicates are too strong. Another example is the target cost function, which is used to determine whether a transformation is profitable. And sometimes a transformation that a specific target could benefit from is simply missing from the target independent implementation.

Optimization deficiencies like these are listed per target in the target specific pages below. An issue listed for one target may well also be an issue for other targets, but the problems in the target specific pages are problems with, and ideas for, the machine descriptions of a particular target.

Note: if an issue is listed in a target specific issues page but it clearly is a target independent issue, please move it to a page discussing target independent issues.