This page describes ongoing work to improve GCC's infrastructure for RTL-based optimizers. The work is done on a branch in GCC's CVS tree called cfg-branch, which is available to experiment with. As the patches stabilize, they will be submitted for review and acceptance into the mainline development tree. Please contact Jan Hubicka <jh@suse.cz> if you would like to contribute.
The purpose of work on the CFG branch is to update GCC's older optimization passes to use the modern infrastructure which was not available when they were written. A second purpose is to continue improving the infrastructure for RTL-based optimizations.
Currently, work concentrates on the control flow graph (CFG) data structure, originally contributed by Richard Henderson. We would like to convert all of the RTL-based optimizer passes to use it and keep it accurate. This will allow us to store more information directly in this structure, instead of using "notes" in the instruction stream which are difficult to keep accurate during optimization.
In addition to improving the infrastructure and updating the existing RTL-based optimization passes, we are implementing new optimization passes. We plan first to implement optimizations that perform code duplication, such as superblock formation and loop peeling. We also plan to reimplement the loop unrolling pass.
Some ideas that the cfg-branch will implement are known only in the compiler research literature but are new concepts for GCC. We'd like to explain the concepts briefly, give an overview of what we want to achieve in GCC and describe the current status.
This work is done by Jan Hubicka.
Loop peeling is a form of loop unrolling in which the first iterations of a loop are unrolled. It removes i iterations from the beginning of a loop and adds i copies of the loop body directly before the start of the loop.
The loop peeling optimization will be done as part of the superblock formation. The tracer should peel as many iterations as can be predicted. Zdenek has also implemented simple loop peeling as part of the new loop optimizer.
The code is on the branch.
The tracer peels loops with one predicted iteration. We should try to peel more than one iteration.
This work is done by Zdenek Dvorak.
The current loop optimizer uses information passed down by the front end to discover loop constructs, in order to simplify flow analysis. This information is difficult to keep up to date, and nowadays it is easy to implement loop discovery directly on top of the CFG.
We want to use Michael Hayes' loop discovery code and gradually replace the existing features of the loop optimizer with new ones. So far Zdenek has modified the data structure to allow easier updating and has implemented loop unrolling, peeling and unswitching on top of it.
The main changes to the loop infrastructure have been merged into mainline. The rest is being tuned on the branch.
This work is done by Josef Zlomek.
Software Trace Cache is a basic block layout algorithm that also performs code duplication in the process. While optimal code layout is an NP-complete problem, allowing a limited amount of duplication should eliminate the common drawbacks of current algorithms while keeping code size under control. The purpose of this pass is to minimize the number of branches and cache misses. It uses code duplication to avoid jumps; a trivial example is copying a return instruction instead of jumping to it. See [6] for a detailed description.
The implementation will only do intraprocedural optimizations since interprocedural optimizations are out of reach of current GCC.
The code is on the branch. The exact benefits still need to be measured, but on non-PDO runs it brings a 5% improvement on the eon benchmark (from SPEC CPU2000).
This pass constructs webs as commonly used for register allocation purposes. Afterwards, each web is assigned its own pseudo register. This allows our register allocation pass to operate on pseudos directly, but it also strengthens several other optimization passes, such as CSE, the loop optimizer and the trivial dead code remover.
While future SSA passes will have the same effect on the low-level RTL, we still believe this pass to be useful right before the register allocation pass, or after code duplication passes that need to unshare temporaries, such as the tracer or the loop unroller.
The implementation of the web pass uses Michael Hayes's dataflow module to construct du-chains, and union-find to produce the webs.
The code is on the branch.
This pass improves SPECint2000 performance by roughly 1%. We need to work on updating debug information in this pass.
This pass coalesces multiple registers into a single one in order to avoid register-to-register copies, which our register allocator does not deal with very well. It is a temporary solution until the new register allocator is in place; at that point the benefits become questionable, but it is still more effective (and probably less expensive) than the current copy propagation implementation.
It is designed as a stand-alone pass, run after GCSE, that constructs a conflict graph and coalesces registers.
The code is on the branch.
This pass improves some of the SPECint2000 tests and degrades others, mostly because of the lack of live range splitting and the increased register pressure in the final program. Overall it helps by about 1 point on peak SPECint runs.
This work is done by Jan Hubicka.
This optimization is very close to instruction combining. The compiler discovers conditionals with common expression-like patterns, such as (a || b) and (a && b), and attempts to create compound conditions. When using condition register arithmetic this is always possible, but it is not always profitable on all architectures. Trivial, yet common, cases can still be handled. Some examples of these conversions: (a || b) can be converted to (a | b), (a == b && c == d) to ((a ^ b) | (c ^ d)) == 0, and (a == 3 || a == 4) to (unsigned)(a - 3) < 2.
This is implemented as part of the if-conversion pass. It finds patterns of two jumps in the CFG that can be combined and then attempts to apply several simplifications. Profile data is used to control the amount of optimization.
The code is on the branch. Benefits on the Athlon CPU are about 0.5%. The code is being reviewed for inclusion by Richard Henderson.
This work is done by Zdenek Dvorak.
GCC currently does not support profiling of threaded programs.
Thread-safe counters are only needed on SMP machines and on machines that have no atomic add instructions. We currently plan to implement this with per-thread counters: at the start of each function, a function from GCC's runtime environment will be called that returns a pointer to the current thread's counters.
A first version has been written and installed on the branch. Merging it to mainline is problematic, however, and the code may be dropped.
To shorten branches and avoid instruction cache pollution, it is desirable to put frequently executed procedures together and to split off their infrequently executed parts.
We don't split function bodies, but we do categorize whole functions into different parts. A first version has been written and installed on the branch. It adds about 1% to SPEC2000 runs with profile feedback.
This work is done by Jan Hubicka.
The current RTL representation is too low-level for a number of optimizations that we want to implement, which makes their implementation difficult and in some cases even ineffective. MidRTL will essentially be an RTL representation of the program stripped of most architectural details, making transformations of the instruction stream much easier and thus serving as the basis for new optimization passes.
The following outline has been written by Jeff Law:
Particularly for the task of building the machine description for the generic RTL and the translation from generic RTL into target-specific RTL. Those two (closely related) tasks are the biggest hunks of infrastructure we need.

Conceptually what we want is an md file which describes the set of generic RTL patterns. We're not precisely sure what predicates to use for operands, so invent something like "generic_operand", which we will define more precisely later. Make all operands use "generic_operand". At least initially, accept MEMs, REGs and all constants. The only "tricky" part of "generic_operand" is memory addresses; at least at this stage, generic_operand should accept a MEM with a "generic_address". A "generic address" should be any "generic operand" except another MEM (i.e. we don't allow (MEM (MEM)) at this stage). We will likely refine this later.

Once you've got a rough framework in place, you have to build out some infrastructure for lowering. Specifically, we'll need the ability to have two recognizers. Early in compilation we'll only recognize the generic patterns. When we start lowering we want to recognize the target-specific expanders, patterns, splitters, etc. So you'll probably have to hack up genrecog to build two insn-recog files, with different prefixes for anything with external scope, and some infrastructure to switch between them based on a state variable which indicates we've started the lowering process.

For the actual lowering process, we want to avoid twiddling the existing back ends as much as possible. If I recall correctly, the general idea Richard and I came up with was: for each insn, see if it matches a pattern in the target's md file. If it does, then you're done. Similarly, if it matches a splitter, then run the splitter, match the resulting insns in the target md file, and you're done. Otherwise, generic.md should call into optabs.c to expand operations -- this will in large part allow target-specific lowering using existing mechanisms (i.e., lowering of operands, running target expanders, etc.). It's likely we'll need to refine this some, but this is the general idea.
A first version bootstraps and will hopefully be included in the branch soon. It is far from complete, though.
Achievements that have been merged into mainline already:
Patches that are waiting for review and integration into mainline:
A tracer pass (on top of cfglayout.c), based on the basic block reordering pass and capable of code duplication. The tracer, also known as the tail duplication or superblock formation pass, identifies commonly executed sequences of basic blocks, so-called traces, and duplicates their blocks in order to avoid side entrances to those sequences, forming so-called superblocks. The superblocks are easier to optimize for conventional passes such as CSE and the scheduler. In case no extra optimizations apply, our cross-jumping pass merges the code paths again after register allocation.
We plan to do loop peeling and superblock formation in a single pass, as described in [3].
Enhancements that are currently only in the cfg-branch:
toplev.c.
This work is done by Zdenek Dvorak.
This is an infrastructure enhancement to propagate branch predictions from the tree representation to the RTL level.
This work is done by Pavel Nejedly.
Currently the profile is fragile, since there is no verification that the compiled program matches the profile data. Because the profiler eliminates redundancy in the data, a mismatch is often not discovered at all, producing completely nonsensical results.
We plan to calculate a checksum (CRC) of each flow graph when it is constructed. The file format will also be extended so that more information can be recorded, such as histograms of loop iteration counts, which can then be used for loop peeling and loop unrolling.
We also want to add versioning and further information to make it look like a real file format.
There are a few references to much more advanced profiling systems in [3].
This is done using NOTE_INSN_PREDICTION notes emitted in the instruction stream and later converted to REG_PREDICTION notes. For instance, the compiler can now predict branches using goto as infrequent. It also allows us to handle, and specially predict, constructs such as returns of (negative) constants, out-of-range code and exception handling, where the front end knows that these are infrequent but the back end has no information about it.
We can also use it to preserve more information about high-level loops. For instance, a loop with a continue statement looks like two nested loops to the back end, but the optimizer may be able to estimate that the inner (continue) loop is not that frequent.
The branch was created from the development mainline on 12 November 2001. Its CVS tag is cfg-branch. Check out the branch with the command cvs co -r cfg-branch gcc, then configure and build in the normal way.
Please post patches in the usual way to the gcc-patches list, marked [cfg-branch]. As this is a branch, the usual maintainer rules do not apply. This branch is maintained by Jan Hubicka <jh@suse.cz>. Approval from the usual maintainers will be needed when submitting patches from the branch for consideration for the mainline.
Lots of useful information is available at the IMPACT compiler group homepage. Some other papers:
Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.
These pages are maintained by the GCC team. Last modified 2024-06-10.