Projects to improve performance on IA-64

This page lists projects that are expected to improve the performance of the code that GCC generates for IA-64, more properly known as IPF (Itanium Processor Family). The lists originally came out of the GCC IA-64 Summit that was held June 6, 2001, and many of the comments are from that summit. Later updates are from discussions among people working in this area. Additions and corrections are always welcome.

During the June 2001 summit, developers of proprietary IA-64 compilers stressed that interactions between optimizations for IA-64 can be very significant, more so than with other architectures. People contributing IA-64 improvements are highly encouraged to work closely with people working on related improvements so that adverse interactions can be detected early.

Short-term projects

Track memory origin to allow better alias analysis
At the summit in June 2001, Ross Towle said that some optimizations for IA-64 fall out nicely if data dependence information is as perfect as it can be. At that time GCC did not keep track of this information well at all, and experienced GCC developers reported that alias analysis in GCC during scheduling is extremely weak; it can even lose track of which addresses are supposed to come from the stack frame. It's weak in general and is even weaker in IA-64. Alias analysis is a general infrastructure problem; GCC has no knowledge of cross-block scheduling.

Since then, Richard Kenner has checked in several patches to track memory origins. His changes link each MEM to the declaration it's from so that alias analysis can know that two MEMs from different declarations can't conflict. This allows other things to be specified in a MEM, like alignment. He's also added functionality to prove that two MEMs cannot conflict.
Make better use of alias information
Now that better alias information is available, GCC should make use of it.

What kinds of projects could now make use of Richard's memory origin work? Is the new information available during scheduling? What other optimizations could use it?
Data prefetch support
At the GCC IA-64 Summit in June 2001, developers of other IA-64 compilers said that optimizations involving compiler generated data prefetch are important for IA-64 performance.

GCC 3.1 includes a prefetch RTL pattern that supports data prefetch on a variety of GCC targets, a __builtin_prefetch function, and the optimization -fprefetch-loop-arrays. General information about data prefetch and about data prefetch instructions supported by a variety of GCC targets are described in the Data Prefetch Support section of the Projects list.

Janis Johnson is trying tweaks to the heuristics used for the -fprefetch-loop-arrays optimization to try to get better performance on IA-64.
Use existing dependence distance code
There is dependence distance code already checked into the compiler that no one uses. That information could be hooked into the loop unroller and the prefetcher. For example, this is to check that references to two different array elements within a loop iteration don't conflict. See the code in dependency.c to see if it uses the MEM tracking information and if the dependence distance code itself is ever used in any loop optimization or could be used there. This could also be hooked up to the MEM info struct and used for iteration distance.
Make better use of dependence information in scheduling
Code locality; order functions based on profiling
Code locality is even more important for this architecture than for others where it shows a benefit.

There is an article by Carl Pettis and Bob Hansen about how to order functions based on a call graph: "Profile guided code positioning", http://acm.proxy.nova.edu/pubs/articles/proceedings/pldi/93542/p16-pettis/p16-pettis.pdf.

Steve Christiansen tried using gprof output to create a linker script that orders functions based on run-time call graphs and call counts, but couldn't show that it made a difference, based on SPEC CPU2000 results.
Code locality: exploit existing profile-directed block ordering
Jan Hubicka, together with Richard Henderson and Andreas Jaeger, made several changes to the profile-directed block ordering in GCC for GCC 3.1. This functionality is available through -fbranch-probabilities using data generated by first compiling with -fprofile-arcs. This is described in Infrastructure for Profile Driven Optimizations, which also lists items for future work.

The following items came out of the June 2001 summit as issues to investigate:
- GCC does not split a function into multiple regions, although that has been mentioned as a possibility.
- Profile information could be used to improve linearization of the code; the CFG branch includes some support for trace scheduling.
- Profile information could be used for if-conversion to decide which side of the branch should be predicated.
- Profile information is used for predication and delay slots.
Code locality: static function ordering
Look into SGI's tool CORD to determine whether its techniques can be used with GCC.
Inlining: improve the heuristics used to guide inlining with -O3
Some of this was done in the summer of 2001 and is in GCC 3.1. There might be more work that could be done here.
Improve the machine model
Validate that the machine model in GCC is accurate. This would be most useful when specific problems are noticed in generated code, rather than making a full pass through it.

Look into incorporating information from Intel's KAPI library into the machine model in GCC.
Improve GCC instruction bundling
The machine model should guide instruction bundling, but currently it is done using ad-hoc methods.

To evaluate instruction bundling, look at nop density.
Register allocator knowledge of hidden RSE costs
The register allocator needs to know that there is some cost in allocating additional stack registers because there's the danger of hidden spilling in the Register Stack Engine (RSE) at the time of a call.
Control speculation for loads
This doesn't require recovery code and is quite simple, with chk.s.
Straight-line post-increment
Turning off the current support actually makes faster code for IA-64, since it tends to create extra dependencies. For it to be used effectively post-increment could be generated after the second scheduling pass, with a third pass then required.

Post-increment could be used when optimizing for size.

Exploit opportunities for non-loop induction variables.
Enable branch target alignment
It's necessary to measure the trade-offs between alignment and code size.
Align procedures
This isn't turned on for IA-64; again, measure the trade-offs.
Tuning for Itanium 2
Tuning for Itanium 2, controlled by -mtune, should be added.
Double test conversion
Jan Hubicka added support to the mainline (to become 3.2) to do branch combining of chained branches having the same destination, with hooks for target-specific tricks. Such tricks are expected to be worthwhile for IA-64; see the thread in the gcc-patches archives.

Long-term and infrastructure projects

Region formation heuristics
John Sias explains: "Region formation is a way of coping with either limitations of the machine or limitations of the compiler / compile time. "Regions" are control-flow-subgraphs, formed by various heuristics, usually to perform transformations (i.e. hyperblock formation) or to do register allocation or other work-intensive things. For hyperblock formation, for example, region formation heuristics are critical---selecting too much unrelated code wastes resources; conversely, missing important paths that interact well with each other defeats the purpose of the transformation. Large functions are sometimes broken heuristically into regions for compilation, with the goal of reducing compile time."

Richard Henderson says we could rip out the Haifa scheduler's CFG detection, use regular data structures, and fix region detection.
Language-independent tree optimizations
Now that the tree-ssa branch has been merged into mainline, we can perform cool optimizations that require more information than is available in RTL.
Inlining: use profile information to guide inlining
The infrastructure for this is not yet available.
High-level loop optimizations
The lno-branch can perform many high-level loop optimisations.
Hyperblock scheduling
This requires highly predicated code.
Predication
There is little or no knowledge of predication outside of the if-cvt.c file, so there are a number of optimization passes that are suboptimal when predicated code is present. None of the optimization passes up to and including register allocation know how to handle predication from a correctness standpoint.
- if-conversion
- finding longer strings of logical
- PQS (Predicate Query System)
- disjoint predicates
PQS is a database of known relationships between predicates. It would underlie predicate-aware dataflow, and therefore dependence drawing and register allocation.
Data speculation
Bernd Schmidt made an unsuccessful attempt to add data speculation. Completing the patch won't be worthwhile until there is a sufficient amount of ILP.

The IBM IA-64 compiler team saw code in important applications that could have benefited from very local data speculation; see comments by Jim McInnes in the minutes of the GCC IA-64 Summit.
Control speculation
Control speculation is more important than data speculation. It needs cross-block scheduling, since the compiler doesn't see the opportunity or need within a basic block. Both require generating recovery code, which introduces new instructions and new register definitions and uses. It might be difficult to build in.

Some people at Red Hat tried unsuccessfully to tie control speculation into the Haifa scheduler, but the effort showed that alias analysis in GCC during scheduling is extremely weak. One problem was that it couldn't even tell which addresses are from the stack frame and so it would speculate too much. This project was tried quite quickly, though, and with more time such a project might be successful. Since then, Richard Kenner has added support for tracking memory origin, so this might be more successful now.

Bernd Schmidt might have an unfinished patch that could be picked up.

Stan Cox also had an unfinished control speculation patch.
Modulo scheduling
Rotating registers
Function splitting (moving function into two regions), for locality
This is difficult if an exception is involved.

Dwarf2 is the only debugging format that can handle this.
Optimization of structures that are larger than a register
The infrastructure doesn't currently handle this. This is related to memory optimizations.
Instruction prefetching
Data prefetching is mentioned under short-term projects. Instruction prefetching requires additional infrastructure.
Use of BBB template for multi-way branches (e.g. switches)
It might be difficult to keep track of this in the machine-independent part of GCC.
Cross-module optimizations
Avoid reloads of GP when it is not necessary. The compiler needs more information than is currently available.
Register allocator handling GP as special
C++ optimizations
Jason Merrill invented cool stuff, e.g. thunks for multiple inheritance, that hasn't been done yet.

It's possible to inline stubs.
"external" attribute or pragma
This would be for information like DLL import/export; it is not machine independent.

If GCC defined such an attribute, glibc would probably use it.

Tools: performance tools, benchmarks, etc.

Analyze benchmark results to identify important optimizations
One of the projects identified at the GCC IA-64 Summit is measuring the performance of GCC on IPF, comparing it to other IPF compilers, and identifying the reasons for performance differences. This would enable the limited developer resources to be spent on those improvements that are most likely to affect the performance of the applications that are identified as being important.

This project can be broken up into a number of tasks that can be performed by separate teams to best utilize the experience and strengths of each team.
- Determine which benchmarks most accurately reflect the performance of Linux system software and the enterprise applications that are most likely to be used on IPF platforms (or software of interest to you).
- Run the selected benchmarks on other architectures with GCC.
- Run the selected benchmarks on Itanium with GCC and other IPF compilers (Intel, HP, SGI).
- Determine which significant sections of benchmark code show the worst relative performance of GCC on Itanium.
- Analyze the assembly code generated by GCC and by the proprietary IPF compilers to determine, where possible, which optimizations would most improve the performance of GCC code.
- Pass on information to GCC developers about the relative value of various short-term and long-term optimizations.
Benchmark specific optimizations
Run benchmarks with GCC for IPF with a variety of options for specific optimizations to determine which ones should be included with gcc -O2.
Profile the Linux kernel
Profile the kernel and look for hot spots where better code generation or optimization would make a significant difference.

Gary Hade at IBM has been collecting profile and coverage data for 2.4.18 IA-64 Linux kernels built with prerelease versions of GCC 3.1. Profile data collection utilizes the SGI-donated Kernprof facility. Coverage data collection utilizes the IBM-donated GCOV kernel facility. The data is being generated under various system loads including parallel Linux kernel builds, AIM Suite VII Benchmark, and the SPEC SDET Benchmark. Much of the data has already been collected but it still needs to be analyzed.

Information and code for the data collection facilities and workloads mentioned above is available from these sources:
- Kernprof: http://oss.sgi.com/projects/kernprof/
- SPEC SDET (proprietary benchmark): description at specbench web site
Dispersal analysis
Steve Christiansen wrote a dispersal analysis tool that uses objdump disassembly output. It uses McKinley rules and cannot be distributed outside of IBM.

Gary Hade added Itanium 1 support to Steve's dispersal analysis code and integrated the code into GNU Binutils source so it can be invoked and controlled from objdump using IA-64 specific disassembler options. The results of the optional dispersal analysis are added to the disassembly output. Gary submitted a patch supporting Itanium 1 to the Binutils project via the bug-binutils mailing list on June 6, 2002. An update adding Itanium 2 support can be provided after Intel makes the McKinley information public.
Statistics gathering tool
PMU-based performance monitor
Small test cases and sample codes for examining generated code
Developers of proprietary IPF compilers who have identified key code fragments from real applications where IPF optimizations make a big difference could share these with GCC developers.
Compiler instrumentation that would cause an application to dump performance counter information
Fix profiling tools so they work with threads
This would allow a tool that uses profiling output to order functions to be used with a wider variety of applications.

For questions related to the use of GCC, please consult these web pages and the GCC manuals. If that fails, the gcc-help@gcc.gnu.org mailing list might help. Comments on these web pages and the development of GCC are welcome on our developer list at gcc@gcc.gnu.org. All of our lists have public archives.

Copyright (C) Free Software Foundation, Inc. Verbatim copying and distribution of this entire article is permitted in any medium, provided this notice is preserved.

These pages are maintained by the GCC team. Last modified 2025-01-31.

Projects to improve performance on IA-64

Contents:

Short-term projects

Long-term and infrastructure projects

Tools: performance tools, benchmarks, etc.