Auto-vectorization in GCC
The goal of this project was to develop a loop and basic block
vectorizer in GCC, based on the tree-ssa framework.
It has been completed and the functionality has been part of GCC
for years.
Table of Contents
Latest News
- 2011-10-23
-
- Vectorization of reduction in loop SLP.
Both multiple reduction cycles and
reduction chains are supported.
- Various basic block vectorization (SLP)
improvements, such as
better data dependence analysis, support of misaligned accesses
and multiple types, cost model.
- Detection of vector size:
https://gcc.gnu.org/ml/gcc-patches/2010-10/msg00441.html.
- Vectorization of loads with negative step.
- Improved realignment scheme:
https://gcc.gnu.org/ml/gcc-patches/2010-06/msg02301.html.
- A new built-in,
__builtin_assume_aligned
, has been added,
through which the compiler can be hinted about pointer alignment.
- Support of strided accesses using
memory instructions that have
the interleaving "built in", such as NEON's vldN and vstN.
- The vectorizer now attempts to reduce over-promotion of operands in some vector
operations:
https://gcc.gnu.org/ml/gcc-patches/2011-07/msg01472.html.
- Widening shifts are now detected and vectorized
if supported by the target.
- Vectorization of conditions with mixed types.
- Support of loops with bool.
- 2009-12-03
- The following new vectorization features were committed to mainline:
- vectorization of conditions in nested loop (2009-07-20)
- vectorization of double reduction (reductions carried
by two nested loops) (2009-07-12)
- vectorization of nested cycles (dependence cycles
other than reduction cycles in nested loops) (2009-06-16)
- support of misaligned stores for platforms that allow misaligned
access (2009-06-05)
- basic block SLP (vectorization of straight line
code) (2009-05-25)
- avoid versioning of vectorized loop if possible (2009-04-02)
(https://gcc.gnu.org/ml/gcc-patches/2009-03/msg01784.html).
- support load permutations in loop-aware
SLP (2008-08-25)
- support multiple types in loop-aware SLP (2008-08-19)
- 2008-08-11
-
- Vectorization supports integer type conversions also when one type is more
than two times bigger than the other (e.g. char to int) (2008-08-11).
UNITS_PER_SIMD_WORD
can be different for different scalar
types (2008-05-22).
- Vector shifts by a vector shift amount differentiated from vector
shifts with scalar shift amount (2008-05-14).
- Complete unrolling enabled before vectorization, relying on
intra-iteration vectorization (aka SLP) to vectorize unrolled loops (2008-04-27).
- Further refinements to the cost model (2007-12-06).
-ftree-vectorize
is turned on under -O3
(2007-09-18).
Contributing
This project was started by Dorit (Naishlos) Nuzman. Current contributors
to this project include Revital Eres, Richard Guenther, Jakub Jelinek, Michael Matz,
Richard Sandiford, and Ira Rosen.
This web page
is maintained by Ira Rosen <irar@il.ibm.com>.
For a list of missing features and possible enhancements see
https://gcc.gnu.org/wiki/VectorizationTasks.
Using the Vectorizer
Vectorization is enabled by the flag
-ftree-vectorize
and by default
at -O3
. To allow vectorization on
powerpc* platforms also use -maltivec
. On i?86 and
x86_64 platforms use -msse/-msse2
. To enable vectorization
of floating point reductions use -ffast-math
or
-fassociative-math
.
The vectorizer
test cases demonstrate the current vectorization capabilities;
these can be found under
gcc/gcc/testsuite/gcc.dg/vect/. Information on which
loops were or were not vectorized and why, can be obtained
using the flag -ftree-vectorizer-verbose
. For details
see
https://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html.
Example output using -ftree-vectorizer-verbose=2
:
vect-1.c:82: note: not vectorized, possible dependence between data-refs a[i_124] and a[i_83]
vect-1.c:72: note: LOOP VECTORIZED.
vect-1.c:64: note: LOOP VECTORIZED.
vect-1.c:56: note: LOOP VECTORIZED.
vect-1.c:49: note: LOOP VECTORIZED.
vect-1.c:41: note: not vectorized: unsupported use in stmt.
vect-1.c:31: note: not vectorized: unsupported use in stmt.
vect-1.c:13: note: vectorized 4 loops in function.
Basic block vectorization, aka SLP, is enabled by the flag
-ftree-slp-vectorize
, and requires the same platform dependent flags
as loop vectorization. Basic block SLP is enabled by default at -O3
and when -ftree-vectorize
is enabled.
Vectorizable Loops
"feature" indicates
the vectorization capabilities demonstrated by the
example.
Example 1:
int a[256], b[256], c[256];
foo () {
int i;
for (i=0; i<256; i++){
a[i] = b[i] + c[i];
}
}
Example 2:
int a[256], b[256], c[256];
foo (int n, int x) {
int i;
/* feature: support for unknown loop bound */
/* feature: support for loop invariants */
for (i=0; i<n; i++)
b[i] = x;
}
/* feature: general loop exit condition */
/* feature: support for bitwise operations */
while (n--){
a[i] = b[i]&c[i]; i++;
}
}
Example 3:
typedef int aint __attribute__ ((__aligned__(16)));
foo (int n, aint * __restrict__ p, aint * __restrict q) {
/* feature: support for (aligned) pointer accesses. */
while (n--){
*p++ = *q++;
}
}
Example 4:
typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];
foo (int n, aint * __restrict__ p, aint * __restrict__ q) {
int i;
/* feature: support for (aligned) pointer accesses */
/* feature: support for constants */
while (n--){
*p++ = *q++ + 5;
}
/* feature: support for read accesses with a compile time known misalignment */
for (i=0; i<n; i++){
a[i] = b[i+1] + c[i+3];
}
/* feature: support for if-conversion */
for (i=0; i<n; i++){
j = a[i];
b[i] = (j > MAX ? MAX : 0);
}
}
Example 5:
struct a {
int ca[N];
} s;
for (i = 0; i < N; i++)
{
/* feature: support for alignable struct access */
s.ca[i] = 5;
}
Example 6: gfortran:
DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END
Example 7:
int a[256], b[256];
foo (int x) {
int i;
/* feature: support for read accesses with an unknown misalignment */
for (i=0; i<N; i++){
a[i] = b[i+x];
}
}
Example 8:
int a[M][N];
foo (int x) {
int i,j;
/* feature: support for multidimensional arrays */
for (i=0; i<M; i++) {
for (j=0; j<N; j++) {
a[i][j] = x;
}
}
}
Example 9:
unsigned int ub[N], uc[N];
foo () {
int i;
/* feature: support summation reduction.
note: in case of floats use -funsafe-math-optimizations */
unsigned int diff = 0;
for (i = 0; i < N; i++) {
udiff += (ub[i] - uc[i]);
}
Example 10:
/* feature: support data-types of different sizes.
Currently only a single vector-size per target is supported;
it can accommodate n elements such that n = vector-size/element-size
(e.g, 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes).
A combination of data-types of different sizes in the same loop
requires special handling. This support is now present in mainline,
and also includes support for type conversions. */
short *sa, *sb, *sc;
int *ia, *ib, *ic;
for (i = 0; i < N; i++) {
ia[i] = ib[i] + ic[i];
sa[i] = sb[i] + sc[i];
}
for (i = 0; i < N; i++) {
ia[i] = (int) sb[i];
}
Example 11:
/* feature: support strided accesses - the data elements
that are to be operated upon in parallel are not consecutive - they
are accessed with a stride > 1 (in the example, the stride is 2): */
for (i = 0; i < N/2; i++){
a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}
Example 12: Induction:
for (i = 0; i < N; i++) {
a[i] = i;
}
Example 13: Outer-loop:
for (i = 0; i < M; i++) {
diff = 0;
for (j = 0; j < N; j+=8) {
diff += (a[i][j] - b[i][j]);
}
out[i] = diff;
}
}
Example 14: Double reduction:
for (k = 0; k < K; k++) {
sum = 0;
for (j = 0; j < M; j++)
for (i = 0; i < N; i++)
sum += in[i+k][j] * coeff[i][j];
out[k] = sum;
}
Example 15: Condition in nested loop:
for (j = 0; j < M; j++)
{
x = x_in[j];
curr_a = a[0];
for (i = 0; i < N; i++)
{
next_a = a[i+1];
curr_a = x > c[i] ? curr_a : next_a;
}
x_out[j] = curr_a;
}
Example 16: Load permutation in loop-aware SLP:
for (i = 0; i < N; i++)
{
a = *pInput++;
b = *pInput++;
c = *pInput++;
*pOutput++ = M00 * a + M01 * b + M02 * c;
*pOutput++ = M10 * a + M11 * b + M12 * c;
*pOutput++ = M20 * a + M21 * b + M22 * c;
}
Example 17: Basic block SLP:
void foo ()
{
unsigned int *pin = &in[0];
unsigned int *pout = &out[0];
*pout++ = *pin++;
*pout++ = *pin++;
*pout++ = *pin++;
*pout++ = *pin++;
}
Example 18: Simple reduction in SLP:
int sum1;
int sum2;
int a[128];
void foo (void)
{
int i;
for (i = 0; i < 64; i++)
{
sum1 += a[2*i];
sum2 += a[2*i+1];
}
}
Example 19: Reduction chain in SLP:
int sum;
int a[128];
void foo (void)
{
int i;
for (i = 0; i < 64; i++)
{
sum += a[2*i];
sum += a[2*i+1];
}
}
Example 20: Basic block SLP with
multiple types, loads with different offsets, misaligned load,
and not-affine accesses:
void foo (int * __restrict__ dst, short * __restrict__ src,
int h, int stride, short A, short B)
{
int i;
for (i = 0; i < h; i++)
{
dst[0] += A*src[0] + B*src[1];
dst[1] += A*src[1] + B*src[2];
dst[2] += A*src[2] + B*src[3];
dst[3] += A*src[3] + B*src[4];
dst[4] += A*src[4] + B*src[5];
dst[5] += A*src[5] + B*src[6];
dst[6] += A*src[6] + B*src[7];
dst[7] += A*src[7] + B*src[8];
dst += stride;
src += stride;
}
}
Example 21: Backward access:
int foo (int *b, int n)
{
int i, a = 0;
for (i = n-1; i ≥ 0; i--)
a += b[i];
return a;
}
Example 22: Alignment hints:
void foo (int *out1, int *in1, int *in2, int n)
{
int i;
out1 = __builtin_assume_aligned (out1, 32, 16);
in1 = __builtin_assume_aligned (in1, 32, 16);
in2 = __builtin_assume_aligned (in2, 32, 0);
for (i = 0; i < n; i++)
out1[i] = in1[i] * in2[i];
}
Example 23: Widening shift:
void foo (unsigned short *src, unsigned int *dst)
{
int i;
for (i = 0; i < 256; i++)
*dst++ = *src++ << 7;
}
Example 24: Condition with mixed types:
#define N 1024
float a[N], b[N];
int c[N];
void foo (short x, short y)
{
int i;
for (i = 0; i < N; i++)
c[i] = a[i] < b[i] ? x : y;
}
Example 25: Loop with bool:
#define N 1024
float a[N], b[N], c[N], d[N];
int j[N];
void foo (void)
{
int i;
_Bool x, y;
for (i = 0; i < N; i++)
{
x = (a[i] < b[i]);
y = (c[i] < d[i]);
j[i] = x & y;
}
}
Unvectorizable Loops
Examples of loops that currently cannot be
vectorized:
Example 1: Uncountable loop:
while (*p != NULL) {
*q++ = *p++;
}
Also see
https://gcc.gnu.org/wiki/VectorizationTasks
and a list of vectorizer missed-optimization PRs in the GCC bug tracker.
Previous News and Status
- 2007-09-17
-
-ftree-vectorize
is going to be turned on under
-O3
.
- Cost-model tweaks and x86_64 specific costs committed to mainline
(2007-09-10).
- Vectorization that exploits intra-iteration parallelism (ala SLP)
was committed to mainline (2007-09-09).
-fassociative-math
can be used instead of
-ffast-math
to enable vectorization of reductions of
floats (2007-09-04).
- Initial support for vectorization of outer-loops (doubly nested
loops) was committed to mainline (2007-08-19).
- Run-time dependence testing using loop-versioning was committed
to mainline (2007-08-16).
- 2007-07-25
- The new vectorization feature that exploits intra-iteration parallelism
(to be submitted to GCC 4.3) was presented at the GCC summit last week ("Loop-aware SLP").
Also at the summit, we held a BOF on vectorization and other loop optimizations. The summary
of the BOF can be found here:
https://gcc.gnu.org/wiki/LoopOptimizationsBOF. Following the BOF the vectorizer's todo list
was also updated (
https://gcc.gnu.org/wiki/VectorizationTasks).
- Mainline updates:
- SPU specific costs for the cost model committed to mainline (2007-07-12).
Tuning for other platforms (PPC, x86) ongoing.
- Initial cost model implementation committed to mainline
(2007-06-08).
- Vectorization of fp/integer conversions of different sizes (e.g. float/short)
committed to mainline (2007-05-17).
- Data-refs analysis was rewritten and improved (2007-05-13).
- Autovect-branch updates:
- Outer-loop vectorization was enhanced to support aligned and unaligned
memory references in the inner-loop, using the optimized realignment
scheme when possible.
- Vectorization that exploits intra-iteration parallelism (ala SLP)
was added to the vectorizer (that so far exploited only inter-iteration
parallelism).
- The vectorizer cost model was extended to support the above two new vectorization
features (outer-loop and "SLP").
- 2007-05-09
-
- Vectorization of fp/integer conversions of different sizes (e.g. float/short)
is soon to be committed to mainline.
- Initial cost model implementation is soon to be committed to
mainline.
- 2007-04-22
-
- Vectorization of float/double conversions was added to mainline.
- Initial vectorization support for certain forms of outer-loops (doubly
nested loops) was added to autovect-branch.
- 2007-02-21
-
- Vectorization of float/int conversions added to mainline.
- Vectorization of inductions added to mainline.
- Vectorization of function-calls added to mainline.
- Improvements to vectorization of strided-accesses - added to autovect-branch.
- Discussion on building a cost-model for the vectorizer - started on the mailing list.
- 2007-01-14
-
- A new flag to limit vectorization to loops with large enough loop-bound was
added:
--param min-vect-loop-bound=X
prevents
vectorization of loops whose vectorized loop-bound is equal or less than
X
.
- Autovect branch has been reopened:
(
https://gcc.gnu.org/ml/gcc/2007-01/msg00117.html).
- 2006-11-27
-
- Vectorization of loops that operate on multiple data-types, including type
conversions: incorporated into GCC 4.3.
- Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides:
incoporated into GCC 4.3.
- Additional vectorizer related projects planned for GCC 4.3 can be found here:
(
https://gcc.gnu.org/wiki/AutovectBranchOptimizations).
- 2006-02-19
-
- Vectorization of loops that operate on multiple data-types, including type
conversions: submitted for incorporation into GCC 4.2.
- Detection and vectorization of special idioms, such as dot-product and widening-summation:
Incorporated into GCC 4.2.
- Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides:
Incoporated into autovect-branch. To be submitted to GCC 4.2.
- 2005-10-23
- Autovect-branch has been enhanced to support the following features:
- Vectorization of loops that operate on multiple data-types, including type
promotion (conversion to a wider type) and type demotion (conversion to a narrower
type). Type promotion is supported using the new
VEC_UNPACK_HI
and VEC_UNPACK_LO
tree-codes (and the new vec_unpacks_hi/lo
and vec_unpacku_hi/lo
optabs). Type
demotion is supported using the new VEC_PACK_MOD
tree-code (and the new vec_pack_mod
optab).
- Vectorization of idioms that involve type conversion. This allows more efficient
vectorization (if specialized target support is available) that avoids the data
packing/unpacking that is otherwise required to handle multiple data-types. These
idioms include: widening-summation (
WIDEN_SUM
), dot-product (DOT_PROD
),
widening-multiplication (WIDEN_MULT
, VEC_WIDEN_MULT_HI/LO
), multiply-highpart
(MULT_HI
) and sum-of-absolute-differences (SAD
).
- 2005-08-11
- The following enhancements have been incorpoated into GCC 4.1:
- Vectorization of reduction has been introduced and currently supports summation,
and minimum/maximum computation.
- Vectorization of conditional code has been introduced.
- The mechanism of peeling a loop to force the alignment of data accesses has been
improved. We now generate better code when the misalignment of an access is known at
compile time, or when different accesses are known to have the same
misalignment, even if the misalignment amount itself is unknown.
- Dependence distance is considered when checking dependences between data references.
- Loop-closed-ssa-form is incrementally updated during vectorization.
- Generic parts of the data references analysis were cleaned up and externalized to make
this analysis available to other passes.
- 2005-04-05
- Vectorization of reduction on autovect-branch was enhanced to support
maximum/minimum computations, and special reduction idioms such as widening
summation as a step towards supporting patterns like dot-product, sum of
absolute differences and more:
(
https://gcc.gnu.org/ml/gcc-patches/2005-04/msg00532.html).
- 2005-03-01
- Vectorization projects for GCC 4.1: See
https://gcc.gnu.org/wiki/Autovectorization_Enhancements.
- Vectorization capabilities in GCC 4.0: See
2005-03-01 mainline status.
- 2005-02-25
- New features:
- Summation is the first reduction idiom that is vectorized
(autovect-branch only).
- Verbosity levels for vectorization reports.
- Improved line number information.
- Revised data-references analysis.
- 2005-03-01, autovect-branch
-
Description of vectorizable loops:
- Vectorization is restricted to inner most countable loops,
in which all operations operate on data types of the same size,
and all memory accesses are consecutive.
- Certain forms of conditional code.
- Unaligned memory accesses are handled using loop peeling,
or loop versioning, or direct misalignment support.
- Summation reduction is supported (
sum += a[i]
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- Vectorization of the subtract-and-saturate idiom is supported.
- 2005-03-01, mainline (final 4.0 status)
-
Description of vectorizable loops:
- Vectorization is restricted to inner most countable loops,
in which all operations operate on data types of the same size,
and all memory accesses are consecutive.
- Unaligned memory write accesses are handled using loop peeling.
This allows vectorization when there's only a single unaligned
memory-write (or all memory-writes in the loop have the same
misalignment). Unaligned memory read accesses are handled using
direct misalignment support. This support is
currently available for Alpha ev6, mmx, and altivec.
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- No reductions (
sum += a[i]
) or inductions
(a[i] = i
).
- -ftree-vectorizer-verbose=[n] controls vectorization reports,
with n ranging from 0 (no information reported) to 6 (all information
reported).
- 2005-02-02
- New features that are only in autovect-branch:
- Incrementally preserve loop closed SSA form during vectorization.
- Loop versioning guarded by a runtime alignment test.
- Idiom recognition, and vectorization of the subtract-and-saturate idiom.
- Incrementally preserve SSA form during vectorization.
- Improved handling of misalignment in case it is known at compile time,
or in case multiple accesses are known to have the same misalignment.
- Vectorization of conditional code.
- Consider dependence distance.
- 2004-10-27, mainline
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no if-then-else).
- New: No restrictions on the loop bound.
An epilog loop is generated to handle unknown loop bounds
and loop bounds that do not divide by the vectorization factor.
- Supported memory accesses are
multidimensional arrays, and pointers that
are annotated as
__restrict__
.
- All memory accesses are consecutive (stride=1).
- New: Loops with a single unaligned write to memory
(store). This is supported by peeling the loop to
force the alignment of the store.
- Reads from memory (loads) can be unaligned. This support is
currently available for Alpha ev6, mmx, and altivec.
- All operations operate on data types of the same
size.
- No reductions (
sum += a[i]
) or inductions
(a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- 2004-11-10
- New branch for vectorization development opened: autovect-branch.
lno-branch was retired.
(
https://gcc.gnu.org/ml/gcc-patches/2004-11/msg00852.html).
- 2004-10-27
- Mainline merge in progress. Last pending vectorization patches for
mainline:
- Support for vectorization of conditional code
(
https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01768.html).
- Consider dependence distance
(
https://gcc.gnu.org/ml/gcc-patches/2004-10/msg01046.html).
- 2004-10-14
- Support for unknown loop bound
(
https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01104.html,
vectorizer merge part #2) committed to mainline.
- Peeling for alignment
(
https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01894.html,
vectorizer merge part #5) committed to mainline.
- 2004-09-27, mainline
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no if-then-else).
- The loop bound (number of iterations) is known and
divides by the vectorization factor.
- New: Supported memory accesses are
multi-dimensional arrays, and pointers that
are annotated as
__restrict__
.
- New: Additional forms of accesses are supported:
Restrictions on the the initial value of the
access function of pointers and array indexes have been
relaxed. These can now take a form like
p=&a[16]-4B
(pointer), and
a[i+off]
(arrays).
- All memory accesses are consecutive (stride=1).
- Writes to memory (stores) are aligned.
- New: Reads from memory (loads) can be
unaligned.
- All operations operate on data types of the same
size.
- No reductions (
sum += a[i]
) or inductions
(a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- 2004-09-23
- Support for unaligned loads
(
https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01522.html,
vectorizer merge part #4) committed to mainline.
- 2004-09-19
- Support for additional forms of data references
(
https://gcc.gnu.org/ml/gcc-patches/2004-09/msg01521.html,
vectorizer merge part #3) committed to mainline.
- 2004-09-13, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- Supported memory accesses are multi-dimensional arrays,
and pointers that are annotated as
__restrict__
.
- New: Additional forms of accesses are supported:
Restrictions on the the initial value of the access
function of pointers and array indexes have been relaxed.
These can now take a form like
p=&a[16]-4B
(pointer), and a[i+off]
(arrays).
- All memory accesses are consecutive (stride=1)
and aligned.
- Writes to memory (stores) are aligned.
- New: Reads from memory (loads) can be
unaligned.
- Loop versioning for alignment is applied to unaligned
stores.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- 2004-09-02, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- New: The loop bound (number of
iterations) can be unknown at compile time, or
can be known but not divisible by the vectorization
factor.
- New: Supported memory accesses are
multi-dimensional arrays, and pointers
that are annotated as
__restrict__
- All memory accesses are consecutive (stride=1).
- New: Loop versioning for alignment:
in the presence of unaligned accesses create two versions
of the loop controlled by a runtime alignment check. In one
version all the accesses are guaranteed to be aligned, and
it can therefore be vectorized. In the second version there
may be unaligned accesses, and it remains unvectorized.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
- 2004-08-17, mainline
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop bound (number of iterations) is known
and divides by the vectorization factor.
- Supported memory accesses are one-dimensional
arrays, whose alignment can be forced (not
extern
,
and stack boundary of target platform allows it),
and aligned pointers that are annotated as
__restrict__
.
- All memory accesses are consecutive (stride=1)
and aligned.
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] =
x
).
Examples of vectorizable loops: loop, loop
- 2004-08-17, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- The loop bound (number of iterations) can be
unknown.
- New: Supported memory accesses
are multidimensional arrays, whose
alignment can be forced (not extern), and aligned
pointers that are annotated as
__restrict__
.
- All memory accesses are consecutive (stride=1)
and aligned.
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] =
x
).
Examples of newly vectorizable loops: loop
- 2004-08-17
- Initial vectorization functionality committed to mainline
(
https://gcc.gnu.org/ml/gcc-patches/2004-08/msg01219.html,
https://gcc.gnu.org/ml/gcc-patches/2004-07/msg02127.html,
vectorizer merge part #1).
- 2004-07-20, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- The loop bound (number of iterations) can be
unknown.
- New: Supported memory accesses
are one-dimensional arrays, whose base can
be a struct field, and whose alignment can
be forced (not extern arrays), and aligned pointers
that are annotated as
__restrict__
.
- All memory accesses are consecutive (stride=1)
and aligned. Arrays are alignable
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] =
x
).
- New: first gfortran program
vectorized.
Examples of newly vectorizable loops: loop, loop
- 2004-06-25, apple-ppc-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- The loop bound (number of iterations) can be
unknown.
- Supported memory accesses are one-dimensional
arrays, and pointers. If more than one memory
access is present in the loop, any pointers that
are used in the loop have to be annotated as
__restrict__
.
- Store (memory-write) accesses have to be
aligned.
- New: Loads (memory-reads) can
be unaligned by an unknown amount (e.g.
access
a[i+x]
, where the value of x is
unknown). Misalignment support for
loads was also made more efficient.
- All memory accesses are consecutive
(stride=1).
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- Some forms of if-then-else patterns can be
vectorized.
- New: Infrastructure for idiom
recognition has been added. The first computation
idiom that is recognized and vectorized is a
multiplication of unsigned chars, whose (unsigned
short) product is converted back to unsigned char
(a similar computation comes up in pixel blending;
we support a simplified version that does not
require operating on different data
types).
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] = x
).
New: Support for invariants was made more
efficient.
Examples of newly vectorizable loops: loop
- 2004-06-23, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- The loop bound (number of iterations) can be
unknown.
- Supported memory accesses are one-dimensional
arrays, whose alignment can be forced (not extern
arrays).
- New: Memory accesses can also be
pointer based. If more than one memory access is
present in the loop, any pointers that are used in
the loop have to be annotated as
__restrict__
. The pointers have to
point to an aligned address.
- All memory accesses are consecutive (stride=1)
and aligned.
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] =
x
).
Examples of newly vectorizable loops: loop
- 2004-06-17, apple-ppc-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- The loop bound (number of iterations) can be
unknown.
- Memory accesses are one dimensional-arrays,
whose alignment can be forced (not extern
arrays).
- New: Memory accesses can also be
pointer based. If more than one memory access is
present in the loop, any pointers that are used in
the loop have to be annotated as
__restrict__
. (new experimental
feature)
- New: Loads (memory reads) can be
unaligned by a known amount (e.g. access
a[i+1]
, where array a
is
aligned and i
starts from 0).
Stores (memory writes) still have to be
aligned.
- All memory accesses are consecutive
(stride=1).
- Supportable operations include plus/minus/mult,
as well as bitwise operations -
and/or/xor/1's-complement, according to available
vector support on the target platform.
- All operations operate on data types of the
same size.
- New: Some forms of if-then-else
patterns can be vectorized. (new experimental
feature).
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- Constants and invariants are supported
(
a[i] = 5
, a[i] =
x
).
Examples of newly vectorizable loops: loop
- 2004-06-04, lno-branch (and
apple-ppc-branch)
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- New: No restrictions on the form of the
loop index and the loop exit
condition.
- New: The loop bound (number of
iterations) can be unknown. Currently
known loop-bounds still need to be divisible by the
vectorization factor.
- All memory accesses are one dimensional-arrays,
whose alignment can be forced (not extern
arrays).
- All array accesses are consecutive and aligned;
i,e. all the array references are of the form
a[i]
, where i
is updated
from 0 to N in steps of 1.
- New: Supportable operations
include plus/minus/mult, as well as bitwise
operations - and/or/xor/1's-complement,
according to available vector support on the target
platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- New: Constants and invariants are
supported (
a[i] = 5
, a[i] =
x
).
Examples of newly vectorizable loops: loop
- 2004-01-01, lno-branch
-
Description of vectorizable loops:
- Inner most loops that consist of a single
basic block (i.e, straight-line code, no
if-then-else).
- The loop has to be countable - i.e, the number
of iterations can be evaluated before the loop
starts to execute.
- Known (constant) loop bound, divisible by the
vectorization factor.
- The loop index
i
is updated from 0
to N in steps of 1, and the loop exit condition is
of the form i<N
.
- All memory accesses are one dimensional-arrays,
whose alignment can be forced (not extern
arrays).
- All array accesses are consecutive (stride=1)
and aligned.
- Supportable operations include plus/minus/mult,
according to available vector support on the target
platform.
- All operations operate on data types of the
same size.
- No reductions (
sum += a[i]
) or
inductions (a[i] = i
).
- No Constants or invariants in the loop
(
a[i] = 5
).
Examples of vectorizable loops: loop
References/Documentation
- "Vapor SIMD: Auto-vectorize once, run everywhere",
Dorit Nuzman, Sergei Dyshel, Erven Rohou, Ira Rosen, Kevin Williams,
David Yuste, Albert Cohen, Ayal Zaks, CGO 2011: 151-160
- "Polyhedral-Model Guided Loop-Nest Auto-Vectorization",
Konrad Trifunovic, Dorit Nuzman, Albert Cohen, Ayal Zaks and Ira Rosen,
PACT 2009
- "Outer-loop vectorization: revisited for short SIMD architectures",
Dorit Nuzman and Ayal Zaks, PACT 2008
- "Loop-Aware SLP in GCC", Ira Rosen, Dorit Nuzman
and Ayal Zaks, GCC summit, July 2007.
GCC Summit 2007 Proceedings
- "Autovectorization in GCC - two years later", Dorit Nuzman and Ayal Zaks,
GCC summit, June 2006.
GCC Summit 2006 Proceedings
- "Auto-Vectorization of Interleaved Data for SIMD",
Dorit Nuzman, Ira Rosen and Ayal Zaks.
ACM SIGPLAN 2006 Conference on Programming Language Design
and Implementation (PLDI), Ottawa, Canada, June 10-16, 2006.
- "Multi-platform Auto-vectorization", Dorit Nuzman and
Richard Henderson, CGO-4 (The 4th Annual International
Symposium on Code Generation and Optimization), March 26-29, 2006,
Manhattan, New York.
- "Autovectorization in GCC", Dorit Naishlos, GCC summit, June 2004.
https://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf
- "The Software Vectorization Handbook. Applying Multimedia
Extensions for Maximum Performance.", Aart Bik, Intel Press,
June 2004.
- "Vectorization for SIMD Architectures with Alignment
Constraints", Alexandre E. Eichenberger, Peng Wu, Kevin O'brien,
PLDI'04, June 9-11 2004.
- "Optimizing
Compilers for Modern Architectures - A dependence based
approach", Randy Allen & Ken Kennedy, Morgan Kaufmann
Publishers, San Francisco, San Diego, New York (2001).
- "Exploiting Superword Level Parallelism with Multimedia
Instruction Sets", Samuel Larsen and Saman Amarasinghe,
PLDI 2000.
High-Level Plan of Implementation (2003-2005)
The table below outlines the high level vectorization scheme
along with a proposal for an implementation scheme, as
follows:
-
The first column ("vectorization driver") lists the
tasks that the vectorizer must consist of. It briefly
describes the expected functionality of each task.
-
The second column ("basic-vectorizer") describes a
proposal for a basic vectorizer that provides minimal
support for each of these tasks, listing the
restrictions that will be imposed by the basic
vectorizer on candidate loops. Loops that are
considered vectorizable by the basic-vectorizer are of
the form: for(i=0; i<N; i++) {a[i] = b[i] +
c[i]; }.
The "basic vectorizer" was implemented by
Dorit (Naishlos) Nuzman.
-
The third column ("enhancements") lists possible
directions for extending the capabilities of the basic
vectorizer. Some of these enhancements are aimed at
improving the quality of the vector code that is being
generated. Other enhancements aim at broadening the
range of computations that are amenable for
vectorization. Other focus on improved robustness.
Following the table is a complete and detailed list of
these enhancements. Next to each item which is already
being addressed, there should be the name of the
relevant contact person.
vectorization driver |
basic vectorizer |
enhancements |
analyze_loop_CFG(loop)
Checks the control flow
properties of the loop (number of
basic-blocks it consists of, nesting, single
entry/exit, etc.), in order to determine
whether the control flow of the loop falls
within the range of loop forms that are
supported by this vectorizer.
|
- inner-most (single nest) loops.
- single basic block loops; i.e., no
if-then-else constructs, etc. (in practice
this means loops that consist of exactly
two basic blocks - header+latch).
- other restrictions (single
successor/predecessor, a pre-header
block).
|
|
analyze_loop_index_and_bound(loop)
Analyzes the loop termination condition to
determine the loop bound and properties of the
loop index (its bounds and step). The
functionality of this utility should be largely
provided by the information computed by the
Induction Variable
Analyzer.
|
- handle simple normalized loops (loop
index is a trivial IV with step 1, etc.),
with a simple termination condition.
- loop-bound known at compile time.
loop-bound >= vector_size
and loop-bound % vector_size =
0 .
|
|
analyze_loop_stmts(loop-stmts)
Scan the loop statements and check whether
there are any statements that prohibit
vectorization (function calls, statements that
don't have a mapping to a built-in vector
function, etc.)
|
- simple operations for which there's a
1-1 mapping between the scalar and vector
operations.
- no support for scalar expansion,
induction variables, reduction
operations...
- no mixture of data types (all
statements operate on the same data
types).
|
|
analyze_access_pattern(loop-mem-refs)
Analyze the memory references in the loop,
and classify them
according to the access pattern that they
exhibit.
|
- support only memory accesses which are
array references (no pointers...).
- support only consecutive (unit stride)
access pattern.
|
|
analyze_alignment(loop-mem-refs)
Analyze the alignment of the memory
references in the loop. For each memory
reference, record its misalignment amount, if
it can be resolved at compile time.
|
- misalignment amount for all memory
references is known at compile time.
- misalignment is zero for all references
(all references are aligned).
|
|
analyze_loop_carried_dependences(loop)
Build the loop dependence graph (for scalar
and array references); Detect Strongly
Connected Components (SCCs) in the graph
(statements that are involved in a dependence
cycle); Perform a topological sort on the
reduced graph (in which each SCC is represented
by a single node); Only singleton nodes w/o
self dependencies can be vectorized. If other
(compound) nodes (which represent SCCs) are
present, loop transformations are required.
|
- handle only loops that do not contain
any SCCs (i.e., no dependence cycles).
- the only scalar loop-carried
dependencies allowed are of IVs which are
used in array references or for the loop
index (i.e., reduction is not
supported).
- use simplest dependence tests.
|
|
estimate_vectorization_profitability(loop)
At this point, it has been determined that
the loop is vectorizable. It remains to decide
whether it is indeed profitable to vectorize
it.
|
- vectorize all loops with loop_bound
>= MIN_LIMIT (?).
|
|
vectorize_loop(loop)
Replace the scalar statements with the
corresponding vector statements (which could be
calls to builtin functions); Also change the
loop bound accordingly.
|
|
|
The following is a list of independent directions by which
the basic vectorizer can be enhanced. It should be possible for
different people to work on different items on this list. Some
of these items are already under development, or (partially)
supported.
-
>Loop detection and loop CFG analysis
Detect loops, and record some basic control flow
information about them (contained basic blocks, loop
pre-header, exit and entry, etc.).
Status: Loop detection and control flow
analysis is already supported (cfgloop.c
,
cfgloopanal.c
).
-
Modeling the target machine vector capabilities to
the tree
-level.
Expose the required target specific information to
the tree
level. This includes providing a
mapping from scalar operations to the corresponding
vector support, which will answer the following
questions:
- Does vector support for this operation
exist?
- At what cost?
- How to express the vector operation at the
tree-
level?
The general SIMD support in GCC already provides
some initial support; For simple operations which can
be expressed using existing (scalar)
tree-codes
(PLUS_EXPR, MULT_EXPR,
etc.) the existing infra-structure can provide answers
for questions 1 and 2 above, however, the
tree-
level currently does not have an idea
about the cost that this transformation actually
entails. A first basic implementation will support only
simple operations that fall into the above category. As
the capabilities of the vectorizer are extended, it
will be required to inform the vectorizer of the
advanced capabilities available in the architecture
(for example, support for operations on complex
numbers, reduction, etc.). Such operations cannot be
expressed using existing tree-codes. Possible
solutions: introduce new tree-codes (and corresponding
optabs); introduce new builtins that are exposed to the
compiler; use target hooks to handle these cases (the
hook could return a call to a machine specific builtin
function). Another related design question that needs
to be addressed here is how much information to expose
to the tree-level (is it sufficient to indicate that
conditional vector addition is supported, or do we want
the vectorizer to actually generate the required
masking/predication/select operations depending on the
target? similarly for alignment, multiplication of
integers, etc.).
Status: Open for discussion.
Related discussion:
https://gcc.gnu.org/ml/gcc/2004-08/msg00317.html
-
Enhance the Builtins Support
Currently the tree optimizers do not know the
semantics of target specific builtin functions, so they
do not attempt to optimize them (or to SSA the
variables passed as arguments to these functions).
Since the vectorizer will probably end up generating
calls to target specific builtin functions, this
situation needs to be improved, i.e. - the semantics of
these builtins needs to somehow be exposed to the
compiler.
Status: Open for discussion.
-
Cost Model
There is an overhead associated with vectorization
-- moving data in to/out of vector registers
before/after the vectorized loop, aligning of data
accesses, etc. It is required to incorporate a cost
model into the machine description in order to allow
the vectorizer to evaluate whether it is worth while to
vectorize a given loop. One can also consider using run
time tests to decide which version of the loop to
execute (scalar or vectorized).
Status: Open for discussion.
Related discussion:
https://gcc.gnu.org/ml/gcc-patches/2003-09/msg00469.html
-
Induction Variable Analysis
Used by the vectorizer to detect loop bound, analyze
access patterns and analyze data dependencies between
array references. One option is that the dependence
tests would be designed deal with array references that
are expressed in terms of a linear function of the
iteration counter (in this case, vectorization will
also benefit from optimizations like induction variable
substitution and replacement of auxiliary IVs with
linear functions of the loop index). A dependence
tester that is based on IVs represented in this form
would analyze each subscript of each array reference,
and apply the appropriate dependence test (SIV, ZIV,
MIV etc., see dependence
testing). Alternatively, an induction variable
evolution analyzer could provide a different
implementation to the dependence tester. This is the
solution that is currently used, based on the Induction
Variable evolution analysis developed by Sebastian
Pop.
Status: Using the IV evolution analyzer
developed by Sebastian Pop.
-
Dependence Testing
Following the classic dependence-based approach for
vectorization as described in [1], apply dependence tests to pairs
of array references in each loop nest, and analyze the
resulting dependence graph. We will start from a
dependence analyzer that relies on the array references
being expressed in terms of a linear function of the
loop index, apply the simplest dependence tests to all
pairs of memory read/write and write/write, and
gradually extend its capabilities. The scheme below
follows the algorithm described in [2]:
- Partition the array subscripts into separable
sets (subscripts which index does not occur in
other subscripts), and apply the simplest tests for separable
subscripts (strong SIV (single index variable)
tests). [done]
- Incorporate more advanced dependence tests;
first, tests for separable subscripts - e.g., weak
SIV tests, MIV (Multiple Index Variable) tests,
followed by tests for coupled subscripts, possibly
up to integer linear programming like
Fourier-Motzkin elimination.
[some of these have been implemented]
- Compute dependence distance, and prune
dependencies with distance > vector_size. [done]
- Try to handle cycles in the dependence graph of
the loop (by performing loop distribution,
etc.).
- Generalize the dependence tester to nested
loops.
Status: Many of the tests above are implemented.
Omega test is in the works. We don't have a DDG graph
built based on the dependence tests.
-
Access Pattern Analysis
The memory architecture usually allows only
restricted accesses to data in memory; one of the
restrictions is that the accessed data must be
consecutive in memory. Any other accesses (strided for
example) require to perform special permutation of the
data in order to pack the data elements in the right
order into a vector register. Support for different
access patterns consists of the following stages:
- Classify the access pattern of each array
reference.[done, using the scalar evolution analyzer]
- Trivially handle consecutive (unit stride)
access patterns (a[i]). [done]
- Handle strided access patterns (a[2*i]). The
stride 2 access pattern appears in computations on
complex numbers, where the real and imaginary parts
are interleaved in the input/output array. [done]
- Handle other types of access patterns, e.g. reverse access,
strides that are not a power-of-2.
- Support pointer arithmetic. [done]
Status: Partial support in place.
-
Extend the range of supportable operations
At first, the only computations that will be
vectorized are those for which the vectorization
process consists of trivially replacing each scalar
operation in the loop with its vector counterpart. This
includes simple loads, stores and arithmetic operations
that operate on the same data type. Some computations
require extra code to be generated in order to
vectorize. These include:
- computations with mixed types; these require
proper promotion/demotion between vectors of
different sizes. [done]
- computations that involve loop invariants
(
a[i] = N
) and require scalar
expansion. [done]
- computations that involve induction variables
(
a[i] = i
), require scalar expansion, and
proper initialization and update code.
[planned]
Status: Partial support in place.
-
Alignment
The memory architecture usually allows only
restricted accesses to data in memory. One of the
restrictions is that data accesses need to be properly
aligned on a certain boundary. Even if the architecture
supports unaligned accesses, these are usually much
more costly than aligned accesses. The work on
alignment consists of several stages:
- Compute the misalignment properties of each
memory access. [Initial support in place.
Contact: Daniel Berlin]
- Handling of aligned memory accesses only (do
not attempt to vectorize loops that contain
unaligned accesses). [done]
- If it is impossible to determine at compile
time whether the memory access is aligned, create two versions of the loop and
use a runtime test to decide which version to
execute: the original scalar version (if the data
access is not aligned), or the vectorized version
(if the access is aligned). [done. Contact: Keith Besaw]
- Develop optimizations that increase the
alignment of data accesses (static loop peeling,
dynamic loop peeling, etc.). [done. Contact: Olga Golovanevsky]
- Vectorize unaligned accesses. There are
different ways to do that, depending on whether the
target supports unaligned accesses, and also
depending on what we want to implement at the tree
level, and what we want to leave for the RTL level
to handle. [done for loads]
- Build a cost model to decide when to apply
loop-peeling and/or loop-versioning to force alignment,
and when to generate unaligned vector accesses.
Status: Currently the way we handle unaligned
stores is by peeling the loop to force the alignment of
the store. This is not always applicable. Vectorizing
unaligned stores is in the works.
-
Idiom Recognition
It is often the case that complicated computations
can be reduced into a simpler, straight-line sequence
of operations. These operations may not be directly
supported in a scalar form, but are supported by the
target in a vector form. Such cases include:
Status: Partial support in place.
-
Conditional Execution
The general principle we are trying to follow is to
keep the actual code transformation part of the
vectorizer as simple as possible: a simple scan of
straight-line code, and a one-to-one replacement of
each scalar operation with the equivalent vector
operation. To support this scheme in the presence of
conditional execution, we'll need to flatten the loop
body by collapsing if-then-else into a conditional
(scalar) operation (something like transforming -
'if (x) {c = PLUS (a,b)}
' into
'PLUS_COND(a,b,x)
'. These will later be
replaced with a conditional vector operation using
whatever support is available on the target (masking,
predication or select operation). Flattening the loop
body this way will greatly simplify the vectorizer.
Some of the issues to think about here: (a) how to
represent these conditional operations, (b) to what
extent does the tree vectorizer need to be aware of the
specific target support that is available for
conditional vector execution (mask/predicate/select),
and (c) how to allow a simple way to reverse this
transformation if the loop doesn't end up getting
vectorized.
Status: Done. Contact: Devang Patel.
-
- Support general loop bound (unknown, or doesn't
divide by the vector size). [done. Contact: Olga
Golovanevsky]
- Support more complex forms of loop termination
condition and loop index update. [done]
- Support outer-loop vectorization (unroll and
jam). [planned]
- Relax other restrictions on the loop form.
Status: Partial support in place.
-
Handle Pointer Aliasing
- Improve aliasing analysis. [various GCC projects
deal with improvements to alias analysis].
- Generate run-time tests for cases where memory
anti-aliasing cannot be resolved at compile
time. [Planned. Contact: Keith Besaw]
- Support user hints. [in the works. See
https://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.
Contact: Devang Patel]
Status: In the works.
-
Aliasing and Virtual def-use Chains
Address the item from the tree-ssa todo list - "SSA
information for arrays : The existing implementation
treats arrays as an opaque object. A definition to an
array location is treated as a definition for the whole
array"
Status: Open for discussion.
-
Array addressing Represented as Pointer
Arithmetic
Address the issue mentioned in
https://gcc.gnu.org/ml/gcc/2003-07/msg02013.html,
which turns out to be a front end issue.
Status: In the works. See
https://gcc.gnu.org/wiki/Array_references_on_pointers
and
https://gcc.gnu.org/ml/gcc-patches/2005-05/msg01577.html.
-
Loop versioning
Provide utilities that allow performing the
following transformation: Given a condition and a loop,
create -'if (condition) { loop_copy1 } else {
loop_copy2 }', where loop_copy1 is the loop transformed
in one way, and loop_copy2 is the loop transformed in
another way (or unchanged). 'condition' may be a run
time test for things that were not resolved by static
analysis (overlapping ranges (anti-aliasing),
alignment, etc.).
Status: Done. Contact: Devang Patel.
-
These include:
- loop interchange, and other unimodular
transformations.
- loop distribution.
- collapsing tightly nested loops to a single
loop.
- loop unswitching.
Status:Linear loop transformations are
implemented by Daniel Berlin.
-
Other Optimizations
- Exploit data reuse (a la "Compiler-Controlled
Caching in Superword Register Files for Multimedia
Extension Architectures" by Shin, Chame and Hall).
Mostly relies on unroll & jam having been
applied.
- Vectorize loops that can't be vectorized using
the classic vectorizer (until the proper loop
transformations are developed) by applying SLP
vectorization (a la "Exploiting Superword Level
Parallelism with Multimedia Instruction Sets" by
Amarasinghe and Larsen). This scheme can
potentially more easily vectorize partially
vectorizable loops, or loops that are already
unrolled in the source code. It is possible to
implement SLP vectorization either in the tree
level or at the RTL level as a complementary
approach to classic loop vectorization.
Status: Future work.
-
User Hints
Using user hints for different purposes
(aliasing, alignment, profitability of vectorizing
a loop, etc.).
Status: In the works. See
https://gcc.gnu.org/ml/gcc-patches/2005-02/msg01560.html.