Compute Bound Code¶

Many Ways of Optimization¶

There are many ways to try to optimize code …

Unnecessary ones
Using better algorithms (e.g. sorting and binary search)
Function call elimination (inlining vs. spaghetti)
Loop unrolling
Strength reduction (e.g. using shift instead of mult/div)
Tail call elimination
…

Unnecessary Optimizations¶

if (x != 0)
    x = 0;

The rumour goes that this is not faster than unconditional writing
Produces more instructions, at least

Inlining (1)¶

Facts up front:

Function calls are generally fast
A little slower when definition is in a shared library
Instruction cache, if used judiciously, makes repeated calls even faster
But, as always: it depends

Possible inlining candidate

int add(int l, int r)
{
    return l + r;
}

Inlining (2)¶

A couple rules

Always write clear code
Never not define a function because of performance reasons
- Readability first
- Can always inline later, during optimization
Don’t inline large functions ⟶ instruction cache pollution when called from different locations
Use static for implementation specific functions ⟶ compiler has much more freedom

Inlining (3)¶

GCC …

Does not optimize by default
Ignores explicit texttt{inline} when not optimizing
-finline-small-functions (enabled at -O2): inline when function call overhead is larger than body (even when not declared inline)
-finline-functions (enabled at -O3): all functions considered for inlining ⟶ heuristics
-finline-functions-called-once (enabled at -O1, -O2, -O3, -Os`): all static functions that …
More ⟶ info gcc

Register Allocation (1)¶

Register access is orders of magnitude faster than main memory access
- ⟶ Best to keep variables in registers rather than memory
CPUs have varying numbers of registers
- register keyword should not be overused
- Ignored anyway by most compilers
Register allocation
- Compiler performs flow analysis
- Live vs. dead variables
- “Spills” registers when allocation changes

Compiler generally makes better choices than the programmer!

Register Allocation (2)¶

GCC …

-fira-* (for Integrated Register Allocator)
RTFM please
A lot of tuning opportunities for those who care

Peephole Optimization¶

Peephole: manageable set of instructions; “window”
Common term for a group of optimizations that operate on a small scale
- Common subexpression elimination
- Strength reduction
- Constant folding
Small scale ⟶ “basic block”

Peephole Optimization: Common Subexpression Elimination¶

Sometimes one writes redundant code, in order to not compromise readability by introducing yet another variable …

a = b + c + d;
x = b + c + y;

This can be transformed into

tmp = b + c; /* common subexpression */
a = tmp + d;
x = tmp + y;

Peephole Optimization: Strength Reduction¶

Most programmers prefer to say what they mean (fortunately) …

x = y * 2;

The same effect, but cheaper, is brought about by …

x = y << 1;

If one knows the “strength” of the operators involved (compilers tend to know), then even this transformation can be opportune …

x = y * 3; /* y*(4-1) == y*4-y */
x = (y << 2) - y;

Peephole Optimization: Constant Folding¶

Another one that might look stupid but readable …

x = 42;
y = x + 1;

… is likely to be transformed into …

x = 42;
y = 43;

Consider transitive and repeated folding and propagation $to$ pretty results

Loop Invariants¶

The following bogus code …

while (1) {
    x = 42; /* loop invariant */
    y += 2;
}

… will likely end up as …

x = 42;
while (1)
    y += 2;

At least with a minimal amount of optimization enabled (GCC: -fmove-loop-invariants, enabled with -O1 already)

Loop Unrolling¶

If a loop body is run a known number of times, the loop counter can be omitted.

for (i=0; i<4; i++)
    dst[i] = src[i];

This can be written as

dst[0] = src[0];
dst[1] = src[1];
dst[2] = src[2];
dst[3] = src[3];

Complicated heuristics: does the performance gain outweigh instruction cache thrashing?
⟶ I’d keep my fingers from it!

Tail Call Optimization¶

int f(int i)
{
   do_something(i);
   return g(i+1);
}

g() is called at the end
f()’s stack frame is not used afterwards
Optimization: g() can use f()’s stack frame

CPU Optimization, Last Words¶

Once more: Write clean Code!

All optimization techniques explained are performed automatically, by the compiler
Theory behind optimization is well understood ⟶ engineering discipline
Compilers generally perform optimizations better than a programmer would
- … let alone portably, on different CPUs!
“Optimization” is a misnomer ⟶ “Improvement”
- Compiler cannot make arbitrary code “optimal”
- Bigger picture is always up to the programmer
- ⟶ Once more: Write clean Code!
Work together with compiler ⟶ use static, const

GCC: Optimization “Levels”¶

-O0: optimization off; the default
-O1: most basic optimizations; does as much as possible without compromising compilation time too much
-O2: recommended; does everything which has no size impact, is
unagressive, and doesn’t completely chew compilation time
-O3: highest level possible; somewhat agressive, can break things sometimes, eats up your CPU and memory while compiling
-Os: optimize for size; all of -O2 that doesn’t increase size
-Og (since GCC 4.8): “developer mode”; turns on options that don’t interfere with debugging or compilation time