.. include::

.. jf-topic:: c.performance.computebound
   :dependencies: c.performance.introduction

Compute Bound Code
==================

.. https://class.coursera.org/compilers/lecture
.. http://en.wikibooks.org/wiki/Compiler_Construction/Optimization
.. http://en.wikipedia.org/wiki/Optimizing_compiler#Optimization_techniques_2
..
.. http://wiki.gentoo.org/wiki/GCC_optimization
.. --------------------------------------------
..
.. -O0: This level (that's the letter "O" followed by a zero) turns
.. off optimization entirely and is the default if no -O level is
.. specified in CFLAGS or CXXFLAGS. This reduces compilation time and
.. can improve debugging info, but some applications will not work
.. properly without optimization enabled. This option is not
.. recommended except for debugging purposes.
..
.. -O1: This is the most basic optimization level. The compiler will
.. try to produce faster, smaller code without taking much
.. compilation time. It's pretty basic, but it should get the job
.. done all the time.
..
.. -O2: A step up from -O1. This is the recommended level of
.. optimization unless you have special needs. -O2 will activate a
.. few more flags in addition to the ones activated by -O1. With -O2,
.. the compiler will attempt to increase code performance without
.. compromising on size, and without taking too much compilation
.. time.
..
.. -O3: This is the highest level of optimization possible. It
.. enables optimizations that are expensive in terms of compile time
.. and memory usage. Compiling with -O3 is not a guaranteed way to
.. improve performance, and in fact in many cases can slow down a
.. system due to larger binaries and increased memory usage. -O3 is
.. also known to break several packages. Therefore, using -O3 is not
.. recommended.
..
.. -Os: This option will optimize your code for size. It activates
.. all -O2 options that don't increase the size of the generated
.. code. It can be useful for machines that have extremely limited
.. disk storage space and/or have CPUs with small cache sizes.
..
.. -Og: In GCC 4.8, a new general optimization level, -Og, has been
.. introduced. It addresses the need for fast compilation and a
.. superior debugging experience while providing a reasonable level
.. of runtime performance. Overall experience for development should
.. be better than the default optimization level -O0. Note that -Og
.. does not imply -g, it simply disables optimizations that may
.. interfere with debugging.
..
.. -Ofast: New in GCC 4.7, consists of -O3 plus -ffast-math,
.. -fno-protect-parens, and -fstack-arrays. This option breaks strict
.. standards compliance, and is not recommended for use.
..
.. As previously mentioned, -O2 is the recommended optimization level. If
.. package compilation fails and you aren't using -O2, try rebuilding
.. with that option. As a fallback option, try setting your CFLAGS and
.. CXXFLAGS to a lower optimization level, such as -O1 or even -O0 -g2
.. -ggdb (for error reporting and checking for possible problems).

Many Ways of Optimization
-------------------------

**There are many ways to try to optimize code** ...

* Unnecessary ones
* Using better algorithms (e.g. sorting and binary search; see the short
  sketch after this list)
* Function call elimination (inlining vs. spaghetti)
* Loop unrolling
* Strength reduction (e.g. using shift instead of mult/div)
* Tail call elimination
* ...

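The "better algorithms" point is usually where the biggest wins hide, long
before any of the compiler tricks below. As a minimal sketch (the data and
names are made up for illustration), replacing a linear scan over sorted data
with a binary search via ``bsearch()`` turns an O(n) lookup into O(log n):

.. code-block:: c

   #include <stdio.h>
   #include <stdlib.h>

   /* Comparison callback for bsearch(): returns <0, 0, >0 */
   static int cmp_int(const void *a, const void *b)
   {
       int l = *(const int *)a;
       int r = *(const int *)b;
       return (l > r) - (l < r);
   }

   int main(void)
   {
       int sorted[] = { 2, 3, 5, 7, 11, 13, 17, 19 };
       int key = 13;

       /* O(log n) lookup instead of walking the whole array */
       int *hit = bsearch(&key, sorted,
                          sizeof(sorted) / sizeof(sorted[0]),
                          sizeof(sorted[0]), cmp_int);

       printf("%d %s\n", key, hit ? "found" : "not found");
       return 0;
   }

No optimization level will turn a linear scan into a binary search for you;
algorithmic choices remain the programmer's job.
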
Unnecessary Optimizations
-------------------------

.. code-block:: c

   if (x != 0)
       x = 0;

* Rumour has it that this is no faster than writing unconditionally
* Produces more instructions, at least

Inlining (1)
------------

**Facts up front:**

* Function calls are generally fast
* A little slower when the definition lives in a shared library
* The instruction cache, if used judiciously, makes repeated calls even faster
* But, as always: it depends

**Possible inlining candidate**

.. code-block:: c

   int add(int l, int r)
   {
       return l + r;
   }

Inlining (2)
------------

**A couple of rules**

* Always write clear code
* Never *not* define a function for performance reasons
* *Readability first*
* Can always inline later, during optimization
* Don't inline large functions |longrightarrow| instruction cache pollution
  when called from different locations
* Use ``static`` for implementation-specific functions |longrightarrow|
  compiler has much more freedom

Inlining (3)
------------

**GCC ...**

* Does not optimize by default
* Ignores an explicit ``inline`` when not optimizing
* ``-finline-small-functions`` (enabled at ``-O2``): inline when the function
  call overhead is larger than the body (even when not declared ``inline``)
* ``-finline-functions`` (enabled at ``-O3``): all functions considered for
  inlining |longrightarrow| heuristics
* ``-finline-functions-called-once`` (enabled at ``-O1``, ``-O2``, ``-O3``,
  ``-Os``): all ``static`` functions that ...
* More |longrightarrow| ``info gcc``

Register Allocation (1)
-----------------------

* Register access is orders of magnitude faster than main memory access
* |longrightarrow| Best to keep variables in registers rather than memory
* CPUs have varying numbers of registers
* ``register`` keyword should not be overused

  * Ignored anyway by most compilers

* Register allocation

  * Compiler performs flow analysis
  * Live vs. dead variables
  * "Spills" registers when allocation changes

**The compiler generally makes better choices than the programmer!**

Register Allocation (2)
-----------------------

**GCC ...**

* ``-fira-*`` (for Integrated Register Allocator)
* RTFM please
* A *lot* of tuning opportunities for those who care

Peephole Optimization
---------------------

* **Peephole**: a small, manageable set of instructions; a "window" onto the code
* Common term for a group of optimizations that operate on a small scale

  * Common subexpression elimination
  * Strength reduction
  * Constant folding

* Small scale |longrightarrow| "basic block"

Peephole Optimization: Common Subexpression Elimination
--------------------------------------------------------

Sometimes one writes redundant code so as not to compromise readability by
introducing yet another variable ...

.. code-block:: c

   a = b + c + d;
   x = b + c + y;

This can be transformed into

.. code-block:: c

   tmp = b + c; /* common subexpression */
   a = tmp + d;
   x = tmp + y;

Peephole Optimization: Strength Reduction
-----------------------------------------

Most programmers prefer to say what they mean (fortunately) ...

.. code-block:: c

   x = y * 2;

The same effect, but cheaper, is brought about by ...

.. code-block:: c

   x = y << 1;

If one knows the "strength" of the operators involved (compilers tend to
know), then even this transformation can be opportune ...

.. code-block:: c

   x = y * 3;        /* y*3 == y*4 - y */
   x = (y << 2) - y;

Peephole Optimization: Constant Folding
---------------------------------------

Another one that might look stupid but reads well ...

.. code-block:: c

   x = 42;
   y = x + 1;

... is likely to be transformed into ...
.. code-block:: c

   x = 42;
   y = 43;

Consider transitive and repeated folding and propagation |longrightarrow|
pretty results

Loop Invariants
---------------

The following bogus code ...

.. code-block:: c

   while (1) {
       x = 42; /* loop invariant */
       y += 2;
   }

... will likely end up as ...

.. code-block:: c

   x = 42;
   while (1)
       y += 2;

At least with a minimal amount of optimization enabled (GCC:
``-fmove-loop-invariants``, enabled at ``-O1`` already)

Loop Unrolling
--------------

If a loop body runs a known number of times, the loop counter can be omitted.

.. code-block:: c

   for (i = 0; i < 4; i++)
       dst[i] = src[i];

This can be written as

.. code-block:: c

   dst[0] = src[0];
   dst[1] = src[1];
   dst[2] = src[2];
   dst[3] = src[3];

* *Complicated heuristics*: does the performance gain outweigh the instruction
  cache thrashing?
* |longrightarrow| I'd keep my hands off it!

Tail Call Optimization
----------------------

.. code-block:: c

   int f(int i)
   {
       do_something(i);
       return g(i + 1);
   }

* ``g()`` is called *at the end*
* ``f()``'s stack frame is not used afterwards
* **Optimization:** ``g()`` can reuse ``f()``'s stack frame

CPU Optimization, Last Words
----------------------------

Once more: **Write clean code!**

* All optimization techniques explained here are performed *automatically*, by
  the compiler
* The theory behind optimization is well understood |longrightarrow| an
  engineering discipline
* Compilers generally perform optimizations better than a programmer would
* ... let alone portably, on different CPUs!
* "Optimization" is a misnomer |longrightarrow| "improvement"

  * The compiler cannot make arbitrary code "optimal"
  * The bigger picture is always up to the programmer
  * |longrightarrow| Once more: **Write clean code!**

* Work together with the compiler |longrightarrow| use ``static``, ``const``

GCC: Optimization "Levels"
--------------------------

* ``-O0``: optimization off; the default
* ``-O1``: most basic optimizations; does as much as possible without
  compromising compilation time too much
* ``-O2``: recommended; does everything that has no size impact, is
  unaggressive, and doesn't completely chew up compilation time
* ``-O3``: highest level possible; somewhat aggressive, can occasionally break
  things, and eats up your CPU and memory while compiling
* ``-Os``: optimize for size; all of ``-O2`` that doesn't increase size
* ``-Og`` (since GCC 4.8): "developer mode"; turns on only options that don't
  interfere with debugging or compilation time
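To see what the individual levels actually do, it helps to compile a small
test case at different levels and compare the generated assembly (e.g. with
``gcc -S``). The following is a minimal sketch only; the file and function
names are made up, and what exactly gets folded, unrolled or inlined depends
on your GCC version and target:

.. code-block:: c

   /* opt_demo.c (hypothetical name) -- compare e.g.
    *     gcc -O0 -S opt_demo.c -o demo-O0.s
    *     gcc -O2 -S opt_demo.c -o demo-O2.s
    */

   static int add(int l, int r)       /* small static function: inlining candidate */
   {
       return l + r;
   }

   int demo(int y)
   {
       int x = 6 * 7;                 /* constant folding: computed at compile time */
       int twice = y * 2;             /* strength reduction candidate: y << 1 */
       int sum = 0;

       for (int i = 0; i < 4; i++)    /* fixed trip count: unrolling candidate */
           sum = add(sum, twice);     /* add() is likely inlined at -O2 and above */

       return sum + x;                /* folds further once x is known to be 42 */
   }

At ``-O0`` the assembly follows the source almost literally; at ``-O2`` most
of the techniques from this chapter should show up in the diff.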