Appendix: Compiler Optimization Reference

This appendix provides a consolidated reference for compiler optimization concepts, flags, and techniques discussed throughout the book.

A. Common Compiler Optimization Flags

GCC/Clang Optimization Flags

Flag | Description
-O0 | No optimization (the default); fastest compilation, best for debugging
-O1 | Basic optimizations; reasonable compilation speed and performance
-O2 | More aggressive optimizations without significant space/speed tradeoffs
-O3 | Maximum optimization, including vectorization and more aggressive inlining
-Os | Optimize for size rather than speed
-Ofast | -O3 plus optimizations that may violate strict standards compliance (implies -ffast-math)
-flto | Enable link-time optimization
-fprofile-generate | Instrument code for profile-guided optimization (first stage)
-fprofile-use | Use collected profile data for optimization (second stage)
-march=native | Generate code for the compiling machine’s CPU, using every instruction set it supports
-mtune=native | Tune scheduling for the compiling machine’s CPU without changing the instruction set
-ffast-math | Enable aggressive floating-point optimizations (may affect precision)
-funroll-loops | Unroll loops when beneficial
-fomit-frame-pointer | Don’t keep the frame pointer in a register (frees a register)
-fvectorize | Enable auto-vectorization (Clang flag, on by default at -O2 and above; GCC’s equivalent is -ftree-vectorize, on by default at -O3)
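
To see how these flags combine in practice, here is a minimal sketch of a plain release build and a two-stage profile-guided build. The file name, workload, and exact flag choices are illustrative, not prescriptive.

    // hot_loop.cpp: a minimal sketch; names and sizes are illustrative.
    //
    // Plain release build:
    //   g++ -O3 -march=native hot_loop.cpp -o hot_loop
    //
    // Two-stage profile-guided build:
    //   g++ -O2 -fprofile-generate hot_loop.cpp -o hot_loop
    //   ./hot_loop                   (run a representative workload)
    //   g++ -O2 -fprofile-use -flto hot_loop.cpp -o hot_loop
    #include <cstdio>
    #include <vector>

    // A floating-point reduction. With -O3 -ffast-math the compiler may
    // also vectorize it; strict IEEE rules forbid the reassociation.
    double sum(const std::vector<double>& v) {
        double total = 0.0;
        for (double x : v) total += x;
        return total;
    }

    int main() {
        std::vector<double> data(1000000, 0.5);
        std::printf("%f\n", sum(data));
        return 0;
    }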

MSVC Optimization Flags

Flag | Description
/Od | Disable optimization (debug builds)
/O1 | Minimize size
/O2 | Maximize speed (default for release builds)
/Ox | Enable most speed optimizations (a subset of /O2)
/Os | Favor small code
/Ot | Favor fast code
/GL | Enable whole-program optimization
/arch:AVX2 | Generate code using AVX2 instructions
/fp:fast | Enable aggressive floating-point optimizations
/Qpar | Enable auto-parallelization
/Qvec-report:2 | Report auto-vectorization results
/LTCG | Link-time code generation (linker option)
/LTCG:PGINSTRUMENT | Instrument code for profile-guided optimization
/LTCG:PGOPTIMIZE | Use collected profile data for optimization
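
The MSVC counterpart of the two-stage profile-guided workflow uses the linker switches from the table above; the sketch below shows one way to drive it (the file name is hypothetical).

    // app.cpp: illustrative only.
    //
    //   cl /O2 /GL app.cpp /link /LTCG:PGINSTRUMENT
    //   app.exe                      (run a representative workload)
    //   cl /O2 /GL app.cpp /link /LTCG:PGOPTIMIZE
    #include <cstdio>

    int main() {
        std::puts("run a representative workload here");
        return 0;
    }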

B. Compiler Attributes and Pragmas

GCC/Clang Attributes

Attribute | Description | Example
__attribute__((pure)) | No side effects; result depends only on arguments and readable global state | __attribute__((pure)) int calculate(int x);
__attribute__((const)) | No side effects; result depends only on the argument values (may not read globals) | __attribute__((const)) int square(int x);
__attribute__((hot)) | Function is frequently executed | __attribute__((hot)) void critical_function();
__attribute__((cold)) | Function is rarely executed | __attribute__((cold)) void error_handler();
__attribute__((always_inline)) | Always inline this function | __attribute__((always_inline)) inline void fast_fn();
__attribute__((noinline)) | Never inline this function | __attribute__((noinline)) void dont_inline();
__attribute__((aligned(N))) | Align variable to N bytes | __attribute__((aligned(16))) float vector[4];
__attribute__((packed)) | No padding in structure | struct __attribute__((packed)) PackedStruct { ... };
__attribute__((noreturn)) | Function never returns | __attribute__((noreturn)) void exit_program();
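
The sketch below combines several of these attributes in one compilable unit (GCC/Clang only; the functions themselves are placeholders):

    #include <cstdio>
    #include <cstdlib>

    int scale = 2;   // global state read by the pure function below

    // const: the result depends only on the argument values, so calls
    // can be hoisted out of loops or folded away entirely.
    __attribute__((const)) int square(int x) { return x * x; }

    // pure: no side effects, but may read globals, so calls are less
    // freely movable than const calls.
    __attribute__((pure)) int scaled(int x) { return x * scale; }

    // cold + noreturn: keeps the error path out of the hot code layout.
    __attribute__((cold, noreturn)) void die(const char* msg) {
        std::fputs(msg, stderr);
        std::exit(EXIT_FAILURE);
    }

    // aligned: 16-byte alignment suitable for SSE loads.
    __attribute__((aligned(16))) float vec[4] = {1, 2, 3, 4};

    int main() {
        if (scaled(square(3)) != 18) die("unexpected result\n");
        std::printf("%f\n", vec[0] + vec[3]);
        return 0;
    }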

MSVC Attributes

Attribute | Description | Example
__declspec(noinline) | Never inline this function | __declspec(noinline) void function();
__declspec(noreturn) | Function never returns | __declspec(noreturn) void terminate();
__forceinline | Force function inlining | __forceinline int fast_function();
__declspec(align(N)) | Align variable to N bytes | __declspec(align(16)) float vector[4];
__declspec(restrict) | The returned pointer is not aliased (applies to pointer-returning functions; use __restrict for parameters) | __declspec(restrict) int* make_buffer();
__declspec(novtable) | Skip vtable initialization (for pure interface classes never instantiated directly) | class __declspec(novtable) Interface {...};
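
A corresponding MSVC sketch, including __declspec(restrict) in its correct position on a pointer-returning function (names are illustrative):

    #include <cstddef>
    #include <cstdlib>

    // restrict promises the caller that the returned pointer is not
    // aliased by any other pointer, much like malloc itself.
    __declspec(restrict) float* alloc_buffer(std::size_t n) {
        return static_cast<float*>(std::malloc(n * sizeof(float)));
    }

    // Inline regardless of the compiler’s own heuristics.
    __forceinline int add(int a, int b) { return a + b; }

    // 16-byte alignment suitable for SSE loads.
    __declspec(align(16)) float vec[4] = {1, 2, 3, 4};

    int main() {
        float* buf = alloc_buffer(64);
        buf[0] = static_cast<float>(add(1, 2)) + vec[0];
        std::free(buf);
        return 0;
    }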

Common Pragma Directives

Pragma | Description | Example
#pragma once | Include guard (alternative to #ifdef guards) | #pragma once
#pragma pack(N) | Set structure packing alignment | #pragma pack(1)
#pragma omp parallel | OpenMP parallel region | #pragma omp parallel for
#pragma GCC ivdep | Assume no loop-carried dependencies | #pragma GCC ivdep
#pragma unroll | Suggest loop unrolling | #pragma unroll(4)
#pragma GCC push_options | Save optimization settings (restore with #pragma GCC pop_options) | #pragma GCC push_options
#pragma GCC optimize | Set optimization level locally | #pragma GCC optimize("O3")
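
As a sketch of the GCC-specific directives working together, the fragment below raises the optimization level for one hot function and asserts that its loop iterations are independent; #pragma GCC pop_options then restores the translation unit’s original settings:

    #include <cstddef>

    #pragma GCC push_options
    #pragma GCC optimize("O3")

    void saxpy(float* y, const float* x, float a, std::size_t n) {
        // Promise the compiler there is no loop-carried dependence
        // through y, removing an aliasing obstacle to vectorization.
        #pragma GCC ivdep
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    #pragma GCC pop_options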

C. Optimization Techniques Summary

Loop Optimizations

Technique | Description | Performance Impact
Loop unrolling | Replicate the loop body to reduce per-iteration overhead | Cuts loop-control overhead, increases instruction-level parallelism
Loop interchange | Change nested loop order for better memory access | Improves cache utilization
Loop fusion | Combine multiple loops over the same data | Reduces loop overhead, improves cache utilization
Loop fission | Split complex loops into simpler ones | Can improve instruction cache behavior, enable vectorization
Loop-invariant code motion | Move computations out of loops | Eliminates redundant computation
Loop tiling/blocking | Process data in cache-friendly chunks (see the sketch below) | Dramatically improves cache performance
Loop unswitching | Move loop-invariant conditionals outside loops | Removes a branch from every iteration
Loop peeling | Handle edge-case iterations separately | Enables more aggressive optimization of the main loop
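
To make the highest-impact entry concrete, the sketch below tiles a matrix transpose so that each block of source and destination stays cache-resident; the sizes are illustrative and assume B divides N:

    #include <cstddef>

    constexpr std::size_t N = 1024;   // matrix edge
    constexpr std::size_t B = 64;     // tile edge, sized to fit in L1 cache

    // A naive transpose walks dst with stride N, missing cache on nearly
    // every write. Tiling confines each inner pass to one B x B block.
    void transpose_tiled(float* dst, const float* src) {
        for (std::size_t ii = 0; ii < N; ii += B)
            for (std::size_t jj = 0; jj < N; jj += B)
                for (std::size_t i = ii; i < ii + B; ++i)
                    for (std::size_t j = jj; j < jj + B; ++j)
                        dst[j * N + i] = src[i * N + j];
    }

The best tile size depends on the cache hierarchy and element size, so in practice it is chosen by measurement.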

Memory Optimizations

Technique | Description | Performance Impact
Dead store elimination | Remove writes that are never read | Reduces memory traffic
Load/store forwarding | Replace loads with already-computed values | Reduces memory dependencies
Memory access coalescing | Combine multiple accesses to the same cache line | Reduces memory traffic
Scalar replacement | Replace memory accesses with register accesses | Faster access, reduces pressure on the memory system
Structure splitting | Split large structures for better access patterns | Improves cache utilization
Structure reordering | Arrange structure fields by access frequency | Improves cache utilization
Buffer padding | Add padding to avoid false sharing (see the sketch below) | Reduces cache-coherence traffic
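
Buffer padding is the entry most often applied by hand, so here is a minimal sketch: two counters incremented by different threads are forced onto separate cache lines. A 64-byte line size is assumed (typical for x86-64); C++17 code can use std::hardware_destructive_interference_size instead.

    #include <atomic>
    #include <thread>

    // Without alignas, adjacent counters could share one cache line and
    // the two threads would bounce that line between cores (false sharing).
    struct PaddedCounter {
        alignas(64) std::atomic<long> value{0};
    };

    PaddedCounter counters[2];   // each element now owns a full cache line

    int main() {
        std::thread a([] { for (int i = 0; i < 1000000; ++i) ++counters[0].value; });
        std::thread b([] { for (int i = 0; i < 1000000; ++i) ++counters[1].value; });
        a.join();
        b.join();
        return 0;
    }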

Function Optimizations

Technique | Description | Performance Impact
Function inlining | Replace a function call with the function body | Removes call overhead, enables further optimization
Tail call optimization | Turn calls in tail position (including recursion) into jumps | Reduces call and stack overhead
Interprocedural optimization | Optimize across function boundaries | Enables whole-program optimizations
Function specialization | Create optimized versions for specific cases | Better-optimized code paths
Devirtualization | Replace virtual calls with direct calls (see the sketch below) | Eliminates indirection, enables inlining
Cross-module optimization | Optimize across compilation units | Enables broader optimizations
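
Devirtualization is the entry most visible at the source level. Marking a class final, as in the sketch below, is one common way to let the compiler replace virtual dispatch with a direct, inlinable call; the types are illustrative:

    #include <cstdio>

    struct Shape {
        virtual float area() const = 0;
        virtual ~Shape() = default;
    };

    // final guarantees no subclass overrides area(), so calls through a
    // pointer or reference known to be a Circle can be devirtualized and
    // then inlined.
    struct Circle final : Shape {
        float r;
        explicit Circle(float r) : r(r) {}
        float area() const override { return 3.14159265f * r * r; }
    };

    float total_area(const Circle& c) {
        return c.area();   // direct call: the static type is final
    }

    int main() {
        Circle c{2.0f};
        std::printf("%f\n", total_area(c));
        return 0;
    }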

SIMD Optimizations

Technique | Description | Performance Impact
Auto-vectorization | Compiler automatically emits SIMD instructions (see the sketch below) | Processes multiple elements in parallel
Loop vectorization | Transform scalar loops into vector operations | 2-16x speedup depending on data width
SLP vectorization | Vectorize straight-line (non-loop) code | Improves throughput for sequential operations
Mixed-width vectorization | Use different vector widths together | Better resource utilization
Vector predication | Mask off inactive lanes in partial or conditional operations | Enables vectorization of loops with conditionals
Permutation optimization | Minimize vector shuffling | Reduces overhead in vector code
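
The sketch below shows what auto-vectorization typically needs from the programmer. With __restrict (a widely supported extension of C’s restrict qualifier), the loop has no aliasing barrier and usually compiles to packed SIMD instructions at -O3 (GCC/Clang) or /O2 /arch:AVX2 (MSVC):

    #include <cstddef>

    // out, a, and b are promised not to overlap, so the compiler can
    // load, add, and store several floats per instruction.
    void add_arrays(float* __restrict out,
                    const float* __restrict a,
                    const float* __restrict b,
                    std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }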

D. Common Optimization Barriers

Barrier | Description | Mitigation
Pointer aliasing | Compiler cannot prove pointers refer to distinct memory | Use the restrict qualifier (or __restrict) where valid
Complex control flow | Many branches that depend on input | Simplify logic, use data-oriented design (see the branch-free sketch below)
Function calls | Especially virtual or indirect calls | Consider inlining or devirtualization
Memory dependencies | Data hazards between operations | Reorder operations when safe
Exception handling | Try/catch blocks constrain code motion | Minimize use in performance-critical paths
Non-contiguous memory access | Cache-unfriendly memory patterns | Restructure data for better locality
Thread synchronization | Locks and atomic operations | Minimize synchronization in hot paths
I/O operations | System calls and external interactions | Buffer I/O outside performance-critical sections
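
For the complex-control-flow row, one standard mitigation is to turn an unpredictable branch into a data dependency, which the compiler can lower to a conditional move or a masked vector operation. A sketch:

    #include <cstddef>

    // The branchy form 'if (v[i] > 0) s += v[i];' mispredicts heavily on
    // random data. The select below computes the same sum without a
    // per-element branch.
    float sum_positive(const float* v, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += (v[i] > 0.0f) ? v[i] : 0.0f;
        return s;
    }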

E. Assembly Instruction Reference

Below is a quick reference for common x86-64 instruction types relevant to performance optimization.

Core Instructions

Instruction Type | Examples | Performance Notes
Data movement | MOV, MOVSX, MOVZX | Register-to-register is fastest
Arithmetic | ADD, SUB, MUL, DIV | Integer MUL ~3-5 cycles, DIV ~10-20 cycles
Logical | AND, OR, XOR, NOT | Single-cycle latency
Shifts/rotates | SHL, SHR, ROL, ROR | Single cycle for constant shift counts
Branches | JMP, JE, JNE, JG, JL | Mispredictions cost ~15-20 cycles
Function | CALL, RET | Can disrupt the pipeline and instruction cache
Stack | PUSH, POP | Combine multiple pushes/pops into a single stack-pointer adjustment when possible

SIMD Instructions

Instruction Set | Examples | Width / Notes
SSE | MOVAPS, ADDPS, MULPS | 128-bit (4 floats)
AVX | VMOVAPS, VADDPS, VMULPS | 256-bit (8 floats)
AVX-512 | VMOVAPS, VADDPS, VMULPS (ZMM registers) | 512-bit (16 floats)
AVX-512 masking | VADDPS{k}, VMULPS{k} | Per-lane conditional operations
FMA | VFMADD213PS | Fused multiply-add: one rounding step, better precision (see the sketch below)
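
For concreteness, the intrinsics sketch below issues one FMA step of y = a*x + y over eight floats. It assumes an AVX2+FMA target (for example, built with -mavx2 -mfma) and 32-byte-aligned pointers; the function name is illustrative:

    #include <immintrin.h>

    void saxpy8(float* y, const float* x, float a) {
        __m256 va = _mm256_set1_ps(a);      // broadcast a to all 8 lanes
        __m256 vx = _mm256_load_ps(x);      // VMOVAPS: aligned 256-bit load
        __m256 vy = _mm256_load_ps(y);
        vy = _mm256_fmadd_ps(va, vx, vy);   // VFMADD213PS: a*x + y, single rounding
        _mm256_store_ps(y, vy);             // VMOVAPS: aligned 256-bit store
    }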

F. Profiling Tools

Tool | Platform | Description
perf | Linux | Sampling-based system profiler
Valgrind | Linux, macOS | Instrumentation framework (Callgrind/Cachegrind for profiling)
gprof | Unix-like | Function-level profiling
Intel VTune | Cross-platform | Advanced profiling for Intel CPUs
AMD uProf | Cross-platform | Profiling for AMD processors
Visual Studio Profiler | Windows | Integrated profiling in Visual Studio
Instruments | macOS | Apple’s profiling toolkit
Tracy | Cross-platform | Frame profiler for games/graphics
Flame Graphs | Various | Visualization technique for hierarchical profiles

G. Additional Resources

Books

  • “Optimizing Software in C++” by Agner Fog
  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron
  • “Engineering a Compiler” by Cooper and Torczon
  • “Performance Analysis and Tuning on Modern CPUs” by Denis Bakhvalov

Online References

Papers

  • “Automatic SIMD Vectorization of Fast Fourier Transforms for the ARM and x86 Architectures” - Mitra et al.
  • “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation” - Lattner and Adve
  • “Optimizing for Instruction Caches” - McFarling

H. Glossary of Terms

Term | Definition
Aliasing | When multiple pointers can refer to the same memory location
Auto-vectorization | Compiler automatically transforms scalar code to use SIMD instructions
Basic block | Straight-line code sequence with no branches except at entry and exit
Constant folding | Computing constant expressions at compile time
CSE (Common Subexpression Elimination) | Avoiding redundant computation of repeated expressions
DCE (Dead Code Elimination) | Removing code that doesn’t affect program output
Function inlining | Replacing a function call with the function’s body
ILP (Instruction-Level Parallelism) | Multiple instructions executed simultaneously
JIT (Just-in-Time) compilation | Compiling code during program execution
LICM (Loop-Invariant Code Motion) | Moving unchanged calculations outside loops
LTO (Link-Time Optimization) | Optimization across multiple compilation units
PGO (Profile-Guided Optimization) | Using runtime data to guide optimization
Register allocation | Assigning variables to CPU registers
SIMD (Single Instruction, Multiple Data) | Processing multiple data elements with one instruction
Strength reduction | Replacing expensive operations with cheaper ones
TLB (Translation Lookaside Buffer) | Cache for virtual-to-physical address mappings

This appendix serves as a quick reference. For detailed explanation of these concepts, refer to the corresponding chapters in the book.