This appendix provides a consolidated reference for compiler optimization concepts, flags, and techniques discussed throughout the book.
A. Common Compiler Optimization Flags
GCC/Clang Optimization Flags
| Flag | Description |
| --- | --- |
| -O0 | No optimization (the default); fastest compilation, best for debugging |
| -O1 | Basic optimizations; reasonable balance of compilation speed and performance |
| -O2 | More aggressive optimizations without significant size/speed trade-offs |
| -O3 | Maximum optimization, including auto-vectorization and more aggressive inlining |
| -Os | Optimize for size rather than speed |
| -Ofast | -O3 plus optimizations that may violate strict standards compliance (implies -ffast-math) |
| -flto | Enable link-time optimization |
| -fprofile-generate | Instrument code for profile-guided optimization (first stage) |
| -fprofile-use | Use collected profile data for optimization (second stage) |
| -march=native | Generate code for the CPU of the compiling machine, using all instruction set extensions it supports |
| -mtune=native | Tune instruction scheduling for the compiling machine's CPU without requiring newer instructions |
| -ffast-math | Enable aggressive floating-point optimizations (may affect precision and IEEE conformance) |
| -funroll-loops | Unroll loops when the compiler judges it beneficial |
| -fomit-frame-pointer | Don't keep the frame pointer in a register (frees a register; may hinder some debuggers and profilers) |
| -fvectorize | Enable auto-vectorization (Clang spelling, on by default at higher optimization levels; GCC's equivalent is -ftree-vectorize) |
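These flags are usually combined. As a hedged illustration, the sketch below shows one plausible profile-guided, link-time-optimized build; the file name, binary name, and workload are placeholders rather than anything prescribed by the book.

```cpp
// hot_loop.cpp -- hypothetical example; file and binary names are placeholders.
//
// Plausible two-stage PGO build with LTO (exact flags may vary by compiler version):
//   g++ -O2 -flto -march=native -fprofile-generate hot_loop.cpp -o app
//   ./app        # run a representative workload to collect the profile
//   g++ -O2 -flto -march=native -fprofile-use hot_loop.cpp -o app
#include <cstdio>

int main() {
    double sum = 0.0;
    for (int i = 0; i < 1'000'000; ++i)
        sum += i * 0.5;      // simple hot loop the optimizer can unroll/vectorize
    std::printf("%f\n", sum);
    return 0;
}
```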
MSVC Optimization Flags
| Flag | Description |
| --- | --- |
| /Od | Disable optimization (debug builds) |
| /O1 | Minimize size |
| /O2 | Maximize speed (the default for Visual Studio release builds) |
| /Ox | Enable a combination of speed optimizations (close to, but slightly less than, /O2) |
| /Os | Favor small code |
| /Ot | Favor fast code |
| /GL | Enable whole-program optimization |
| /arch:AVX2 | Generate code that uses AVX2 instructions |
| /fp:fast | Enable aggressive floating-point optimizations |
| /Qpar | Enable auto-parallelization |
| /Qvec-report:2 | Report auto-vectorization results, including loops that failed to vectorize |
| /LTCG | Link-time code generation |
| /LTCG:PGINSTRUMENT | Instrument code for profile-guided optimization |
| /LTCG:PGOPTIMIZE | Use collected profile data for optimization |
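The MSVC switches combine in the same way. The following sketch shows one plausible whole-program, profile-guided build from a Developer Command Prompt; the file name is a placeholder and the exact switches can differ between Visual Studio versions.

```cpp
// msvc_example.cpp -- hypothetical example; the file name is a placeholder.
//
// Plausible whole-program + PGO build (switch spellings vary by VS version):
//   cl /O2 /GL /arch:AVX2 msvc_example.cpp /link /LTCG:PGINSTRUMENT
//   msvc_example.exe       # run a representative workload
//   cl /O2 /GL /arch:AVX2 msvc_example.cpp /link /LTCG:PGOPTIMIZE
#include <cstdio>

int main() {
    long long sum = 0;
    for (int i = 0; i < 1'000'000; ++i)
        sum += i;            // trivial hot loop for the profile run
    std::printf("%lld\n", sum);
    return 0;
}
```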
B. Compiler Attributes and Pragmas
GCC/Clang Attributes
| Attribute | Description | Example |
| --- | --- | --- |
| __attribute__((pure)) | Function has no side effects (may read global state) | __attribute__((pure)) int calculate(int x); |
| __attribute__((const)) | Function depends only on its arguments and has no side effects | __attribute__((const)) int square(int x); |
| __attribute__((hot)) | Function is frequently executed | __attribute__((hot)) void critical_function(); |
| __attribute__((cold)) | Function is rarely executed | __attribute__((cold)) void error_handler(); |
| __attribute__((always_inline)) | Always inline this function | __attribute__((always_inline)) inline void fast_fn(); |
| __attribute__((noinline)) | Never inline this function | __attribute__((noinline)) void dont_inline(); |
| __attribute__((aligned(N))) | Align variable (or type) to N bytes | __attribute__((aligned(16))) float vector[4]; |
| __attribute__((packed)) | No padding between structure members | struct __attribute__((packed)) PackedStruct { ... }; |
| __attribute__((noreturn)) | Function never returns | __attribute__((noreturn)) void exit_program(); |
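To show how these attributes read in source, here is a small set of illustrative declarations; the function and type names are invented for this example.

```cpp
// Illustrative declarations only; all names are placeholders.
#include <cstdlib>

// Result depends only on the argument and global reads; no state is written.
__attribute__((pure)) int table_lookup(int index);

// Result depends only on its arguments; the compiler may hoist or CSE calls.
__attribute__((const)) int square(int x) { return x * x; }

// Layout hints: keep the hot path tight, move error handling out of the way.
__attribute__((hot)) void process_frame();
__attribute__((cold)) __attribute__((noreturn)) void fatal_error() { std::abort(); }

// 16-byte alignment, e.g. so SSE loads of this array can use aligned moves.
__attribute__((aligned(16))) float vec4[4];

// No padding between members; useful when matching a wire or file format.
struct __attribute__((packed)) Header {
    char tag;
    int  length;
};
```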
MSVC Attributes
| Attribute | Description | Example |
| --- | --- | --- |
| __declspec(noinline) | Never inline this function | __declspec(noinline) void function(); |
| __declspec(noreturn) | Function never returns | __declspec(noreturn) void terminate(); |
| __forceinline | Force function inlining | __forceinline int fast_function(); |
| __declspec(align(N)) | Align variable to N bytes | __declspec(align(16)) float vector[4]; |
| __declspec(restrict) | The pointer returned by the function is not aliased | __declspec(restrict) void* my_alloc(size_t size); |
| __declspec(novtable) | Pure interface class; no vtable-initialization code is generated | class __declspec(novtable) Interface {...}; |
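The MSVC spellings look like this in practice; all names are invented for illustration, and note that __declspec(restrict) applies to functions that return a pointer rather than to pointer parameters (for parameters MSVC offers the __restrict keyword).

```cpp
// Illustrative MSVC-specific declarations; all names are placeholders.
#include <cstddef>
#include <cstdlib>

// Always inline the accessor, never inline the slow path.
__forceinline int fast_add(int a, int b) { return a + b; }
__declspec(noinline) void slow_path();

// The process terminates here; callers need no return-path code.
__declspec(noreturn) void die() { std::exit(1); }

// 16-byte-aligned buffer (modern code can use alignas(16) instead).
__declspec(align(16)) float vec4[4];

// The returned pointer is not aliased by any other pointer.
__declspec(restrict) void* pool_alloc(std::size_t bytes);

// Pure interface class: MSVC omits vtable-initialization code for it.
class __declspec(novtable) IRenderable {
public:
    virtual void render() = 0;
};
```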
Common Pragma Directives
| Pragma | Description | Example |
| --- | --- | --- |
| #pragma once | Include guard (alternative to #ifndef guards) | #pragma once |
| #pragma pack(N) | Set structure packing/alignment | #pragma pack(1) |
| #pragma omp parallel | OpenMP parallel region | #pragma omp parallel for |
| #pragma GCC ivdep | Assume no loop-carried dependencies in the following loop | #pragma GCC ivdep |
| #pragma unroll | Suggest loop unrolling (syntax varies by compiler) | #pragma unroll(4) |
| #pragma GCC push_options | Save current optimization settings (restore with #pragma GCC pop_options) | #pragma GCC push_options |
| #pragma GCC optimize | Set optimization options locally | #pragma GCC optimize("O3") |
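A short, hedged sketch of how several of these pragmas are typically combined follows; the names are placeholders, the OpenMP line requires -fopenmp (or /openmp), and the GCC-prefixed pragmas are GCC-specific.

```cpp
// Illustrative pragma usage; names are placeholders.
// (#pragma once would appear at the top of a header, replacing an #ifndef guard.)

#pragma pack(push, 1)           // 1-byte packing for a wire-format struct
struct WireHeader {
    char     tag;
    unsigned length;
};
#pragma pack(pop)

#pragma GCC push_options        // save current settings, raise them locally
#pragma GCC optimize("O3")
void hot_kernel(float* out, const float* in, int n) {
#pragma GCC ivdep               // promise: no loop-carried dependencies
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}
#pragma GCC pop_options         // restore the saved settings

void parallel_fill(float* data, int n) {
#pragma omp parallel for        // distribute iterations across threads
    for (int i = 0; i < n; ++i)
        data[i] = static_cast<float>(i);
}
```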
C. Optimization Techniques Summary
Loop Optimizations
| Technique | Description | Performance Impact |
| --- | --- | --- |
| Loop unrolling | Replicate the loop body to reduce loop overhead | Reduces branch overhead, increases instruction-level parallelism |
| Loop interchange | Change nested loop order for better memory access | Improves cache utilization |
| Loop fusion | Combine multiple loops over the same data | Reduces loop overhead, improves cache utilization |
| Loop fission | Split complex loops into simpler ones | Can improve instruction-cache behavior and enable vectorization |
| Loop-invariant code motion | Move computations that don't change out of the loop | Eliminates redundant computation |
| Loop tiling/blocking | Process data in cache-sized chunks | Dramatically improves cache performance for large working sets |
| Loop unswitching | Move loop-invariant conditionals outside the loop | Removes per-iteration branch evaluation |
| Loop peeling | Handle first/last iterations (edge cases) separately | Enables more aggressive optimization of the main loop body |
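To make two of these techniques concrete, the sketch below applies loop-invariant code motion and loop interchange by hand to a hypothetical row-major matrix update; optimizing compilers often perform both automatically at -O2/-O3.

```cpp
#include <cstddef>

constexpr std::size_t N = 512;

// Before: 'scale * bias' is recomputed every iteration, and the inner loop
// walks the matrix column-by-column (cache-unfriendly for row-major data).
void transform_naive(float (&m)[N][N], float scale, float bias) {
    for (std::size_t j = 0; j < N; ++j)
        for (std::size_t i = 0; i < N; ++i)
            m[i][j] = m[i][j] * (scale * bias);
}

// After: the invariant product is hoisted (loop-invariant code motion) and
// the loops are interchanged so the inner loop is unit-stride.
void transform_tuned(float (&m)[N][N], float scale, float bias) {
    const float k = scale * bias;        // hoisted invariant
    for (std::size_t i = 0; i < N; ++i)  // interchanged: row-major traversal
        for (std::size_t j = 0; j < N; ++j)
            m[i][j] *= k;
}
```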
Memory Optimizations
| Technique | Description | Performance Impact |
| --- | --- | --- |
| Dead store elimination | Remove writes that are never read | Reduces memory traffic |
| Load/store forwarding | Replace loads with values already computed | Reduces memory dependencies |
| Memory access coalescing | Combine multiple accesses to the same cache line | Reduces memory traffic |
| Scalar replacement | Replace memory accesses with register accesses | Faster access, reduces pressure on the memory system |
| Structure splitting | Split large structures for better access patterns | Improves cache utilization |
| Structure reordering | Arrange structure fields by access frequency | Improves cache utilization |
| Buffer padding | Add padding to avoid false sharing | Reduces cache-coherence traffic |
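The following sketch illustrates structure reordering and buffer padding with invented types; the stated sizes assume a typical 64-bit ABI with 8-byte doubles and 64-byte cache lines.

```cpp
#include <atomic>
#include <cstdint>

// Field reordering: grouping members by size removes padding.
struct ParticleLoose {      // likely 24 bytes on a typical 64-bit ABI
    std::uint8_t flags;     // 1 byte + 7 bytes padding before the double
    double       mass;
    std::uint8_t kind;      // 1 byte + 7 bytes trailing padding
};

struct ParticlePacked {     // likely 16 bytes: same fields, reordered
    double       mass;
    std::uint8_t flags;
    std::uint8_t kind;
};

// Buffer padding: give each heavily written counter its own cache line so
// two threads incrementing different counters do not ping-pong one line.
struct alignas(64) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
    // alignas(64) places successive PaddedCounter objects on distinct lines.
};

PaddedCounter per_thread_hits[8];   // e.g. one counter per worker thread
```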
Function Optimizations
| Technique | Description | Performance Impact |
| --- | --- | --- |
| Function inlining | Replace a function call with the function body | Eliminates call overhead, enables further optimization |
| Tail call optimization | Reuse the caller's stack frame for calls in tail position | Reduces stack usage; turns tail recursion into a loop |
| Interprocedural optimization | Optimize across function boundaries | Enables holistic optimizations |
| Function specialization | Create optimized versions for specific argument values or types | Better-optimized code paths |
| Devirtualization | Replace virtual calls with direct calls | Eliminates indirection, enables inlining |
| Cross-module optimization | Optimize across compilation units (e.g. via LTO) | Enables broader optimizations |
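As one hedged illustration of devirtualization: marking a class (or a virtual function) final gives the compiler enough information to turn virtual calls through that static type into direct, inlinable calls. The Shape/Circle names below are invented for this example.

```cpp
struct Shape {
    virtual ~Shape() = default;
    virtual double area() const = 0;
};

// 'final' tells the compiler no further override can exist, so a call through
// a Circle* or Circle& can be devirtualized into a direct call and inlined.
struct Circle final : Shape {
    explicit Circle(double r) : r_(r) {}
    double area() const override { return 3.141592653589793 * r_ * r_; }
private:
    double r_;
};

double total_area(const Circle* shapes, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += shapes[i].area();   // static type is Circle (final): direct call
    return sum;
}
```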
SIMD Optimizations
| Technique | Description | Performance Impact |
| --- | --- | --- |
| Auto-vectorization | Compiler automatically emits SIMD instructions | Processes multiple elements per instruction |
| Loop vectorization | Transform scalar loops into vector operations | Roughly 2-16x speedup depending on data width and vector length |
| SLP vectorization | Vectorize straight-line (superword-level) code | Improves throughput of sequences of similar operations |
| Mixed-width vectorization | Combine different vector widths in one region | Better resource utilization |
| Vector predication | Handle partial or conditional vector operations with masks | Enables vectorization of loops containing conditionals |
| Permutation optimization | Minimize vector shuffling and lane crossing | Reduces overhead in vector code |
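A loop shaped like the hypothetical saxpy below is the classic auto-vectorization candidate: contiguous, unit-stride, and free of aliasing thanks to __restrict (a widely supported compiler-extension spelling of C's restrict). With GCC/Clang at -O3, or -O2 plus the vectorization flags listed in Appendix A, loops of this form are routinely vectorized.

```cpp
// saxpy-style loop: contiguous, unit-stride, no aliasing, trip count known
// only at run time -- a typical auto-vectorization candidate.
void saxpy(float* __restrict y, const float* __restrict x, float a, int n) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];    // maps naturally onto MULPS/ADDPS or FMA
}
```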
D. Common Optimization Barriers
| Barrier | Description | Mitigation |
| --- | --- | --- |
| Pointer aliasing | Compiler cannot prove pointers refer to distinct memory | Use the restrict qualifier or compiler equivalent |
| Complex control flow | Many branches that depend on input data | Simplify logic, use data-oriented design |
| Function calls | Especially virtual or indirect calls | Consider inlining or devirtualization |
| Memory dependencies | Data hazards between operations | Reorder operations when it is provably safe |
| Exception handling | try/catch blocks limit code motion | Minimize use in performance-critical paths |
| Non-contiguous memory access | Cache-unfriendly access patterns | Restructure data for better locality |
| Thread synchronization | Locks and atomic operations serialize execution | Minimize synchronization in hot paths |
| I/O operations | System calls and external interactions | Buffer I/O and keep it outside performance-critical sections |
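As a hedged sketch of working around two of these barriers, aliasing and input-dependent branching, the invented functions below use __restrict and a branchless select; whether the compiler actually emits a conditional move or min instruction depends on the target and optimization level.

```cpp
// Aliasing barrier: without __restrict the compiler must assume 'dst' may
// overlap 'src' and cannot freely reorder or vectorize the loop.
void scale(float* __restrict dst, const float* __restrict src, float k, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = k * src[i];
}

// Control-flow barrier: an unpredictable per-element branch is rewritten
// as a comparison plus select, which avoids mispredictions on random data
// and keeps the loop friendly to vectorization.
int count_and_clamp(int* data, int n, int limit) {
    int clamped = 0;
    for (int i = 0; i < n; ++i) {
        const bool over = data[i] > limit;
        clamped += over;                     // bool promotes to 0 or 1
        data[i] = over ? limit : data[i];    // compilers often emit cmov/min here
    }
    return clamped;
}
```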
E. Assembly Instruction Reference
Below is a quick reference for common x86-64 instruction types relevant to performance optimization.
Core Instructions
Instruction Type |
Examples |
Performance Notes |
Data movement |
MOV, MOVSX, MOVZX |
Register-to-register fastest |
Arithmetic |
ADD, SUB, MUL, DIV |
Integer MUL ~3-5 cycles, DIV ~10-20 cycles |
Logical |
AND, OR, XOR, NOT |
Single-cycle latency |
Shifts/rotates |
SHL, SHR, ROL, ROR |
Single-cycle for constant shifts |
Branches |
JMP, JE, JNE, JG, JL |
Mispredictions cost ~15-20 cycles |
Function |
CALL, RET |
Can disrupt pipeline and instruction cache |
Stack |
PUSH, POP |
Combine multiple into single stack adjustment when possible |
SIMD Instructions
| Instruction Set | Examples | Width |
| --- | --- | --- |
| SSE | MOVAPS, ADDPS, MULPS | 128-bit (4 floats) |
| AVX | VMOVAPS, VADDPS, VMULPS | 256-bit (8 floats) |
| AVX-512 | VMOVAPS, VADDPS, VMULPS (on ZMM registers) | 512-bit (16 floats) |
| AVX-512 masking | VADDPS{k}, VMULPS{k} | Any AVX-512 width, with per-element predication |
| FMA | VFMADD213PS | Fused multiply-add with a single rounding step (better precision) |
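For readers who want to map these instruction sets back to C++, here is a hedged intrinsics sketch using AVX loads and FMA; the function is invented for illustration, assumes n is a multiple of 8, and requires building with AVX2/FMA enabled (e.g. -mavx2 -mfma on GCC/Clang, or /arch:AVX2 on MSVC).

```cpp
#include <immintrin.h>

// y[i] = a * x[i] + y[i], 8 floats per iteration; assumes n % 8 == 0 and
// that the target CPU supports AVX and FMA (check at build or run time).
void saxpy_avx_fma(float* y, const float* x, float a, int n) {
    const __m256 va = _mm256_set1_ps(a);             // broadcast the scalar a
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);          // unaligned vector load
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va, vx, vy);            // VFMADD...PS: a*x + y
        _mm256_storeu_ps(y + i, vy);                 // store the result
    }
}
```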
F. Profiling Tools
| Tool | Platform | Description |
| --- | --- | --- |
| perf | Linux | Sampling-based system profiler |
| Valgrind | Linux, macOS | Instrumentation-based profiler (Callgrind/Cachegrind) |
| gprof | Cross-platform (GNU toolchain) | Function-level profiling |
| Intel VTune | Cross-platform | Advanced profiling, best supported on Intel CPUs |
| AMD uProf | Cross-platform | Profiling for AMD processors |
| Visual Studio Profiler | Windows | Integrated profiling in Visual Studio |
| Instruments | macOS | Apple's profiling toolkit |
| Tracy | Cross-platform | Frame profiler for games/graphics |
| Flame Graphs | Various | Visualization technique for hierarchical profiles |
G. Additional Resources
Books
- “Optimizing Software in C++” by Agner Fog
- “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron
- “Engineering a Compiler” by Cooper and Torczon
- “Performance Analysis and Tuning on Modern CPUs” by Denis Bakhvalov
Online References
Papers
- “Automatic SIMD Vectorization of Fast Fourier Transforms for the ARM and x86 Architectures” - Mitra et al.
- “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation” - Lattner and Adve
- “Optimizing for Instruction Caches” - McFarling
H. Glossary of Terms
| Term | Definition |
| --- | --- |
| Aliasing | When multiple pointers can refer to the same memory location |
| Auto-vectorization | The compiler automatically transforms scalar code to use SIMD instructions |
| Basic block | Straight-line code sequence with a single entry point and a single exit |
| Constant folding | Computing constant expressions at compile time |
| CSE (Common Subexpression Elimination) | Reusing a previously computed value instead of recomputing it |
| DCE (Dead Code Elimination) | Removing code that doesn't affect program output |
| Function inlining | Replacing a function call with the function's body |
| ILP (Instruction-Level Parallelism) | Executing multiple independent instructions simultaneously |
| JIT (Just-in-Time) compilation | Compiling code during program execution |
| LICM (Loop-Invariant Code Motion) | Moving calculations that don't change between iterations outside the loop |
| LTO (Link-Time Optimization) | Optimization across multiple compilation units at link time |
| PGO (Profile-Guided Optimization) | Using runtime profile data to guide optimization decisions |
| Register allocation | Assigning variables to CPU registers |
| SIMD (Single Instruction, Multiple Data) | Processing multiple data elements with one instruction |
| Strength reduction | Replacing expensive operations with cheaper equivalent ones |
| TLB (Translation Lookaside Buffer) | Cache for virtual-to-physical address translations |
This appendix serves as a quick reference. For detailed explanations of these concepts, refer to the corresponding chapters of the book.