Appendix: Compiler Optimization Reference

This appendix provides a consolidated reference for compiler optimization concepts, flags, and techniques discussed throughout the book.

A. Common Compiler Optimization Flags

GCC/Clang Optimization Flags

Flag | Description
-O0 | No optimization (the default); fastest compilation, best for debugging
-O1 | Basic optimizations; reasonable compilation speed and performance
-O2 | More aggressive optimizations without significant space/speed tradeoffs
-O3 | Maximum optimization, including vectorization and more aggressive inlining
-Os | Optimize for size rather than speed
-Ofast | -O3 plus optimizations that may violate strict standards compliance (implies -ffast-math)
-flto | Enable link-time optimization
-fprofile-generate | Instrument code for profile-guided optimization (first stage)
-fprofile-use | Use collected profile data for optimization (second stage)
-march=native | Generate code for the compiling machine’s CPU, using every instruction set it supports
-mtune=native | Tune scheduling for the compiling machine’s CPU without changing the instruction set
-ffast-math | Enable aggressive floating-point optimizations (may affect precision)
-funroll-loops | Unroll loops when beneficial
-fomit-frame-pointer | Don’t keep the frame pointer in a register (frees a register)
-fvectorize | Enable auto-vectorization (Clang flag, on by default at -O2 and above; GCC’s equivalent is -ftree-vectorize, on by default at -O3)
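
To see how these flags combine in practice, here is a minimal sketch of a plain release build and a two-stage profile-guided build. The file name, workload, and exact flag choices are illustrative, not prescriptive.

    // hot_loop.cpp: a minimal sketch; names and sizes are illustrative.
    //
    // Plain release build:
    //   g++ -O3 -march=native hot_loop.cpp -o hot_loop
    //
    // Two-stage profile-guided build:
    //   g++ -O2 -fprofile-generate hot_loop.cpp -o hot_loop
    //   ./hot_loop                   (run a representative workload)
    //   g++ -O2 -fprofile-use -flto hot_loop.cpp -o hot_loop
    #include <cstdio>
    #include <vector>

    // A floating-point reduction. With -O3 -ffast-math the compiler may
    // also vectorize it; strict IEEE rules forbid the reassociation.
    double sum(const std::vector<double>& v) {
        double total = 0.0;
        for (double x : v) total += x;
        return total;
    }

    int main() {
        std::vector<double> data(1000000, 0.5);
        std::printf("%f\n", sum(data));
        return 0;
    }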

MSVC Optimization Flags

Flag | Description
/Od | Disable optimization (debug builds)
/O1 | Minimize size
/O2 | Maximize speed (default for release builds)
/Ox | Enable most speed optimizations (a subset of /O2)
/Os | Favor small code
/Ot | Favor fast code
/GL | Enable whole-program optimization
/arch:AVX2 | Generate code using AVX2 instructions
/fp:fast | Enable aggressive floating-point optimizations
/Qpar | Enable auto-parallelization
/Qvec-report:2 | Report auto-vectorization results
/LTCG | Link-time code generation (linker option)
/LTCG:PGINSTRUMENT | Instrument code for profile-guided optimization
/LTCG:PGOPTIMIZE | Use collected profile data for optimization
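
The MSVC counterpart of the two-stage profile-guided workflow uses the linker switches from the table above; the sketch below shows one way to drive it (the file name is hypothetical).

    // app.cpp: illustrative only.
    //
    //   cl /O2 /GL app.cpp /link /LTCG:PGINSTRUMENT
    //   app.exe                      (run a representative workload)
    //   cl /O2 /GL app.cpp /link /LTCG:PGOPTIMIZE
    #include <cstdio>

    int main() {
        std::puts("run a representative workload here");
        return 0;
    }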

B. Compiler Attributes and Pragmas

GCC/Clang Attributes

Attribute | Description | Example
__attribute__((pure)) | No side effects; result depends only on arguments and readable global state | __attribute__((pure)) int calculate(int x);
__attribute__((const)) | No side effects; result depends only on the argument values (may not read globals) | __attribute__((const)) int square(int x);
__attribute__((hot)) | Function is frequently executed | __attribute__((hot)) void critical_function();
__attribute__((cold)) | Function is rarely executed | __attribute__((cold)) void error_handler();
__attribute__((always_inline)) | Always inline this function | __attribute__((always_inline)) inline void fast_fn();
__attribute__((noinline)) | Never inline this function | __attribute__((noinline)) void dont_inline();
__attribute__((aligned(N))) | Align variable to N bytes | __attribute__((aligned(16))) float vector[4];
__attribute__((packed)) | No padding in structure | struct __attribute__((packed)) PackedStruct { ... };
__attribute__((noreturn)) | Function never returns | __attribute__((noreturn)) void exit_program();
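
The sketch below combines several of these attributes in one compilable unit (GCC/Clang only; the functions themselves are placeholders):

    #include <cstdio>
    #include <cstdlib>

    int scale = 2;   // global state read by the pure function below

    // const: the result depends only on the argument values, so calls
    // can be hoisted out of loops or folded away entirely.
    __attribute__((const)) int square(int x) { return x * x; }

    // pure: no side effects, but may read globals, so calls are less
    // freely movable than const calls.
    __attribute__((pure)) int scaled(int x) { return x * scale; }

    // cold + noreturn: keeps the error path out of the hot code layout.
    __attribute__((cold, noreturn)) void die(const char* msg) {
        std::fputs(msg, stderr);
        std::exit(EXIT_FAILURE);
    }

    // aligned: 16-byte alignment suitable for SSE loads.
    __attribute__((aligned(16))) float vec[4] = {1, 2, 3, 4};

    int main() {
        if (scaled(square(3)) != 18) die("unexpected result\n");
        std::printf("%f\n", vec[0] + vec[3]);
        return 0;
    }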

MSVC Attributes

Attribute | Description | Example
__declspec(noinline) | Never inline this function | __declspec(noinline) void function();
__declspec(noreturn) | Function never returns | __declspec(noreturn) void terminate();
__forceinline | Force function inlining | __forceinline int fast_function();
__declspec(align(N)) | Align variable to N bytes | __declspec(align(16)) float vector[4];
__declspec(restrict) | The returned pointer is not aliased (applies to pointer-returning functions; use __restrict for parameters) | __declspec(restrict) int* make_buffer();
__declspec(novtable) | Skip vtable initialization (for pure interface classes never instantiated directly) | class __declspec(novtable) Interface {...};
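
A corresponding MSVC sketch, including __declspec(restrict) in its correct position on a pointer-returning function (names are illustrative):

    #include <cstddef>
    #include <cstdlib>

    // restrict promises the caller that the returned pointer is not
    // aliased by any other pointer, much like malloc itself.
    __declspec(restrict) float* alloc_buffer(std::size_t n) {
        return static_cast<float*>(std::malloc(n * sizeof(float)));
    }

    // Inline regardless of the compiler’s own heuristics.
    __forceinline int add(int a, int b) { return a + b; }

    // 16-byte alignment suitable for SSE loads.
    __declspec(align(16)) float vec[4] = {1, 2, 3, 4};

    int main() {
        float* buf = alloc_buffer(64);
        buf[0] = static_cast<float>(add(1, 2)) + vec[0];
        std::free(buf);
        return 0;
    }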

Common Pragma Directives

Pragma | Description | Example
#pragma once | Include guard (alternative to #ifdef guards) | #pragma once
#pragma pack(N) | Set structure packing alignment | #pragma pack(1)
#pragma omp parallel | OpenMP parallel region | #pragma omp parallel for
#pragma GCC ivdep | Assume no loop-carried dependencies | #pragma GCC ivdep
#pragma unroll | Suggest loop unrolling | #pragma unroll(4)
#pragma GCC push_options | Save optimization settings (restore with #pragma GCC pop_options) | #pragma GCC push_options
#pragma GCC optimize | Set optimization level locally | #pragma GCC optimize("O3")
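
As a sketch of the GCC-specific directives working together, the fragment below raises the optimization level for one hot function and asserts that its loop iterations are independent; #pragma GCC pop_options then restores the translation unit’s original settings:

    #include <cstddef>

    #pragma GCC push_options
    #pragma GCC optimize("O3")

    void saxpy(float* y, const float* x, float a, std::size_t n) {
        // Promise the compiler there is no loop-carried dependence
        // through y, removing an aliasing obstacle to vectorization.
        #pragma GCC ivdep
        for (std::size_t i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    #pragma GCC pop_options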

C. Optimization Techniques Summary

Loop Optimizations

Technique | Description | Performance Impact
Loop unrolling | Replicate the loop body to reduce per-iteration overhead | Cuts loop-control overhead, increases instruction-level parallelism
Loop interchange | Change nested loop order for better memory access | Improves cache utilization
Loop fusion | Combine multiple loops over the same data | Reduces loop overhead, improves cache utilization
Loop fission | Split complex loops into simpler ones | Can improve instruction cache behavior, enable vectorization
Loop-invariant code motion | Move computations out of loops | Eliminates redundant computation
Loop tiling/blocking | Process data in cache-friendly chunks (see the sketch below) | Dramatically improves cache performance
Loop unswitching | Move loop-invariant conditionals outside loops | Removes a branch from every iteration
Loop peeling | Handle edge-case iterations separately | Enables more aggressive optimization of the main loop
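
To make the highest-impact entry concrete, the sketch below tiles a matrix transpose so that each block of source and destination stays cache-resident; the sizes are illustrative and assume B divides N:

    #include <cstddef>

    constexpr std::size_t N = 1024;   // matrix edge
    constexpr std::size_t B = 64;     // tile edge, sized to fit in L1 cache

    // A naive transpose walks dst with stride N, missing cache on nearly
    // every write. Tiling confines each inner pass to one B x B block.
    void transpose_tiled(float* dst, const float* src) {
        for (std::size_t ii = 0; ii < N; ii += B)
            for (std::size_t jj = 0; jj < N; jj += B)
                for (std::size_t i = ii; i < ii + B; ++i)
                    for (std::size_t j = jj; j < jj + B; ++j)
                        dst[j * N + i] = src[i * N + j];
    }

The best tile size depends on the cache hierarchy and element size, so in practice it is chosen by measurement.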

Memory Optimizations

Technique | Description | Performance Impact
Dead store elimination | Remove writes that are never read | Reduces memory traffic
Load/store forwarding | Replace loads with already-computed values | Reduces memory dependencies
Memory access coalescing | Combine multiple accesses to the same cache line | Reduces memory traffic
Scalar replacement | Replace memory accesses with register accesses | Faster access, reduces pressure on the memory system
Structure splitting | Split large structures for better access patterns | Improves cache utilization
Structure reordering | Arrange structure fields by access frequency | Improves cache utilization
Buffer padding | Add padding to avoid false sharing (see the sketch below) | Reduces cache-coherence traffic
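
Buffer padding is the entry most often applied by hand, so here is a minimal sketch: two counters incremented by different threads are forced onto separate cache lines. A 64-byte line size is assumed (typical for x86-64); C++17 code can use std::hardware_destructive_interference_size instead.

    #include <atomic>
    #include <thread>

    // Without alignas, adjacent counters could share one cache line and
    // the two threads would bounce that line between cores (false sharing).
    struct PaddedCounter {
        alignas(64) std::atomic<long> value{0};
    };

    PaddedCounter counters[2];   // each element now owns a full cache line

    int main() {
        std::thread a([] { for (int i = 0; i < 1000000; ++i) ++counters[0].value; });
        std::thread b([] { for (int i = 0; i < 1000000; ++i) ++counters[1].value; });
        a.join();
        b.join();
        return 0;
    }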

Function Optimizations

Technique | Description | Performance Impact
Function inlining | Replace a function call with the function body | Removes call overhead, enables further optimization
Tail call optimization | Turn calls in tail position (including recursion) into jumps | Reduces call and stack overhead
Interprocedural optimization | Optimize across function boundaries | Enables whole-program optimizations
Function specialization | Create optimized versions for specific cases | Better-optimized code paths
Devirtualization | Replace virtual calls with direct calls (see the sketch below) | Eliminates indirection, enables inlining
Cross-module optimization | Optimize across compilation units | Enables broader optimizations
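
Devirtualization is the entry most visible at the source level. Marking a class final, as in the sketch below, is one common way to let the compiler replace virtual dispatch with a direct, inlinable call; the types are illustrative:

    #include <cstdio>

    struct Shape {
        virtual float area() const = 0;
        virtual ~Shape() = default;
    };

    // final guarantees no subclass overrides area(), so calls through a
    // pointer or reference known to be a Circle can be devirtualized and
    // then inlined.
    struct Circle final : Shape {
        float r;
        explicit Circle(float r) : r(r) {}
        float area() const override { return 3.14159265f * r * r; }
    };

    float total_area(const Circle& c) {
        return c.area();   // direct call: the static type is final
    }

    int main() {
        Circle c{2.0f};
        std::printf("%f\n", total_area(c));
        return 0;
    }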

SIMD Optimizations

Technique | Description | Performance Impact
Auto-vectorization | Compiler automatically emits SIMD instructions (see the sketch below) | Processes multiple elements in parallel
Loop vectorization | Transform scalar loops into vector operations | 2-16x speedup depending on data width
SLP vectorization | Vectorize straight-line (non-loop) code | Improves throughput for sequential operations
Mixed-width vectorization | Use different vector widths together | Better resource utilization
Vector predication | Mask off inactive lanes in partial or conditional operations | Enables vectorization of loops with conditionals
Permutation optimization | Minimize vector shuffling | Reduces overhead in vector code
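
The sketch below shows what auto-vectorization typically needs from the programmer. With __restrict (a widely supported extension of C’s restrict qualifier), the loop has no aliasing barrier and usually compiles to packed SIMD instructions at -O3 (GCC/Clang) or /O2 /arch:AVX2 (MSVC):

    #include <cstddef>

    // out, a, and b are promised not to overlap, so the compiler can
    // load, add, and store several floats per instruction.
    void add_arrays(float* __restrict out,
                    const float* __restrict a,
                    const float* __restrict b,
                    std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] + b[i];
    }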

D. Common Optimization Barriers

Barrier | Description | Mitigation
Pointer aliasing | Compiler cannot prove pointers refer to distinct memory | Use the restrict qualifier (or __restrict) where valid
Complex control flow | Many branches that depend on input | Simplify logic, use data-oriented design (see the branch-free sketch below)
Function calls | Especially virtual or indirect calls | Consider inlining or devirtualization
Memory dependencies | Data hazards between operations | Reorder operations when safe
Exception handling | Try/catch blocks constrain code motion | Minimize use in performance-critical paths
Non-contiguous memory access | Cache-unfriendly memory patterns | Restructure data for better locality
Thread synchronization | Locks and atomic operations | Minimize synchronization in hot paths
I/O operations | System calls and external interactions | Buffer I/O outside performance-critical sections
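
For the complex-control-flow row, one standard mitigation is to turn an unpredictable branch into a data dependency, which the compiler can lower to a conditional move or a masked vector operation. A sketch:

    #include <cstddef>

    // The branchy form 'if (v[i] > 0) s += v[i];' mispredicts heavily on
    // random data. The select below computes the same sum without a
    // per-element branch.
    float sum_positive(const float* v, std::size_t n) {
        float s = 0.0f;
        for (std::size_t i = 0; i < n; ++i)
            s += (v[i] > 0.0f) ? v[i] : 0.0f;
        return s;
    }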

E. Assembly Instruction Reference

Below is a quick reference for common x86-64 instruction types relevant to performance optimization.

Core Instructions

Instruction Type | Examples | Performance Notes
Data movement | MOV, MOVSX, MOVZX | Register-to-register is fastest
Arithmetic | ADD, SUB, MUL, DIV | Integer MUL ~3-5 cycles, DIV ~10-20 cycles
Logical | AND, OR, XOR, NOT | Single-cycle latency
Shifts/rotates | SHL, SHR, ROL, ROR | Single cycle for constant shift counts
Branches | JMP, JE, JNE, JG, JL | Mispredictions cost ~15-20 cycles
Function | CALL, RET | Can disrupt the pipeline and instruction cache
Stack | PUSH, POP | Combine multiple pushes/pops into a single stack-pointer adjustment when possible

SIMD Instructions

Instruction Set | Examples | Width / Notes
SSE | MOVAPS, ADDPS, MULPS | 128-bit (4 floats)
AVX | VMOVAPS, VADDPS, VMULPS | 256-bit (8 floats)
AVX-512 | VMOVAPS, VADDPS, VMULPS (ZMM registers) | 512-bit (16 floats)
AVX-512 masking | VADDPS{k}, VMULPS{k} | Per-lane conditional operations
FMA | VFMADD213PS | Fused multiply-add: one rounding step, better precision (see the sketch below)
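
For concreteness, the intrinsics sketch below issues one FMA step of y = a*x + y over eight floats. It assumes an AVX2+FMA target (for example, built with -mavx2 -mfma) and 32-byte-aligned pointers; the function name is illustrative:

    #include <immintrin.h>

    void saxpy8(float* y, const float* x, float a) {
        __m256 va = _mm256_set1_ps(a);      // broadcast a to all 8 lanes
        __m256 vx = _mm256_load_ps(x);      // VMOVAPS: aligned 256-bit load
        __m256 vy = _mm256_load_ps(y);
        vy = _mm256_fmadd_ps(va, vx, vy);   // VFMADD213PS: a*x + y, single rounding
        _mm256_store_ps(y, vy);             // VMOVAPS: aligned 256-bit store
    }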

F. Profiling Tools

Tool | Platform | Description
perf | Linux | Sampling-based system profiler
Valgrind | Linux, macOS | Instrumentation framework (Callgrind/Cachegrind for profiling)
gprof | Unix-like | Function-level profiling
Intel VTune | Cross-platform | Advanced profiling for Intel CPUs
AMD uProf | Cross-platform | Profiling for AMD processors
Visual Studio Profiler | Windows | Integrated profiling in Visual Studio
Instruments | macOS | Apple’s profiling toolkit
Tracy | Cross-platform | Frame profiler for games/graphics
Flame Graphs | Various | Visualization technique for hierarchical profiles

G. Additional Resources

Books

  • “Optimizing Software in C++” by Agner Fog
  • “Computer Systems: A Programmer’s Perspective” by Bryant and O’Hallaron
  • “Engineering a Compiler” by Cooper and Torczon
  • “Performance Analysis and Tuning on Modern CPUs” by Denis Bakhvalov

Online References

Papers

  • “Automatic SIMD Vectorization of Fast Fourier Transforms for the ARM and x86 Architectures” - Mitra et al.
  • “LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation” - Lattner and Adve
  • “Optimizing for Instruction Caches” - McFarling

H. Glossary of Terms

Term | Definition
Aliasing | When multiple pointers can refer to the same memory location
Auto-vectorization | Compiler automatically transforms scalar code to use SIMD instructions
Basic block | Straight-line code sequence with no branches except at entry and exit
Constant folding | Computing constant expressions at compile time
CSE (Common Subexpression Elimination) | Avoiding redundant computation of repeated expressions
DCE (Dead Code Elimination) | Removing code that doesn’t affect program output
Function inlining | Replacing a function call with the function’s body
ILP (Instruction-Level Parallelism) | Multiple instructions executed simultaneously
JIT (Just-in-Time) compilation | Compiling code during program execution
LICM (Loop-Invariant Code Motion) | Moving unchanged calculations outside loops
LTO (Link-Time Optimization) | Optimization across multiple compilation units
PGO (Profile-Guided Optimization) | Using runtime data to guide optimization
Register allocation | Assigning variables to CPU registers
SIMD (Single Instruction, Multiple Data) | Processing multiple data elements with one instruction
Strength reduction | Replacing expensive operations with cheaper ones
TLB (Translation Lookaside Buffer) | Cache for virtual-to-physical address mappings

This appendix serves as a quick reference. For detailed explanation of these concepts, refer to the corresponding chapters in the book.