Chapter 13: Practical Tips for Compiler Optimization

Throughout this book, we’ve explored the inner workings of compilers and how they optimize code. In this chapter, we’ll synthesize that knowledge into practical, actionable tips that you can apply to your code today to help compilers generate more efficient executables.

Writing Compiler-Friendly Code

The way you express your algorithms and data structures can significantly impact how well the compiler can optimize them. Here are key principles for writing compiler-friendly code:

Avoid Pointer Aliasing

Pointer aliasing is one of the biggest barriers to compiler optimization, as we saw in Chapter 10:

// Bad: Potential aliasing prevents optimization
void update_arrays(float* a, float* b, int size) {
    for (int i = 0; i < size; i++) {
        a[i] = a[i] + b[i];  // a and b might overlap
    }
}

// Good: Use restrict keyword to inform the compiler
void update_arrays(float* restrict a, float* restrict b, int size) {
    for (int i = 0; i < size; i++) {
        a[i] = a[i] + b[i];  // Compiler now knows a and b don't overlap
    }
}

Keep Functions Simple and Focused

Simpler functions are easier for compilers to analyze and optimize:

// Hard to optimize - complex function doing too many things
int complex_function(int* data, int size, int flag) {
    int result = 0;
    for (int i = 0; i < size; i++) {
        if (flag == 0) {
            result += data[i] * data[i];
        } else if (flag == 1) {
            result += data[i];
        } else {
            result = result * data[i];
        }
    }
    return result;
}

// Better - split into simpler functions
int sum_squares(int* data, int size) {
    int result = 0;
    for (int i = 0; i < size; i++) {
        result += data[i] * data[i];
    }
    return result;
}

int sum_values(int* data, int size) {
    int result = 0;
    for (int i = 0; i < size; i++) {
        result += data[i];
    }
    return result;
}

int product_with_values(int* data, int size, int initial) {
    int result = initial;
    for (int i = 0; i < size; i++) {
        result = result * data[i];
    }
    return result;
}

Favor Regular Control Flow

Predictable control flow enables many compiler optimizations:

// Hard to optimize - unpredictable branching
int process_data(int* data, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        if (data[i] % 17 == 0 || data[i] % 13 == 0) {
            sum += data[i] * 2;
        } else {
            sum += data[i] / 2;
        }
    }
    return sum;
}

// Better - branch-free version
int process_data_branchless(int* data, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        int is_special = (data[i] % 17 == 0 || data[i] % 13 == 0);
        sum += is_special ? data[i] * 2 : data[i] / 2;
    }
    return sum;
}

Use the Right Data Types

Using appropriate data types helps the compiler generate efficient code:

// Bad: Using int for loop counter when processing a large array
void process_large_array(char* data, size_t size) {
    // May cause overflow on 32-bit systems with large arrays
    for (int i = 0; i < size; i++) {
        data[i] += 1;
    }
}

// Good: Using size_t for array indices
void process_large_array(char* data, size_t size) {
    for (size_t i = 0; i < size; i++) {
        data[i] += 1;
    }
}

// Good: Using appropriate integer types
uint8_t add_bytes(uint8_t a, uint8_t b) {
    return a + b;  // Compiler knows overflow behavior is defined
}

Compiler Directives and Attributes

Modern compilers provide various directives and attributes to guide optimization decisions:

Function Attributes

// GCC/Clang
// Tell compiler this function doesn't throw exceptions
__attribute__((nothrow)) void safe_function();

// Tell compiler this function is pure (no side effects)
__attribute__((pure)) int calculate(int x);

// Tell compiler this function is hot (frequently executed)
__attribute__((hot)) void critical_function();

// Force function inlining
__attribute__((always_inline)) inline int small_function();

// MSVC equivalent directives
__declspec(nothrow) void safe_function();
__forceinline int small_function();

Loop Directives

// GCC/Clang
void process_array(int* a, int* b, int size) {
    // Tell compiler the loop can be vectorized
    #pragma GCC ivdep
    for (int i = 0; i < size; i++) {
        a[i] = b[i] * 2;
    }
}

// OpenMP (widely supported)
void process_array(int* a, int* b, int size) {
    // Explicitly request SIMD vectorization
    #pragma omp simd
    for (int i = 0; i < size; i++) {
        a[i] = b[i] * 2;
    }
}

// MSVC
void process_array(int* a, int* b, int size) {
    // Hint that a and b don't alias
    __assume(a != b);
    for (int i = 0; i < size; i++) {
        a[i] = b[i] * 2;
    }
}

Complete Working Example: Compiler Directive Impact

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define SIZE 100000000
#define ITERATIONS 10

// Baseline implementation
void multiply_add(float* a, float* b, float* c, float* result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = a[i] * b[i] + c[i];
    }
}

// Implementation with restrict
void multiply_add_restrict(float* restrict a, float* restrict b, 
                          float* restrict c, float* restrict result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = a[i] * b[i] + c[i];
    }
}

// Implementation with vectorization hint
void multiply_add_vector(float* restrict a, float* restrict b, 
                        float* restrict c, float* restrict result, int size) {
    #pragma GCC ivdep
    for (int i = 0; i < size; i++) {
        result[i] = a[i] * b[i] + c[i];
    }
}

int main() {
    // Allocate aligned memory for better SIMD performance
    float* a = aligned_alloc(32, SIZE * sizeof(float));
    float* b = aligned_alloc(32, SIZE * sizeof(float));
    float* c = aligned_alloc(32, SIZE * sizeof(float));
    float* result = aligned_alloc(32, SIZE * sizeof(float));
    
    // Initialize arrays
    for (int i = 0; i < SIZE; i++) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
        c[i] = (float)rand() / RAND_MAX;
    }
    
    // Benchmark baseline
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        multiply_add(a, b, c, result, SIZE);
    }
    clock_t end = clock();
    double baseline_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Benchmark with restrict
    start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        multiply_add_restrict(a, b, c, result, SIZE);
    }
    end = clock();
    double restrict_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Benchmark with vectorization hint
    start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        multiply_add_vector(a, b, c, result, SIZE);
    }
    end = clock();
    double vector_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Report results
    printf("Multiply-add operation on %d elements, %d iterations:\n", 
           SIZE, ITERATIONS);
    printf("Baseline time: %.3f seconds\n", baseline_time);
    printf("With restrict: %.3f seconds (%.2fx speedup)\n", 
           restrict_time, baseline_time / restrict_time);
    printf("With vectorization hint: %.3f seconds (%.2fx speedup)\n", 
           vector_time, baseline_time / vector_time);
    
    // Clean up
    free(a);
    free(b);
    free(c);
    free(result);
    
    return 0;
}

Optimization Flag Selection

Choosing the right compiler flags can have a dramatic impact on performance:

GCC/Clang Optimization Levels

# No optimization - good for debugging
gcc -O0 program.c -o program

# Basic optimizations - fast compilation, decent performance
gcc -O1 program.c -o program

# More aggressive optimizations - good balance
gcc -O2 program.c -o program

# Aggressive optimizations including vectorization
gcc -O3 program.c -o program

# Size optimization
gcc -Os program.c -o program

# Maximum performance, may increase code size
gcc -Ofast program.c -o program

Targeted Architecture Options

# Enable all instructions available on the current CPU
gcc -march=native program.c -o program

# Target specific x86-64 feature level
gcc -march=x86-64-v4 program.c -o program

# Enable specific instruction sets
gcc -mavx2 -mfma program.c -o program

MSVC Optimization Flags

# Disable optimization (debug)
cl /Od program.c

# Minimize size
cl /O1 program.c

# Maximize speed
cl /O2 program.c

# Target specific instruction sets
cl /arch:AVX2 program.c

Complete Working Example: Compiler Flag Impact

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>

#define SIZE 10000000
#define ITERATIONS 10

// Function that benefits from vectorization
void vector_operations(float* a, float* b, float* result, int size) {
    for (int i = 0; i < size; i++) {
        result[i] = sqrtf(a[i]) * logf(b[i] + 1.0f);
    }
}

// Function that benefits from branch prediction
int conditional_sum(int* values, int size, int threshold) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        if (values[i] > threshold) {
            sum += values[i];
        } else {
            sum -= values[i];
        }
    }
    return sum;
}

// Function that benefits from loop unrolling
void matrix_multiply(float* A, float* B, float* C, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += A[i*n + k] * B[k*n + j];
            }
            C[i*n + j] = sum;
        }
    }
}

int main() {
    // Allocate memory
    float* a = malloc(SIZE * sizeof(float));
    float* b = malloc(SIZE * sizeof(float));
    float* result = malloc(SIZE * sizeof(float));
    int* values = malloc(SIZE * sizeof(int));
    
    // Matrix size (smaller due to O(n³) complexity)
    int matrix_size = 500;
    float* A = malloc(matrix_size * matrix_size * sizeof(float));
    float* B = malloc(matrix_size * matrix_size * sizeof(float));
    float* C = malloc(matrix_size * matrix_size * sizeof(float));
    
    // Initialize data
    srand(time(NULL));
    for (int i = 0; i < SIZE; i++) {
        a[i] = (float)rand() / RAND_MAX * 10.0f;
        b[i] = (float)rand() / RAND_MAX * 10.0f;
        values[i] = rand() % 1000;
    }
    
    for (int i = 0; i < matrix_size * matrix_size; i++) {
        A[i] = (float)rand() / RAND_MAX;
        B[i] = (float)rand() / RAND_MAX;
    }
    
    printf("Benchmarking with current compiler flags:\n");
    
    // Benchmark vector operations
    clock_t start = clock();
    for (int i = 0; i < ITERATIONS; i++) {
        vector_operations(a, b, result, SIZE);
    }
    clock_t end = clock();
    double vector_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Benchmark conditional operations
    start = clock();
    int sum = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        sum += conditional_sum(values, SIZE, 500);
    }
    end = clock();
    double conditional_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Benchmark matrix multiplication
    start = clock();
    for (int i = 0; i < ITERATIONS / 5; i++) {  // Reduced iterations due to cost
        matrix_multiply(A, B, C, matrix_size);
    }
    end = clock();
    double matrix_time = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    // Report results
    printf("Vector operations: %.3f seconds\n", vector_time);
    printf("Conditional operations: %.3f seconds\n", conditional_time);
    printf("Matrix multiplication: %.3f seconds\n", matrix_time);
    
    printf("\nCompile this program with different flags to compare performance:\n");
    printf("  -O0: No optimization\n");
    printf("  -O1: Basic optimizations\n");
    printf("  -O2: Moderate optimizations\n");
    printf("  -O3: Aggressive optimizations with vectorization\n");
    printf("  -Ofast: Maximum performance (may affect precision)\n");
    printf("  -march=native: Target current CPU architecture\n");
    
    // Clean up
    free(a);
    free(b);
    free(result);
    free(values);
    free(A);
    free(B);
    free(C);
    
    return 0;
}

Profile-Guided Optimization (PGO)

Profile-guided optimization uses runtime information to guide compiler optimizations:

GCC PGO Example

# Step 1: Compile with instrumentation
gcc -O2 -fprofile-generate program.c -o program

# Step 2: Run the program with representative workloads
./program typical_input1
./program typical_input2

# Step 3: Compile with profiling data
gcc -O2 -fprofile-use program.c -o program_optimized

MSVC PGO Example

# Step 1: Compile with instrumentation
cl /O2 /GL /LTCG:PGINSTRUMENT program.c /Fe:program.exe

# Step 2: Run with representative workloads
program.exe typical_input1
program.exe typical_input2

# Step 3: Compile with profiling data
cl /O2 /GL /LTCG:PGOPTIMIZE program.c /Fe:program_optimized.exe

Link-Time Optimization (LTO)

Link-time optimization enables cross-module optimizations:

GCC LTO Example

# Compile and link with LTO enabled
gcc -O2 -flto file1.c file2.c -o program

# Compile object files with LTO
gcc -O2 -flto -c file1.c
gcc -O2 -flto -c file2.c
gcc -O2 -flto file1.o file2.o -o program

MSVC LTO Example

# Enable link-time code generation
cl /O2 /GL file1.c file2.c /link /LTCG

Identifying Optimization Opportunities

To find portions of your code that would benefit most from optimization:

Use Profiling Tools

# GCC profiling with gprof
gcc -pg program.c -o program
./program
gprof program > profile.txt

# MSVC profiling
cl /Zi program.c
devenv program.exe /profile

# Linux perf
perf record ./program
perf report

Compiler Optimization Reports

# GCC optimization reports
gcc -O3 -fopt-info-vec-all program.c -o program

# Clang optimization reports
clang -O3 -Rpass=loop-vectorize program.c -o program

# MSVC optimization reports
cl /O2 /Qvec-report:2 program.c

Common Bottlenecks and Solutions

Memory Access Patterns

// Bad: Column-major access in a row-major language
void transpose_bad(float matrix[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            matrix[j][i] = matrix[i][j];  // Poor cache usage
        }
    }
}

// Good: Cache-friendly blocking
void transpose_good(float matrix[N][N]) {
    for (int i = 0; i < N; i += BLOCK_SIZE) {
        for (int j = 0; j < N; j += BLOCK_SIZE) {
            // Process block
            for (int ii = i; ii < i + BLOCK_SIZE && ii < N; ii++) {
                for (int jj = j; jj < j + BLOCK_SIZE && jj < N; jj++) {
                    float temp = matrix[ii][jj];
                    matrix[ii][jj] = matrix[jj][ii];
                    matrix[jj][ii] = temp;
                }
            }
        }
    }
}

Branch Mispredictions

// Bad: Unpredictable branch
int sum_filtered(int* data, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        if (data[i] % 17 == 0) {  // Unpredictable
            sum += data[i];
        }
    }
    return sum;
}

// Good: Branchless alternative
int sum_filtered_branchless(int* data, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        int matches = (data[i] % 17 == 0);  // 0 or 1
        sum += data[i] * matches;
    }
    return sum;
}

Function Call Overhead

// Bad: Excessive function calls in a hot loop
float process_points(Point* points, int count) {
    float sum = 0;
    for (int i = 0; i < count; i++) {
        sum += distance(points[i], ORIGIN);  // Function call for each point
    }
    return sum;
}

// Good: Inline the calculation
float process_points_inline(Point* points, int count) {
    float sum = 0;
    for (int i = 0; i < count; i++) {
        // Direct calculation
        float dx = points[i].x - ORIGIN.x;
        float dy = points[i].y - ORIGIN.y;
        sum += sqrt(dx*dx + dy*dy);
    }
    return sum;
}

Platform-Specific Optimizations

x86-64 SIMD Intrinsics

#include <immintrin.h>

// Using AVX2 intrinsics
void multiply_arrays_avx2(float* a, float* b, float* result, int size) {
    for (int i = 0; i < size; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vresult = _mm256_mul_ps(va, vb);
        _mm256_storeu_ps(&result[i], vresult);
    }
}

ARM NEON Intrinsics

#include <arm_neon.h>

// Using ARM NEON intrinsics
void multiply_arrays_neon(float* a, float* b, float* result, int size) {
    for (int i = 0; i < size; i += 4) {
        float32x4_t va = vld1q_f32(&a[i]);
        float32x4_t vb = vld1q_f32(&b[i]);
        float32x4_t vresult = vmulq_f32(va, vb);
        vst1q_f32(&result[i], vresult);
    }
}

Standard Library Optimization

Using Optimized Libraries

// Instead of custom implementation:
void matrix_multiply(float* A, float* B, float* C, int n) {
    // Your unoptimized implementation
}

// Use optimized BLAS implementation:
#include <cblas.h>

void matrix_multiply_optimized(float* A, float* B, float* C, int n) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}

String and Memory Operations

// Instead of manual implementation:
void copy_buffer(char* dst, const char* src, size_t size) {
    for (size_t i = 0; i < size; i++) {
        dst[i] = src[i];
    }
}

// Use optimized standard library function:
memcpy(dst, src, size);

Summary

Effective compiler optimization requires a partnership between you and the compiler:

Write Compiler-Friendly Code
- Avoid pointer aliasing
- Keep functions simple and focused
- Use regular control flow and appropriate data types
- Organize data structures for efficient memory access
Provide Additional Information
- Use language features like restrict, const, and inline
- Apply compiler directives and pragmas
- Use attributes to convey function properties
Choose the Right Compilation Strategy
- Select appropriate optimization flags
- Consider profile-guided optimization
- Leverage link-time optimization
- Target specific CPU architectures
Identify and Address Performance Bottlenecks
- Use profiling tools to find hotspots
- Analyze compiler optimization reports
- Focus on memory access patterns, branch prediction, and function call overhead
Balance Optimization and Maintainability
- Ensure optimized code remains readable and maintainable
- Document optimization techniques and their rationale
- Consider platform-specific optimizations when necessary

Remember that the most effective optimizations come from understanding both your code and how the compiler works. By following these practical tips, you can help the compiler generate more efficient code while maintaining readability and portability.