Chapter 4: The Mathematical Prowess of Compilers

Modern compilers are remarkably sophisticated when it comes to optimizing mathematical operations. They can transform seemingly simple arithmetic into highly efficient machine code, often outperforming hand-written assembly. This chapter explores how compilers optimize mathematical operations and how we can write code to take advantage of these optimizations.

The Power of the `lea` Instruction

The lea (Load Effective Address) instruction is one of the most versatile instructions in the x86-64 instruction set. While its primary purpose is address calculation, compilers have learned to use it for efficient arithmetic operations.

Basic `lea` Usage

// Original C code
int calculate(int x) {
    return x * 5 + 7;
}

// Compiler-generated assembly (GCC -O2)
calculate:
    lea    eax, [rdi + rdi*4]  ; x * 5
    add    eax, 7              ; + 7
    ret

The lea instruction here performs the multiplication by 5 in a single instruction, using the addressing mode [rdi + rdi*4] which is equivalent to x + x*4 = x*5.

Advanced `lea` Patterns

Compilers can use lea for more complex calculations:

// Original C code
int complex_calc(int x, int y) {
    return x * 9 + y * 7 + 3;
}

// Compiler-generated assembly (GCC -O2)
complex_calc:
    lea    eax, [rdi + rdi*8]    ; x * 9
    lea    ecx, [rsi + rsi*2]    ; y * 3
    lea    eax, [rax + rcx*2]    ; + y * 6
    add    eax, 3                ; + 3
    ret

The compiler breaks down the calculation into multiple lea instructions:

x * 9 using [rdi + rdi*8]
y * 7 as y * 3 + y * 4 using two lea instructions
Final addition of 3

Complete Working Example: Matrix Multiplication

Let’s examine how compilers optimize a real-world mathematical operation:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 1024

void matrix_multiply(int A[N][N], int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++) {
                sum += A[i][k] * B[k][j];
            }
            C[i][j] = sum;
        }
    }
}

int main() {
    // Allocate and initialize matrices
    int (*A)[N] = malloc(N * N * sizeof(int));
    int (*B)[N] = malloc(N * N * sizeof(int));
    int (*C)[N] = malloc(N * N * sizeof(int));
    
    // Initialize with random values
    srand(time(NULL));
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i][j] = rand() % 100;
            B[i][j] = rand() % 100;
        }
    }
    
    // Time the multiplication
    clock_t start = clock();
    matrix_multiply(A, B, C);
    clock_t end = clock();
    
    double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Matrix multiplication took %f seconds\n", time_taken);
    
    // Clean up
    free(A);
    free(B);
    free(C);
    
    return 0;
}

When compiled with optimizations (gcc -O3 -march=native), the compiler generates highly optimized code using:

Loop unrolling
SIMD instructions (AVX/AVX2)
Cache-friendly access patterns
Register allocation optimizations

Division and Modulus Optimization

Division and modulus operations are expensive on modern processors. Compilers employ several strategies to optimize these operations.

Division by Constants

// Original C code
int divide_by_13(int x) {
    return x / 13;
}

// Compiler-generated assembly (GCC -O2)
divide_by_13:
    mov    eax, edi
    mov    ecx, 13
    imul   rax, rcx
    sar    rax, 32
    mov    edx, edi
    sar    edx, 31
    sub    eax, edx
    ret

The compiler transforms the division into a multiplication by the reciprocal, followed by some adjustments. This is much faster than using the div instruction.

Complete Working Example: Prime Number Sieve

Let’s examine how compilers optimize a more complex mathematical operation:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

void sieve_of_eratosthenes(int limit) {
    unsigned char *is_prime = calloc(limit + 1, sizeof(unsigned char));
    if (!is_prime) {
        fprintf(stderr, "Memory allocation failed\n");
        return;
    }
    
    // Initialize array
    for (int i = 2; i <= limit; i++) {
        is_prime[i] = 1;
    }
    
    // Sieve
    for (int i = 2; i * i <= limit; i++) {
        if (is_prime[i]) {
            for (int j = i * i; j <= limit; j += i) {
                is_prime[j] = 0;
            }
        }
    }
    
    // Count primes
    int count = 0;
    for (int i = 2; i <= limit; i++) {
        if (is_prime[i]) {
            count++;
        }
    }
    
    printf("Found %d primes up to %d\n", count, limit);
    free(is_prime);
}

int main() {
    const int limit = 100000000;
    
    clock_t start = clock();
    sieve_of_eratosthenes(limit);
    clock_t end = clock();
    
    double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Sieve took %f seconds\n", time_taken);
    
    return 0;
}

The compiler optimizes this code by:

Using bit operations instead of modulus
Unrolling the inner loop
Optimizing memory access patterns
Using SIMD instructions where possible

Floating-Point Optimization

Floating-point operations present unique optimization challenges and opportunities.

Fused Multiply-Add (FMA)

Modern processors support FMA instructions that perform a * b + c in a single operation:

// Original C code
float polynomial(float x) {
    return 3.0f * x * x + 2.0f * x + 1.0f;
}

// Compiler-generated assembly (GCC -O3 -march=native)
polynomial:
    vmulss  xmm1, xmm0, xmm0    ; x * x
    vmovss  xmm2, DWORD PTR .LC0[rip]  ; 3.0
    vfmadd213ss xmm1, xmm2, xmm0  ; 3.0 * x^2 + x
    vaddss  xmm0, xmm1, DWORD PTR .LC1[rip]  ; + 1.0
    ret

The compiler uses FMA instructions to combine multiplication and addition, reducing instruction count and improving accuracy.

Complete Working Example: Fast Fourier Transform

Let’s examine a more complex floating-point operation:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <complex.h>
#include <time.h>

#define PI 3.14159265358979323846

void fft(complex double *x, int n) {
    if (n <= 1) return;
    
    // Divide
    complex double *even = malloc(n/2 * sizeof(complex double));
    complex double *odd = malloc(n/2 * sizeof(complex double));
    
    for (int i = 0; i < n/2; i++) {
        even[i] = x[2*i];
        odd[i] = x[2*i+1];
    }
    
    // Conquer
    fft(even, n/2);
    fft(odd, n/2);
    
    // Combine
    for (int k = 0; k < n/2; k++) {
        complex double t = cexp(-2.0 * PI * I * k / n) * odd[k];
        x[k] = even[k] + t;
        x[k + n/2] = even[k] - t;
    }
    
    free(even);
    free(odd);
}

int main() {
    const int n = 1024;
    complex double *x = malloc(n * sizeof(complex double));
    
    // Initialize with random values
    srand(time(NULL));
    for (int i = 0; i < n; i++) {
        x[i] = (double)rand()/RAND_MAX + I * (double)rand()/RAND_MAX;
    }
    
    clock_t start = clock();
    fft(x, n);
    clock_t end = clock();
    
    double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("FFT took %f seconds\n", time_taken);
    
    free(x);
    return 0;
}

The compiler optimizes this code by:

Using SIMD instructions for complex arithmetic
Optimizing memory access patterns
Inlining recursive calls where possible
Using FMA instructions for complex multiplication

Bit Manipulation Optimization

Compilers excel at optimizing bit manipulation operations:

// Original C code
int count_set_bits(uint32_t x) {
    int count = 0;
    while (x) {
        count += x & 1;
        x >>= 1;
    }
    return count;
}

// Compiler-generated assembly (GCC -O3 -mpopcnt)
count_set_bits:
    popcnt  eax, edi
    ret

The compiler recognizes this as a population count operation and uses the dedicated popcnt instruction.

Complete Working Example: Bitboard Operations

Let’s examine a more complex bit manipulation example:

#include <stdio.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t Bitboard;

// Bitboard operations
Bitboard get_knight_attacks(int square) {
    static const Bitboard knight_moves[64] = {
        // Pre-calculated knight move bitboards
        0x0000000000020400, 0x0000000000050800, // ... and so on
    };
    return knight_moves[square];
}

Bitboard get_king_attacks(int square) {
    static const Bitboard king_moves[64] = {
        // Pre-calculated king move bitboards
        0x0000000000000302, 0x0000000000000705, // ... and so on
    };
    return king_moves[square];
}

// Population count using SWAR algorithm
int popcount(Bitboard bb) {
    bb = bb - ((bb >> 1) & 0x5555555555555555);
    bb = (bb & 0x3333333333333333) + ((bb >> 2) & 0x3333333333333333);
    bb = (bb + (bb >> 4)) & 0x0F0F0F0F0F0F0F0F;
    return (bb * 0x0101010101010101) >> 56;
}

// Find least significant bit
int lsb(Bitboard bb) {
    return __builtin_ctzll(bb);
}

// Find most significant bit
int msb(Bitboard bb) {
    return 63 - __builtin_clzll(bb);
}

int main() {
    Bitboard board = 0x0000000000000001;
    int total_moves = 0;
    
    clock_t start = clock();
    
    // Generate all possible knight moves from each square
    for (int i = 0; i < 64; i++) {
        Bitboard attacks = get_knight_attacks(i);
        total_moves += popcount(attacks);
    }
    
    clock_t end = clock();
    double time_taken = ((double)(end - start)) / CLOCKS_PER_SEC;
    
    printf("Generated %d knight moves in %f seconds\n", 
           total_moves, time_taken);
    
    return 0;
}

The compiler optimizes this code by:

Using built-in bit manipulation instructions
Optimizing the SWAR population count algorithm
Inlining small functions
Using SIMD instructions where applicable

Summary

Modern compilers are incredibly sophisticated at optimizing mathematical operations. They can:

Transform divisions into multiplications
Use specialized instructions like lea and FMA
Optimize bit manipulation operations
Generate SIMD code for vector operations
Optimize memory access patterns
Inline and unroll loops effectively

To take full advantage of these optimizations:

Write clear, straightforward code
Use constants where possible
Avoid unnecessary type conversions
Structure loops for optimal vectorization
Be aware of the target architecture’s capabilities

Remember that while compilers are powerful, they’re not perfect. Sometimes manual optimization is necessary, but it should always be guided by profiling data and a deep understanding of both the compiler’s capabilities and the target architecture.

The Power of the lea Instruction

Basic lea Usage

Advanced lea Patterns

Complete Working Example: Matrix Multiplication

Division and Modulus Optimization

Division by Constants

Complete Working Example: Prime Number Sieve

Floating-Point Optimization

Fused Multiply-Add (FMA)

Complete Working Example: Fast Fourier Transform

Bit Manipulation Optimization

Complete Working Example: Bitboard Operations

Summary

The Power of the `lea` Instruction

Basic `lea` Usage

Advanced `lea` Patterns