Chapter 8: Profiling and Optimizing Hotspots

Identify and eliminate performance bottlenecks in your tensor code. Learn systematic approaches to profiling and optimization that yield substantial speedups.

You Will Learn To…

  • Identify performance bottlenecks in tensor code using profiling tools
  • Distinguish between memory-bound and compute-bound operations
  • Apply targeted optimizations to critical code paths
  • Measure and validate performance improvements
  • Develop a systematic approach to performance tuning

Introduction

I once spent two weeks optimizing a tensor convolution operation that was the bottleneck in an image processing pipeline. The initial implementation was clean, readable, and painfully slow. After profiling, I discovered that 87% of execution time was spent in a single nested loop with poor cache utilization. A combination of loop tiling, data layout changes, and compiler directive tuning resulted in a 19x speedup—without touching 90% of the codebase.

That’s the reality of high-performance tensor programming: a small fraction of your code typically consumes the vast majority of execution time. This chapter is about finding those critical sections and systematically improving them.

Understanding Performance Bottlenecks

Before diving into profiling tools, let’s understand the two fundamental types of performance bottlenecks in tensor operations:

Compute-Bound Operations

Compute-bound operations are limited by the CPU’s arithmetic throughput: data arrives from memory faster than the processor can consume it, so the execution units are the bottleneck. Examples include:

  • Complex mathematical functions (exp, log, trigonometric functions)
  • Dense matrix multiplication with small matrices that fit in cache
  • Operations with high arithmetic intensity (many calculations per memory access)

Memory-Bound Operations

Memory-bound operations are limited by memory bandwidth or latency. The CPU spends time waiting for data to arrive from memory. Examples include:

  • Large tensor operations that don’t fit in cache
  • Operations with poor spatial locality (scattered memory access patterns)
  • Operations with low arithmetic intensity (few calculations per memory access)

Most tensor operations in practice are memory-bound, especially for large tensors. This is why memory layout and access patterns are critical for performance.
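
A quick way to tell the two apart on paper is arithmetic intensity: floating-point operations per byte of memory traffic. The sketch below is a back-of-the-envelope illustration, not a measurement; the “3 floats of traffic per element” accounting and the n = 1000 size are simplifying assumptions.

#include <stdio.h>

// Rough arithmetic-intensity estimates (FLOPs per byte of memory traffic).
// Kernels well below ~1 FLOP/byte are almost always memory-bound.
int main(void) {
    double n = 1000.0;

    // Element-wise add: n*n additions; roughly 3 floats of traffic per
    // element (read a, read b, write result)
    double add = (n * n) / (3.0 * n * n * sizeof(float));

    // Matrix multiply: 2*n^3 FLOPs. With perfect cache reuse only ~3*n*n
    // floats move; with no reuse, closer to n^3 floats are re-fetched.
    double mm_best  = (2.0 * n * n * n) / (3.0 * n * n * sizeof(float));
    double mm_worst = (2.0 * n * n * n) / (n * n * n * sizeof(float));

    printf("tensor_add:         %.3f FLOPs/byte (memory-bound)\n", add);
    printf("matmul, full reuse: %.0f FLOPs/byte (compute-bound)\n", mm_best);
    printf("matmul, no reuse:   %.3f FLOPs/byte (memory-bound)\n", mm_worst);
    return 0;
}

The same matrix multiplication can therefore be compute-bound with good blocking and memory-bound without it, which is exactly what the rest of this chapter exploits.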

Profiling Tools for C Tensor Programs

Using gprof for Function-Level Profiling

GNU’s gprof provides a straightforward way to identify which functions consume the most time:

# Compile with profiling information
gcc -pg -g -O2 tensor_ops.c -o tensor_program -lm

# Run the program (creates gmon.out)
./tensor_program

# Generate and view the profile
gprof ./tensor_program gmon.out > profile.txt
less profile.txt

A typical gprof output looks like this:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 68.32      0.41     0.41     1000     0.41     0.41  tensor_matmul
 21.67      0.54     0.13    10000     0.01     0.01  tensor_add
  6.67      0.58     0.04     1000     0.04     0.04  tensor_transpose
  3.34      0.60     0.02     3000     0.01     0.01  create_tensor

This output tells us that tensor_matmul is consuming 68.32% of the execution time, making it the primary optimization target.

Using perf for Detailed CPU Analysis

Linux’s perf tool provides deeper insights, including cache misses and branch prediction failures:

# Compile with debug symbols but optimized code
gcc -g -O2 tensor_ops.c -o tensor_program -lm

# Run perf record
perf record -g ./tensor_program

# Analyze the results
perf report

A sample perf report might look like:

# Samples: 3K of event 'cycles'
# Event count (approx.): 2,847,558,934
#
# Overhead  Command          Symbol
# ........  ...............  ..............................
#
    64.32%  tensor_program   tensor_matmul
            |--62.14%--tensor_matmul
            |          |--58.21%--0x4008f2
            |          |--3.93%--[...]  
    19.45%  tensor_program   tensor_add
    8.12%   tensor_program   [kernel.kallsyms]
    5.67%   tensor_program   tensor_transpose

For more detailed cache behavior, you can use specific events:

perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program

This might produce output like:

 Performance counter stats for './tensor_program':

      2,847,558,934      cycles
      4,921,223,611      instructions     #    1.73  insn per cycle
        324,562,371      cache-references
        128,945,221      cache-misses     #   39.7% of all cache refs

       1.023552562 seconds time elapsed

The high cache miss rate (39.7%) suggests this is a memory-bound application that could benefit from improved memory access patterns.
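
When perf report singles out one function, perf annotate can drill down to the hottest instructions inside it (assuming the binary was built with -g so source lines can be interleaved):

# Drill into a single hot symbol at the instruction level
perf annotate --stdio tensor_matmul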

Using Cachegrind for Cache Simulation

Valgrind’s Cachegrind tool simulates the CPU cache to provide detailed information about cache behavior:

# Run under Cachegrind (writes a cachegrind.out.<pid> file)
valgrind --tool=cachegrind ./tensor_program

# Annotate the results (substitute the actual PID from the output file name)
cg_annotate cachegrind.out.12345

This produces a detailed report of cache misses by function and line number:

--------------------------------------------------------------------------------
-- Auto-annotated source: tensor_ops.c
--------------------------------------------------------------------------------
      Ir        Dr       D1mr      DLmr       Dw       D1mw      DLmw  

       .         .         .         .         .         .         .    void tensor_matmul(Tensor *result, const Tensor *a, const Tensor *b) {
       .         .         .         .         .         .         .        int m = a->dims[0];
       .         .         .         .         .         .         .        int n = a->dims[1];
       .         .         .         .         .         .         .        int p = b->dims[1];
       .         .         .         .         .         .         .    
       .         .         .         .         .         .         .        for (int i = 0; i < m; i++) {
       .         .         .         .         .         .         .            for (int j = 0; j < p; j++) {
       .         .         .         .         .         .         .                float sum = 0.0f;
 9,600,000 4,800,000   600,000   600,000         .         .         .                for (int k = 0; k < n; k++) {
 9,600,000 9,600,000 1,200,000   600,000         .         .         .                    sum += a->data[i * n + k] * b->data[k * p + j];
       .         .         .         .         .         .         .                }
       .         .         .         .    960,000   120,000   120,000                result->data[i * p + j] = sum;
       .         .         .         .         .         .         .            }
       .         .         .         .         .         .         .        }
       .         .         .         .         .         .         .    }

This shows that the inner loop of tensor_matmul has a high number of cache misses (D1mr and DLmr columns), particularly when accessing b->data[k * p + j], which has a strided access pattern.
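
By default Cachegrind models caches based on the host CPU. If you are tuning for a different target, you can specify the geometry explicitly; the values below (32 KiB 8-way L1, 8 MiB 16-way last-level cache, 64-byte lines) are example numbers, not a recommendation:

# Simulate a specific cache geometry: size,associativity,line size (bytes)
valgrind --tool=cachegrind --D1=32768,8,64 --LL=8388608,16,64 ./tensor_program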

Optimizing Memory-Bound Operations

Now that we’ve identified our bottlenecks, let’s apply targeted optimizations for memory-bound operations.

Loop Tiling (Blocking)

Loop tiling improves cache utilization by operating on small blocks of data that fit in cache:

// Original matrix multiplication (poor cache utilization)
void tensor_matmul_naive(Tensor *result, const Tensor *a, const Tensor *b) {
    int m = a->dims[0];
    int n = a->dims[1];
    int p = b->dims[1];
    
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                sum += a->data[i * n + k] * b->data[k * p + j];
            }
            result->data[i * p + j] = sum;
        }
    }
}

// Tiled matrix multiplication (better cache utilization)
void tensor_matmul_tiled(Tensor *result, const Tensor *a, const Tensor *b) {
    int m = a->dims[0];
    int n = a->dims[1];
    int p = b->dims[1];
    
    // Clear result tensor
    memset(result->data, 0, m * p * sizeof(float));
    
    // Choose tile sizes based on cache size
    // These values should be tuned for your specific hardware
    const int TILE_M = 32;
    const int TILE_N = 32;
    const int TILE_P = 32;
    
    for (int i0 = 0; i0 < m; i0 += TILE_M) {
        int i_end = i0 + TILE_M < m ? i0 + TILE_M : m;
        
        for (int j0 = 0; j0 < p; j0 += TILE_P) {
            int j_end = j0 + TILE_P < p ? j0 + TILE_P : p;
            
            for (int k0 = 0; k0 < n; k0 += TILE_N) {
                int k_end = k0 + TILE_N < n ? k0 + TILE_N : n;
                
                // Process a tile
                for (int i = i0; i < i_end; i++) {
                    for (int j = j0; j < j_end; j++) {
                        float sum = result->data[i * p + j];
                        
                        for (int k = k0; k < k_end; k++) {
                            sum += a->data[i * n + k] * b->data[k * p + j];
                        }
                        
                        result->data[i * p + j] = sum;
                    }
                }
            }
        }
    }
}

The tiled version processes small blocks of the matrices at a time, improving cache locality. The tile sizes should be tuned based on your CPU’s cache size.
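
If you want a starting point rather than a hard-coded 32, you can derive one from the L1 data cache size at run time. The sketch below assumes Linux/glibc, where sysconf exposes _SC_LEVEL1_DCACHE_SIZE; the “three tiles must fit in L1” rule and the rounding to a multiple of 8 are heuristics to validate with benchmarks, not guarantees:

#include <unistd.h>
#include <math.h>

// Heuristic tile size: three square float tiles (blocks of A, B, and the
// result) should fit in the L1 data cache. Link with -lm for sqrt().
static int pick_tile_size(void) {
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);  // bytes; <= 0 if unknown
    if (l1 <= 0) l1 = 32 * 1024;                // fall back to a common 32 KiB L1

    int tile = (int)sqrt((double)l1 / (3.0 * sizeof(float)));
    tile -= tile % 8;                           // keep the inner loop a multiple of 8
    return tile >= 8 ? tile : 8;
}

In practice you would still benchmark a few sizes around this value, because the best tile also depends on associativity and on everything else the loop touches.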

Data Layout Transformation

Changing the data layout can significantly improve memory access patterns:

// Original row-major matrix multiplication (poor cache locality for B)
void tensor_matmul_row_major(Tensor *result, const Tensor *a, const Tensor *b) {
    int m = a->dims[0];
    int n = a->dims[1];
    int p = b->dims[1];
    
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                // a->data[i * n + k] has good spatial locality
                // b->data[k * p + j] has poor spatial locality (strided access)
                sum += a->data[i * n + k] * b->data[k * p + j];
            }
            result->data[i * p + j] = sum;
        }
    }
}

// Transposed matrix multiplication (better cache locality)
void tensor_matmul_transposed(Tensor *result, const Tensor *a, const Tensor *b) {
    int m = a->dims[0];
    int n = a->dims[1];
    int p = b->dims[1];
    
    // Create a transposed copy of B for better cache locality
    Tensor *b_trans = create_tensor(p, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            b_trans->data[j * n + i] = b->data[i * p + j];
        }
    }
    
    // Perform multiplication with transposed B
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j++) {
            float sum = 0.0f;
            for (int k = 0; k < n; k++) {
                // Both accesses now have good spatial locality
                sum += a->data[i * n + k] * b_trans->data[j * n + k];
            }
            result->data[i * p + j] = sum;
        }
    }
    
    free_tensor(b_trans);
}

By transposing matrix B, we convert the strided access pattern into a contiguous one, significantly improving cache utilization. The transpose itself costs an extra pass over B and a temporary copy, but for large matrices that cost is quickly amortized across the much larger amount of work in the multiplication.

Loop Interchange

Changing the order of nested loops can improve memory access patterns:

// Original loop order (poor spatial locality)
void tensor_transpose_naive(Tensor *result, const Tensor *input) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            // Strided writes to result
            result->data[j * rows + i] = input->data[i * cols + j];
        }
    }
}

// Interchanged loop order (sequential writes to result)
void tensor_transpose_interchanged(Tensor *result, const Tensor *input) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    
    for (int j = 0; j < cols; j++) {
        for (int i = 0; i < rows; i++) {
            // Sequential writes to result; reads from input are now strided
            result->data[j * rows + i] = input->data[i * cols + j];
        }
    }
}

Interchanging the loops makes the writes to result sequential, at the cost of strided reads from input: a plain transpose always has a strided access on one side. Which ordering wins depends on the matrix shape and the hardware, and for large matrices the robust fix is to combine loop interchange with blocking, as you will do in Exercise 2.

Loop Unrolling

Unrolling loops reduces branch mispredictions and allows for better instruction-level parallelism:

// Original loop
void tensor_scale_naive(Tensor *result, const Tensor *input, float scale) {
    int size = input->dims[0] * input->dims[1];
    
    for (int i = 0; i < size; i++) {
        result->data[i] = input->data[i] * scale;
    }
}

// Unrolled loop (4x)
void tensor_scale_unrolled(Tensor *result, const Tensor *input, float scale) {
    int size = input->dims[0] * input->dims[1];
    int i = 0;
    
    // Process 4 elements at a time
    for (; i < size - 3; i += 4) {
        result->data[i] = input->data[i] * scale;
        result->data[i+1] = input->data[i+1] * scale;
        result->data[i+2] = input->data[i+2] * scale;
        result->data[i+3] = input->data[i+3] * scale;
    }
    
    // Handle remaining elements
    for (; i < size; i++) {
        result->data[i] = input->data[i] * scale;
    }
}

Loop unrolling reduces branch overhead and allows the compiler to better schedule instructions.
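
Before unrolling by hand, check whether the compiler already unrolls and vectorizes the loop at higher optimization levels. With GCC you can request a vectorization report (the exact flag and output format vary by compiler and version):

# Ask GCC to report which loops it managed to vectorize (GCC 4.9 and later)
gcc -O3 -march=native -fopt-info-vec tensor_ops.c -o tensor_program -lm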

Optimizing Compute-Bound Operations

For compute-bound operations, focus on reducing the number of calculations or using more efficient algorithms.

Fast Approximations

For mathematical functions like exp, log, or trigonometric functions, consider using faster approximations:

// Original implementation using standard math functions
void tensor_sigmoid_naive(Tensor *result, const Tensor *input) {
    int size = input->dims[0] * input->dims[1];
    
    for (int i = 0; i < size; i++) {
        // exp() is computationally expensive
        result->data[i] = 1.0f / (1.0f + expf(-input->data[i]));
    }
}

// Fast approximation of sigmoid function
float fast_sigmoid(float x) {
    // Piecewise approximation of sigmoid
    if (x < -5.0f) return 0.0f;
    if (x > 5.0f) return 1.0f;
    
    // Approximation in the range [-5, 5]
    float x2 = x * x;
    float e = 1.0f + fabsf(x) + x2 * 0.555f + x2 * x2 * 0.143f;
    
    return (x > 0.0f) ? (e / (e + 1.0f)) : (1.0f / (e + 1.0f));
}

void tensor_sigmoid_fast(Tensor *result, const Tensor *input) {
    int size = input->dims[0] * input->dims[1];
    
    for (int i = 0; i < size; i++) {
        result->data[i] = fast_sigmoid(input->data[i]);
    }
}

The fast approximation sacrifices some accuracy for significant performance gains.
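
Whenever you substitute an approximation, quantify the accuracy you are giving up. The small harness below, which assumes fast_sigmoid from above is in scope, scans an input range and reports the worst absolute error against the libm reference:

#include <math.h>
#include <stdio.h>

// Compare fast_sigmoid() against the exact formula over [-8, 8]
int main(void) {
    float max_err = 0.0f, worst_x = 0.0f;

    for (float x = -8.0f; x <= 8.0f; x += 0.001f) {
        float exact  = 1.0f / (1.0f + expf(-x));
        float approx = fast_sigmoid(x);
        float err = fabsf(exact - approx);
        if (err > max_err) { max_err = err; worst_x = x; }
    }

    printf("max abs error %g at x = %g\n", max_err, worst_x);
    return 0;
}

If the worst-case error is acceptable for your application (it often is for activation functions), the speedup is usually worth it; if not, tighten the approximation before reaching for other optimizations.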

Strength Reduction

Replacing expensive operations with cheaper ones can improve performance:

// Original implementation with division
void tensor_normalize_naive(Tensor *result, const Tensor *input) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    
    for (int i = 0; i < rows; i++) {
        // Find maximum in this row
        float max_val = 0.0f;
        for (int j = 0; j < cols; j++) {
            float val = fabsf(input->data[i * cols + j]);
            if (val > max_val) max_val = val;
        }
        
        // Normalize row by maximum (division is expensive)
        for (int j = 0; j < cols; j++) {
            result->data[i * cols + j] = input->data[i * cols + j] / max_val;
        }
    }
}

// Optimized implementation with multiplication
void tensor_normalize_optimized(Tensor *result, const Tensor *input) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    
    for (int i = 0; i < rows; i++) {
        // Find maximum in this row
        float max_val = 0.0f;
        for (int j = 0; j < cols; j++) {
            float val = fabsf(input->data[i * cols + j]);
            if (val > max_val) max_val = val;
        }
        
        // Precompute the reciprocal (one division instead of many);
        // guard against an all-zero row to avoid dividing by zero
        if (max_val == 0.0f) max_val = 1.0f;
        float recip_max = 1.0f / max_val;
        
        // Normalize row by maximum (multiplication is cheaper than division)
        for (int j = 0; j < cols; j++) {
            result->data[i * cols + j] = input->data[i * cols + j] * recip_max;
        }
    }
}

By precomputing the reciprocal, we replace multiple divisions with a single division and multiple multiplications.

Compiler Optimizations

Modern compilers can perform many optimizations automatically, but they need guidance:

Using Compiler Flags

# Basic optimization
gcc -O2 tensor_ops.c -o tensor_program -lm

# Aggressive optimization
gcc -O3 -march=native -ffast-math tensor_ops.c -o tensor_program -lm

The -march=native flag enables CPU-specific instructions (such as AVX where available), while -ffast-math relaxes IEEE 754 floating-point rules: it allows reassociation and assumes no NaNs or infinities, trading bit-exact reproducibility (and potentially some accuracy) for speed. Verify your numerical results before relying on it.

Using Compiler Directives

Compiler directives can provide hints to the compiler:

void tensor_add_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
    int size = a->dims[0] * a->dims[1];
    
    // Tell the compiler these pointers don't alias (don't overlap)
    float * __restrict__ result_data = result->data;
    float * __restrict__ a_data = a->data;
    float * __restrict__ b_data = b->data;
    
    #pragma GCC ivdep  // Tell GCC to ignore potential vector dependencies
    for (int i = 0; i < size; i++) {
        result_data[i] = a_data[i] + b_data[i];
    }
}

The __restrict__ keyword (spelled restrict in standard C99 and later) promises the compiler that the pointers don’t alias, enabling more aggressive optimization. The #pragma GCC ivdep directive tells GCC to ignore potential vector dependencies between loop iterations.

Measuring Performance Improvements

Always measure the impact of your optimizations to ensure they’re actually helping:

#include <time.h>

double benchmark_function(void (*func)(Tensor*, const Tensor*, const Tensor*),
                         Tensor *result, const Tensor *a, const Tensor *b,
                         int iterations) {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    
    for (int i = 0; i < iterations; i++) {
        func(result, a, b);
    }
    
    clock_gettime(CLOCK_MONOTONIC, &end);
    
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    return elapsed / iterations;  // Average time per call
}

// Usage example
int main() {
    // Create test tensors
    Tensor *a = create_tensor(1000, 1000);
    Tensor *b = create_tensor(1000, 1000);
    Tensor *result = create_tensor(1000, 1000);
    
    // Initialize with random data
    // ...
    
    // Benchmark different implementations
    double time_naive = benchmark_function(tensor_matmul_naive, result, a, b, 10);
    double time_tiled = benchmark_function(tensor_matmul_tiled, result, a, b, 10);
    double time_transposed = benchmark_function(tensor_matmul_transposed, result, a, b, 10);
    
    printf("Naive implementation: %.6f seconds per call\n", time_naive);
    printf("Tiled implementation: %.6f seconds per call\n", time_tiled);
    printf("Transposed implementation: %.6f seconds per call\n", time_transposed);
    
    printf("Speedup (tiled vs naive): %.2fx\n", time_naive / time_tiled);
    printf("Speedup (transposed vs naive): %.2fx\n", time_naive / time_transposed);
    
    // Clean up
    free_tensor(a);
    free_tensor(b);
    free_tensor(result);
    
    return 0;
}

This benchmark function measures the average execution time of different implementations, allowing you to compare their performance.
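
Two practical caveats are not shown above. First, call the function once outside the timed region so caches and branch predictors are warm. Second, make sure the output is actually consumed, or an aggressive optimizer may delete work it can prove is unused. A minimal sketch of both ideas, to be spliced into main around the benchmark calls (the checksum is just one way to keep the result live):

// Warm-up call, not timed
tensor_matmul_naive(result, a, b);

// ... run the timed benchmarks as above ...

// Consume the output so the compiler cannot discard the computation
double checksum = 0.0;
for (int i = 0; i < 1000 * 1000; i++) checksum += result->data[i];
printf("checksum: %f\n", checksum);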

Case Study: Optimizing Matrix Multiplication

Let’s walk through a complete optimization process for matrix multiplication:

Step 1: Profile the Initial Implementation

perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program

Output:

Performance counter stats for './tensor_program':

      8,245,123,456      cycles
      6,123,456,789      instructions     #    0.74  insn per cycle
        924,567,890      cache-references
        412,345,678      cache-misses     #   44.6% of all cache refs

       3.245678901 seconds time elapsed

The high cache miss rate (44.6%) and low instructions per cycle (0.74) suggest this is memory-bound.
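
To confirm that these misses actually reach main memory rather than stopping at an intermediate cache level, you can also count last-level-cache and TLB events. The generic event names below are widely supported by perf, though availability differs between CPUs:

perf stat -e LLC-loads,LLC-load-misses,dTLB-load-misses ./tensor_program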

Step 2: Analyze Memory Access Patterns

Using Cachegrind:

valgrind --tool=cachegrind ./tensor_program

The output shows that the inner loop of matrix multiplication has poor spatial locality when accessing the second matrix.

Step 3: Apply Optimizations

  1. Transpose the second matrix for better cache locality
  2. Apply loop tiling to improve cache utilization
  3. Unroll the innermost loop to reduce branch overhead

Combining all three:

void tensor_matmul_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
    int m = a->dims[0];
    int n = a->dims[1];
    int p = b->dims[1];
    
    // Create a transposed copy of B
    Tensor *b_trans = create_tensor(p, n);
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            b_trans->data[j * n + i] = b->data[i * p + j];
        }
    }
    
    // Clear result tensor
    memset(result->data, 0, m * p * sizeof(float));
    
    // Tiling parameters
    const int TILE_M = 32;
    const int TILE_N = 32;
    const int TILE_P = 32;
    
    // Apply tiling
    for (int i0 = 0; i0 < m; i0 += TILE_M) {
        int i_end = i0 + TILE_M < m ? i0 + TILE_M : m;
        
        for (int j0 = 0; j0 < p; j0 += TILE_P) {
            int j_end = j0 + TILE_P < p ? j0 + TILE_P : p;
            
            for (int k0 = 0; k0 < n; k0 += TILE_N) {
                int k_end = k0 + TILE_N < n ? k0 + TILE_N : n;
                
                // Process a tile
                for (int i = i0; i < i_end; i++) {
                    for (int j = j0; j < j_end; j++) {
                        float sum = result->data[i * p + j];
                        int k = k0;
                        
                        // Unroll the innermost loop (4x)
                        for (; k < k_end - 3; k += 4) {
                            sum += a->data[i * n + k] * b_trans->data[j * n + k] +
                                  a->data[i * n + k+1] * b_trans->data[j * n + k+1] +
                                  a->data[i * n + k+2] * b_trans->data[j * n + k+2] +
                                  a->data[i * n + k+3] * b_trans->data[j * n + k+3];
                        }
                        
                        // Handle remaining elements
                        for (; k < k_end; k++) {
                            sum += a->data[i * n + k] * b_trans->data[j * n + k];
                        }
                        
                        result->data[i * p + j] = sum;
                    }
                }
            }
        }
    }
    
    free_tensor(b_trans);
}

Step 4: Measure Performance Improvement

perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program_optimized

Output:

Performance counter stats for './tensor_program_optimized':

      2,145,678,901      cycles
      5,678,901,234      instructions     #    2.65  insn per cycle
        345,678,901      cache-references
         45,678,901      cache-misses     #   13.2% of all cache refs

       0.876543210 seconds time elapsed

The optimized version shows:

  • 3.7x reduction in execution time
  • 3.6x improvement in instructions per cycle
  • 3.4x reduction in cache miss rate

Diagram: Performance Optimization Workflow

+------------------+     +------------------+     +------------------+
| Identify         |     | Profile code     |     | Analyze          |
| performance      |---->| with gprof/perf  |---->| bottlenecks      |
| requirements     |     |                  |     |                  |
+------------------+     +------------------+     +------------------+
                                                           |
                                                           v
+------------------+     +------------------+     +------------------+
| Iterate and      |     | Measure          |     | Apply targeted   |
| refine           |<----| performance      |<----| optimizations    |
| optimizations    |     | improvement      |     |                  |
+------------------+     +------------------+     +------------------+

Diagram: Memory Access Patterns in Matrix Multiplication

Original Matrix Multiplication:

Matrix A (m×n)           Matrix B (n×p)           Result (m×p)
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B | B | B | B |        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B | B | B | B |        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B | B | B | B |        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+

Memory Access Pattern:
- Row i of A: Sequential access (good locality)
- Column j of B: Strided access (poor locality)

Transposed Matrix Multiplication:

Matrix A (m×n)           Matrix B^T (p×n)         Result (m×p)
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B'| B'| B'| B'|        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B'| B'| B'| B'|        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+
| A | A | A | A |        | B'| B'| B'| B'|        | C | C | C | C |
+---+---+---+---+        +---+---+---+---+        +---+---+---+---+

Memory Access Pattern:
- Row i of A: Sequential access (good locality)
- Row j of B^T: Sequential access (good locality)

Diagram: Cache Behavior with Tiling

Without Tiling:                With Tiling:

+------------------+           +--------+---------+
|                  |           | Tile 1 | Tile 2  |
|                  |           +--------+---------+
|     Matrix       |           | Tile 3 | Tile 4  |
|                  |           +--------+---------+
|                  |           | Tile 5 | Tile 6  |
+------------------+           +--------+---------+

Cache Behavior:                Cache Behavior:
- Entire rows/columns          - Small tiles fit in cache
  exceed cache size            - Process one tile completely
- High cache miss rate         - Move to next tile
                               - Lower cache miss rate

Common Pitfalls and Solutions

Pitfall 1: Premature Optimization

// Prematurely optimized code (hard to read, maintain)
void tensor_add_premature(Tensor *r, const Tensor *a, const Tensor *b) {
    float *rd = r->data, *ad = a->data, *bd = b->data;
    int s = a->dims[0] * a->dims[1], i = 0;
    for (; i < s - 7; i += 8) {
        rd[i] = ad[i] + bd[i];
        rd[i+1] = ad[i+1] + bd[i+1];
        rd[i+2] = ad[i+2] + bd[i+2];
        rd[i+3] = ad[i+3] + bd[i+3];
        rd[i+4] = ad[i+4] + bd[i+4];
        rd[i+5] = ad[i+5] + bd[i+5];
        rd[i+6] = ad[i+6] + bd[i+6];
        rd[i+7] = ad[i+7] + bd[i+7];
    }
    for (; i < s; i++) rd[i] = ad[i] + bd[i];
}

// Solution: Profile first, then optimize critical sections
void tensor_add(Tensor *result, const Tensor *a, const Tensor *b) {
    // Clear, readable implementation
    int size = a->dims[0] * a->dims[1];
    
    for (int i = 0; i < size; i++) {
        result->data[i] = a->data[i] + b->data[i];
    }
}

// Only after profiling, optimize if needed
void tensor_add_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
    int size = a->dims[0] * a->dims[1];
    float *result_data = result->data;
    float *a_data = a->data;
    float *b_data = b->data;
    
    #pragma omp parallel for if(size > 10000)
    for (int i = 0; i < size; i++) {
        result_data[i] = a_data[i] + b_data[i];
    }
}

Pitfall 2: Ignoring Memory Alignment

// Unaligned memory access can be slower
Tensor* create_tensor_unaligned(int rows, int cols) {
    Tensor *t = malloc(sizeof(Tensor));
    t->dims[0] = rows;
    t->dims[1] = cols;
    t->data = malloc(rows * cols * sizeof(float));
    return t;
}

// Solution: Align memory for better performance
Tensor* create_tensor_aligned(int rows, int cols) {
    Tensor *t = malloc(sizeof(Tensor));
    t->dims[0] = rows;
    t->dims[1] = cols;
    
    // Align to 32-byte boundary for AVX operations
    #ifdef _WIN32
    t->data = _aligned_malloc(rows * cols * sizeof(float), 32);
    #else
    posix_memalign((void**)&t->data, 32, rows * cols * sizeof(float));
    #endif
    
    return t;
}

// Don't forget to use aligned free
void free_tensor_aligned(Tensor *t) {
    if (!t) return;
    
    #ifdef _WIN32
    _aligned_free(t->data);
    #else
    free(t->data);
    #endif
    
    free(t);
}
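
If you can assume a C11 toolchain, aligned_alloc is a third, portable option. The helper below is a sketch (the function name is ours); C11 requires the requested size to be a multiple of the alignment, hence the rounding, and the memory is released with plain free():

#include <stdlib.h>

// Allocate 'count' floats on a 32-byte boundary using C11 aligned_alloc
float* alloc_floats_aligned(size_t count) {
    size_t bytes = count * sizeof(float);
    size_t rounded = (bytes + 31) & ~(size_t)31;  // round up to a multiple of 32
    return aligned_alloc(32, rounded);            // release with the normal free()
}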

Pitfall 3: False Sharing in Parallel Code

// False sharing can degrade parallel performance
void parallel_tensor_reduction_naive(Tensor *input, float *result) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    float *sums = calloc(omp_get_max_threads(), sizeof(float));
    
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        sums[thread_id] = 0.0f;  // Adjacent memory locations cause false sharing
        
        #pragma omp for
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                sums[thread_id] += input->data[i * cols + j];
            }
        }
    }
    
    // Combine results
    *result = 0.0f;
    for (int i = 0; i < omp_get_max_threads(); i++) {
        *result += sums[i];
    }
    
    free(sums);
}

// Solution: Pad to avoid false sharing
typedef struct {
    float value;
    char padding[64 - sizeof(float)];  // Pad to cache line size
} PaddedFloat;

void parallel_tensor_reduction_padded(Tensor *input, float *result) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    PaddedFloat *sums = calloc(omp_get_max_threads(), sizeof(PaddedFloat));
    
    #pragma omp parallel
    {
        int thread_id = omp_get_thread_num();
        sums[thread_id].value = 0.0f;  // No false sharing due to padding
        
        #pragma omp for
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < cols; j++) {
                sums[thread_id].value += input->data[i * cols + j];
            }
        }
    }
    
    // Combine results
    *result = 0.0f;
    for (int i = 0; i < omp_get_max_threads(); i++) {
        *result += sums[i].value;
    }
    
    free(sums);
}
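
For a simple sum like this, OpenMP’s reduction clause sidesteps the problem entirely: each thread gets its own private accumulator and the runtime combines them, so there is no shared array to false-share. A minimal version (compile with -fopenmp):

// Sum all elements using OpenMP's built-in reduction support
float tensor_sum_omp_reduction(const Tensor *input) {
    int size = input->dims[0] * input->dims[1];
    float sum = 0.0f;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < size; i++) {
        sum += input->data[i];
    }
    return sum;
}

The padded-accumulator pattern is still worth knowing for reductions the clause cannot express directly, such as tracking the index of a maximum alongside its value.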

Practical Exercises

Exercise 1: Profile and Optimize a Tensor Convolution

Implement and optimize a 2D convolution operation for tensors:

  1. Start with a naive implementation
  2. Profile it using perf or gprof
  3. Apply at least two optimization techniques
  4. Measure and report the performance improvement

Partial Solution:

// Naive 2D convolution implementation
void tensor_conv2d_naive(Tensor *result, const Tensor *input, const Tensor *kernel) {
    int in_rows = input->dims[0];
    int in_cols = input->dims[1];
    int k_rows = kernel->dims[0];
    int k_cols = kernel->dims[1];
    int out_rows = in_rows - k_rows + 1;
    int out_cols = in_cols - k_cols + 1;
    
    // For each output position
    for (int i = 0; i < out_rows; i++) {
        for (int j = 0; j < out_cols; j++) {
            float sum = 0.0f;
            
            // Apply kernel
            for (int ki = 0; ki < k_rows; ki++) {
                for (int kj = 0; kj < k_cols; kj++) {
                    sum += input->data[(i + ki) * in_cols + (j + kj)] *
                           kernel->data[ki * k_cols + kj];
                }
            }
            
            result->data[i * out_cols + j] = sum;
        }
    }
}

// TODO: Implement optimized version with loop tiling and unrolling

Exercise 2: Implement Cache-Friendly Tensor Transposition

Implement an efficient tensor transposition algorithm that minimizes cache misses:

  1. Start with a naive implementation
  2. Use Cachegrind to analyze cache behavior
  3. Implement a cache-friendly version using blocking
  4. Compare performance and cache miss rates

Partial Solution:

// Naive transposition (poor cache behavior)
void tensor_transpose_naive(Tensor *result, const Tensor *input) {
    int rows = input->dims[0];
    int cols = input->dims[1];
    
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            result->data[j * rows + i] = input->data[i * cols + j];
        }
    }
}

// TODO: Implement blocked transposition for better cache behavior

Exercise 3: Optimize a Tensor Reduction Operation

Implement and optimize a parallel reduction operation that computes the sum of all elements in a tensor:

  1. Start with a sequential implementation
  2. Implement a parallel version using OpenMP
  3. Identify and fix any false sharing issues
  4. Measure scalability with different numbers of threads

Partial Solution:

// Sequential reduction
float tensor_sum_sequential(const Tensor *input) {
    int size = input->dims[0] * input->dims[1];
    float sum = 0.0f;
    
    for (int i = 0; i < size; i++) {
        sum += input->data[i];
    }
    
    return sum;
}

// TODO: Implement parallel reduction with false sharing protection

Summary

In this chapter, we’ve explored the art and science of profiling and optimizing tensor operations in C:

  • Profiling Tools: We learned how to use gprof, perf, and Cachegrind to identify performance bottlenecks in tensor code.

  • Memory-Bound Optimizations: We applied techniques like loop tiling, data layout transformation, and loop interchange to improve cache utilization for memory-bound operations.

  • Compute-Bound Optimizations: We explored fast approximations and strength reduction to accelerate compute-bound operations.

  • Compiler Optimizations: We leveraged compiler flags and directives to enable automatic optimizations.

  • Measurement: We implemented benchmarking functions to quantify performance improvements and ensure our optimizations are effective.

Remember that optimization is an iterative process. Always start with a clear, correct implementation, profile to identify bottlenecks, apply targeted optimizations to critical sections, and measure the results. Repeat this process until you meet your performance requirements.

The most important lesson is to focus your optimization efforts where they’ll have the most impact. As Donald Knuth’s oft-quoted adage goes: “Premature optimization is the root of all evil.” Profile first, then optimize.

Further Reading

  1. “What Every Programmer Should Know About Memory” by Ulrich Drepper: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

  2. “Performance Analysis and Tuning on Modern CPUs” by Denis Bakhvalov: https://easyperf.net/blog/

  3. “Linux perf Examples” by Brendan Gregg: http://www.brendangregg.com/perf.html