Chapter 8: Profiling and Optimizing Hotspots
Identify and eliminate performance bottlenecks in your tensor code. Learn systematic approaches to profiling and optimization that yield substantial speedups.
You Will Learn To…
- Identify performance bottlenecks in tensor code using profiling tools
- Distinguish between memory-bound and compute-bound operations
- Apply targeted optimizations to critical code paths
- Measure and validate performance improvements
- Develop a systematic approach to performance tuning
Introduction
I once spent two weeks optimizing a tensor convolution operation that was the bottleneck in an image processing pipeline. The initial implementation was clean, readable, and painfully slow. After profiling, I discovered that 87% of execution time was spent in a single nested loop with poor cache utilization. A combination of loop tiling, data layout changes, and compiler directive tuning resulted in a 19x speedup—without touching 90% of the codebase.
That’s the reality of high-performance tensor programming: a small fraction of your code typically consumes the vast majority of execution time. This chapter is about finding those critical sections and systematically improving them.
Understanding Performance Bottlenecks
Before diving into profiling tools, let’s understand the two fundamental types of performance bottlenecks in tensor operations:
Compute-Bound Operations
Compute-bound operations are limited by the CPU’s processing power: data arrives faster than the arithmetic units can consume it, so the execution units, not the memory system, are the bottleneck. Examples include:
- Complex mathematical functions (exp, log, trigonometric functions)
- Dense matrix multiplication with small matrices that fit in cache
- Operations with high arithmetic intensity (many calculations per memory access)
Memory-Bound Operations
Memory-bound operations are limited by memory bandwidth or latency. The CPU spends time waiting for data to arrive from memory. Examples include:
- Large tensor operations that don’t fit in cache
- Operations with poor spatial locality (scattered memory access patterns)
- Operations with low arithmetic intensity (few calculations per memory access)
Most tensor operations in practice are memory-bound, especially for large tensors. This is why memory layout and access patterns are critical for performance.
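As a rough back-of-the-envelope check (assuming 4-byte floats and ignoring cache reuse), compare the arithmetic intensity of an element-wise addition with that of a matrix multiplication:
Element-wise add:  1 FLOP per element over ~12 bytes moved (two loads, one store)
                   → roughly 0.08 FLOP/byte: firmly memory-bound
n×n matmul:        ~2n³ FLOPs over ~12n² bytes of unique data (three n×n matrices)
                   → roughly n/6 FLOP/byte: potentially compute-bound for large n,
                     but only if the implementation actually reuses data from cache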
Profiling Tools for C Tensor Programs
Using gprof for Function-Level Profiling
GNU’s gprof provides a straightforward way to identify which functions consume the most time:
# Compile with profiling information
gcc -pg -g -O2 tensor_ops.c -o tensor_program -lm
# Run the program (creates gmon.out)
./tensor_program
# Generate and view the profile
gprof ./tensor_program gmon.out > profile.txt
less profile.txt
A typical gprof output looks like this:
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls ms/call ms/call name
68.32 0.41 0.41 1000 0.41 0.41 tensor_matmul
21.67 0.54 0.13 10000 0.01 0.01 tensor_add
6.67 0.58 0.04 1000 0.04 0.04 tensor_transpose
3.34 0.60 0.02 3000 0.01 0.01 create_tensor
This output tells us that tensor_matmul is consuming 68.32% of the execution time, making it the primary optimization target.
Using perf for Detailed CPU Analysis
Linux’s perf tool provides deeper insights, including cache misses and branch prediction failures:
# Compile with debug symbols but optimized code
gcc -g -O2 tensor_ops.c -o tensor_program -lm
# Run perf record
perf record -g ./tensor_program
# Analyze the results
perf report
A sample perf report might look like:
# Samples: 3K of event 'cycles'
# Event count (approx.): 2,847,558,934
#
# Overhead Command Symbol
# ........ ............... ..............................
#
64.32% tensor_program tensor_matmul
|--62.14%--tensor_matmul
| |--58.21%--0x4008f2
| |--3.93%--[...]
19.45% tensor_program tensor_add
8.12% tensor_program [kernel.kallsyms]
5.67% tensor_program tensor_transpose
For more detailed cache behavior, you can use specific events:
perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program
This might produce output like:
Performance counter stats for './tensor_program':
2,847,558,934 cycles
4,921,223,611 instructions # 1.73 insn per cycle
324,562,371 cache-references
128,945,221 cache-misses # 39.7% of all cache refs
1.023552562 seconds time elapsed
The high cache miss rate (39.7%) suggests this is a memory-bound application that could benefit from improved memory access patterns.
Using Cachegrind for Cache Simulation
Valgrind’s Cachegrind tool simulates the CPU cache to provide detailed information about cache behavior:
valgrind --tool=cachegrind ./tensor_program
cg_annotate cachegrind.out.12345   # the numeric suffix is the PID of the profiled run
This produces a detailed report of cache misses by function and line number:
--------------------------------------------------------------------------------
-- Auto-annotated source: tensor_ops.c
--------------------------------------------------------------------------------
Ir Dr D1mr DLmr Dw D1mw DLmw
. . . . . . . void tensor_matmul(Tensor *result, const Tensor *a, const Tensor *b) {
. . . . . . . int m = a->dims[0];
. . . . . . . int n = a->dims[1];
. . . . . . . int p = b->dims[1];
. . . . . . .
. . . . . . . for (int i = 0; i < m; i++) {
. . . . . . . for (int j = 0; j < p; j++) {
. . . . . . . float sum = 0.0f;
9,600,000 4,800,000 600,000 600,000 . . . for (int k = 0; k < n; k++) {
9,600,000 9,600,000 1,200,000 600,000 . . . sum += a->data[i * n + k] * b->data[k * p + j];
. . . . . . . }
. . . . 960,000 120,000 120,000 result->data[i * p + j] = sum;
. . . . . . . }
. . . . . . . }
. . . . . . . }
This shows that the inner loop of tensor_matmul has a high number of cache misses (D1mr and DLmr columns), particularly when accessing b->data[k * p + j], which has a strided access pattern.
Optimizing Memory-Bound Operations
Now that we’ve identified our bottlenecks, let’s apply targeted optimizations for memory-bound operations.
Loop Tiling (Blocking)
Loop tiling improves cache utilization by operating on small blocks of data that fit in cache:
// Original matrix multiplication (poor cache utilization)
void tensor_matmul_naive(Tensor *result, const Tensor *a, const Tensor *b) {
int m = a->dims[0];
int n = a->dims[1];
int p = b->dims[1];
for (int i = 0; i < m; i++) {
for (int j = 0; j < p; j++) {
float sum = 0.0f;
for (int k = 0; k < n; k++) {
sum += a->data[i * n + k] * b->data[k * p + j];
}
result->data[i * p + j] = sum;
}
}
}
// Tiled matrix multiplication (better cache utilization)
void tensor_matmul_tiled(Tensor *result, const Tensor *a, const Tensor *b) {
int m = a->dims[0];
int n = a->dims[1];
int p = b->dims[1];
// Clear result tensor
memset(result->data, 0, m * p * sizeof(float));
// Choose tile sizes based on cache size
// These values should be tuned for your specific hardware
const int TILE_M = 32;
const int TILE_N = 32;
const int TILE_P = 32;
for (int i0 = 0; i0 < m; i0 += TILE_M) {
int i_end = i0 + TILE_M < m ? i0 + TILE_M : m;
for (int j0 = 0; j0 < p; j0 += TILE_P) {
int j_end = j0 + TILE_P < p ? j0 + TILE_P : p;
for (int k0 = 0; k0 < n; k0 += TILE_N) {
int k_end = k0 + TILE_N < n ? k0 + TILE_N : n;
// Process a tile
for (int i = i0; i < i_end; i++) {
for (int j = j0; j < j_end; j++) {
float sum = result->data[i * p + j];
for (int k = k0; k < k_end; k++) {
sum += a->data[i * n + k] * b->data[k * p + j];
}
result->data[i * p + j] = sum;
}
}
}
}
}
}
The tiled version processes small blocks of the matrices at a time, improving cache locality. The tile sizes should be tuned based on your CPU’s cache size.
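As a starting point for that tuning (the cache sizes here are assumptions about typical hardware, not guarantees), aim for the three sub-blocks touched by the inner loops to fit in the L1 data cache together: with 32×32 float tiles that is 3 × 32 × 32 × 4 bytes ≈ 12 KB, which sits comfortably inside a typical 32 KB L1. A minimal, hypothetical helper for that sizing calculation:
#include <math.h>
#include <stddef.h>
// Hypothetical helper: choose a square tile side so that three float tiles
// (one each from A, B, and the result block) fit within the given cache
// budget. For a 32 KB L1 data cache this returns about 52; rounding down
// to a round number such as 32 or 48 leaves headroom for other data.
static int pick_tile_size(size_t l1_bytes) {
    size_t bytes_per_tile  = l1_bytes / 3;               // budget per tile
    size_t floats_per_tile = bytes_per_tile / sizeof(float);
    int side = (int)sqrt((double)floats_per_tile);       // square tile side
    return side < 8 ? 8 : side;                          // keep a sane minimum
}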
Data Layout Transformation
Changing the data layout can significantly improve memory access patterns:
// Original row-major matrix multiplication (poor cache locality for B)
void tensor_matmul_row_major(Tensor *result, const Tensor *a, const Tensor *b) {
int m = a->dims[0];
int n = a->dims[1];
int p = b->dims[1];
for (int i = 0; i < m; i++) {
for (int j = 0; j < p; j++) {
float sum = 0.0f;
for (int k = 0; k < n; k++) {
// a->data[i * n + k] has good spatial locality
// b->data[k * p + j] has poor spatial locality (strided access)
sum += a->data[i * n + k] * b->data[k * p + j];
}
result->data[i * p + j] = sum;
}
}
}
// Transposed matrix multiplication (better cache locality)
void tensor_matmul_transposed(Tensor *result, const Tensor *a, const Tensor *b) {
int m = a->dims[0];
int n = a->dims[1];
int p = b->dims[1];
// Create a transposed copy of B for better cache locality
Tensor *b_trans = create_tensor(p, n);
for (int i = 0; i < n; i++) {
for (int j = 0; j < p; j++) {
b_trans->data[j * n + i] = b->data[i * p + j];
}
}
// Perform multiplication with transposed B
for (int i = 0; i < m; i++) {
for (int j = 0; j < p; j++) {
float sum = 0.0f;
for (int k = 0; k < n; k++) {
// Both accesses now have good spatial locality
sum += a->data[i * n + k] * b_trans->data[j * n + k];
}
result->data[i * p + j] = sum;
}
}
free_tensor(b_trans);
}
By transposing matrix B, we convert the strided access pattern into a contiguous one, significantly improving cache utilization. The one-off transpose costs O(n·p) extra time and memory, which is quickly amortized over the O(m·n·p) multiply-accumulate operations in the main loops.
Loop Interchange
Changing the order of nested loops can improve memory access patterns:
// Original loop order (poor spatial locality)
void tensor_transpose_naive(Tensor *result, const Tensor *input) {
int rows = input->dims[0];
int cols = input->dims[1];
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
// Strided writes to result
result->data[j * rows + i] = input->data[i * cols + j];
}
}
}
// Interchanged loop order (better spatial locality)
void tensor_transpose_interchanged(Tensor *result, const Tensor *input) {
int rows = input->dims[0];
int cols = input->dims[1];
for (int j = 0; j < cols; j++) {
for (int i = 0; i < rows; i++) {
// Sequential writes to result
result->data[j * rows + i] = input->data[i * cols + j];
}
}
}
Interchanging the loops makes the writes to result sequential, at the cost of making the reads from input strided. For a transpose, one side is always strided; which loop order wins depends on the hardware, and the robust fix is to process the matrix in cache-sized blocks, as Exercise 2 explores.
Loop Unrolling
Unrolling loops reduces branch mispredictions and allows for better instruction-level parallelism:
// Original loop
void tensor_scale_naive(Tensor *result, const Tensor *input, float scale) {
int size = input->dims[0] * input->dims[1];
for (int i = 0; i < size; i++) {
result->data[i] = input->data[i] * scale;
}
}
// Unrolled loop (4x)
void tensor_scale_unrolled(Tensor *result, const Tensor *input, float scale) {
int size = input->dims[0] * input->dims[1];
int i = 0;
// Process 4 elements at a time
for (; i < size - 3; i += 4) {
result->data[i] = input->data[i] * scale;
result->data[i+1] = input->data[i+1] * scale;
result->data[i+2] = input->data[i+2] * scale;
result->data[i+3] = input->data[i+3] * scale;
}
// Handle remaining elements
for (; i < size; i++) {
result->data[i] = input->data[i] * scale;
}
}
Loop unrolling reduces branch overhead and allows the compiler to better schedule instructions.
Optimizing Compute-Bound Operations
For compute-bound operations, focus on reducing the number of calculations or using more efficient algorithms.
Fast Approximations
For mathematical functions like exp, log, or trigonometric functions, consider using faster approximations:
// Original implementation using standard math functions
void tensor_sigmoid_naive(Tensor *result, const Tensor *input) {
int size = input->dims[0] * input->dims[1];
for (int i = 0; i < size; i++) {
// exp() is computationally expensive
result->data[i] = 1.0f / (1.0f + expf(-input->data[i]));
}
}
// Fast approximation of sigmoid function
float fast_sigmoid(float x) {
// Piecewise approximation of sigmoid
if (x < -5.0f) return 0.0f;
if (x > 5.0f) return 1.0f;
// Approximation in the range [-5, 5]
float x2 = x * x;
float e = 1.0f + fabsf(x) + x2 * 0.555f + x2 * x2 * 0.143f;
return (x > 0.0f) ? (e / (e + 1.0f)) : (1.0f / (e + 1.0f));
}
void tensor_sigmoid_fast(Tensor *result, const Tensor *input) {
int size = input->dims[0] * input->dims[1];
for (int i = 0; i < size; i++) {
result->data[i] = fast_sigmoid(input->data[i]);
}
}
The fast approximation sacrifices some accuracy for significant performance gains.
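Before committing to an approximation, measure how much accuracy you are giving up. A small test program (assuming the fast_sigmoid above is in scope) might look like:
#include <math.h>
#include <stdio.h>
// Scan a range of inputs and report the worst-case absolute error of
// fast_sigmoid against the reference expf-based sigmoid.
int main(void) {
    float max_err = 0.0f;
    for (float x = -8.0f; x <= 8.0f; x += 0.001f) {
        float reference = 1.0f / (1.0f + expf(-x));
        float err = fabsf(reference - fast_sigmoid(x));
        if (err > max_err) max_err = err;
    }
    printf("max abs error: %f\n", max_err);
    return 0;
}
If the worst-case error is acceptable for your application, the speedup is usually worth it; if not, stay with expf.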
Strength Reduction
Replacing expensive operations with cheaper ones can improve performance:
// Original implementation with division
void tensor_normalize_naive(Tensor *result, const Tensor *input) {
int rows = input->dims[0];
int cols = input->dims[1];
for (int i = 0; i < rows; i++) {
// Find maximum in this row
float max_val = 0.0f;
for (int j = 0; j < cols; j++) {
float val = fabsf(input->data[i * cols + j]);
if (val > max_val) max_val = val;
}
// Normalize row by maximum (division is expensive)
for (int j = 0; j < cols; j++) {
result->data[i * cols + j] = input->data[i * cols + j] / max_val;
}
}
}
// Optimized implementation with multiplication
void tensor_normalize_optimized(Tensor *result, const Tensor *input) {
int rows = input->dims[0];
int cols = input->dims[1];
for (int i = 0; i < rows; i++) {
// Find maximum in this row
float max_val = 0.0f;
for (int j = 0; j < cols; j++) {
float val = fabsf(input->data[i * cols + j]);
if (val > max_val) max_val = val;
}
// Precompute the reciprocal (one division instead of many);
// guard against an all-zero row to avoid dividing by zero
float recip_max = (max_val > 0.0f) ? (1.0f / max_val) : 0.0f;
// Normalize row by maximum (multiplication is cheaper than division)
for (int j = 0; j < cols; j++) {
result->data[i * cols + j] = input->data[i * cols + j] * recip_max;
}
}
}
By precomputing the reciprocal, we replace multiple divisions with a single division and multiple multiplications.
Compiler Optimizations
Modern compilers can perform many optimizations automatically, but they need guidance:
Using Compiler Flags
# Basic optimization
gcc -O2 tensor_ops.c -o tensor_program -lm
# Aggressive optimization
gcc -O3 -march=native -ffast-math tensor_ops.c -o tensor_program -lm
The -march=native flag enables CPU-specific optimizations, while -ffast-math relaxes IEEE floating-point rules for better performance (but potentially less accuracy).
Using Compiler Directives
Compiler directives can provide hints to the compiler:
void tensor_add_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
int size = a->dims[0] * a->dims[1];
// Tell the compiler these pointers don't alias (don't overlap)
float * __restrict__ result_data = result->data;
float * __restrict__ a_data = a->data;
float * __restrict__ b_data = b->data;
#pragma GCC ivdep // Tell GCC to ignore potential vector dependencies
for (int i = 0; i < size; i++) {
result_data[i] = a_data[i] + b_data[i];
}
}
The __restrict__ keyword tells the compiler that the pointers don’t alias, enabling more aggressive optimizations. The #pragma GCC ivdep directive tells GCC to ignore potential vector dependencies in the loop that follows.
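To verify that these hints actually resulted in vectorized code, GCC can report its vectorization decisions (supported in reasonably recent GCC versions):
# Report the loops GCC vectorized, and those it could not
gcc -O3 -march=native -fopt-info-vec-optimized tensor_ops.c -o tensor_program -lm
gcc -O3 -march=native -fopt-info-vec-missed tensor_ops.c -o tensor_program -lm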
Measuring Performance Improvements
Always measure the impact of your optimizations to ensure they’re actually helping:
#include <time.h>
double benchmark_function(void (*func)(Tensor*, const Tensor*, const Tensor*),
Tensor *result, const Tensor *a, const Tensor *b,
int iterations) {
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < iterations; i++) {
func(result, a, b);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double elapsed = (end.tv_sec - start.tv_sec) +
(end.tv_nsec - start.tv_nsec) / 1e9;
return elapsed / iterations; // Average time per call
}
// Usage example
int main() {
// Create test tensors
Tensor *a = create_tensor(1000, 1000);
Tensor *b = create_tensor(1000, 1000);
Tensor *result = create_tensor(1000, 1000);
// Initialize with random data
// ...
// Benchmark different implementations
double time_naive = benchmark_function(tensor_matmul_naive, result, a, b, 10);
double time_tiled = benchmark_function(tensor_matmul_tiled, result, a, b, 10);
double time_transposed = benchmark_function(tensor_matmul_transposed, result, a, b, 10);
printf("Naive implementation: %.6f seconds per call\n", time_naive);
printf("Tiled implementation: %.6f seconds per call\n", time_tiled);
printf("Transposed implementation: %.6f seconds per call\n", time_transposed);
printf("Speedup (tiled vs naive): %.2fx\n", time_naive / time_tiled);
printf("Speedup (transposed vs naive): %.2fx\n", time_naive / time_transposed);
// Clean up
free_tensor(a);
free_tensor(b);
free_tensor(result);
return 0;
}
This benchmark function measures the average execution time of different implementations, allowing you to compare their performance.
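Timing alone is not enough: a faster kernel that produces different numbers is a bug, not an optimization. A small helper (assuming math.h is included and the Tensor type from earlier chapters) lets you compare an optimized result against the naive reference:
// Return the largest element-wise absolute difference between two tensors
// of the same shape; useful for validating an optimized kernel against
// the naive reference implementation.
float max_abs_diff(const Tensor *x, const Tensor *y) {
    int size = x->dims[0] * x->dims[1];
    float max_diff = 0.0f;
    for (int i = 0; i < size; i++) {
        float d = fabsf(x->data[i] - y->data[i]);
        if (d > max_diff) max_diff = d;
    }
    return max_diff;
}
Run the naive and optimized versions into separate result tensors and check that the difference stays within a small tolerance; exact equality is too strict once -ffast-math or reordered summation is involved.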
Case Study: Optimizing Matrix Multiplication
Let’s walk through a complete optimization process for matrix multiplication:
Step 1: Profile the Initial Implementation
perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program
Output:
Performance counter stats for './tensor_program':
8,245,123,456 cycles
6,123,456,789 instructions # 0.74 insn per cycle
924,567,890 cache-references
412,345,678 cache-misses # 44.6% of all cache refs
3.245678901 seconds time elapsed
The high cache miss rate (44.6%) and low instructions per cycle (0.74) suggest this is memory-bound.
Step 2: Analyze Memory Access Patterns
Using Cachegrind:
valgrind --tool=cachegrind ./tensor_program
The output shows that the inner loop of matrix multiplication has poor spatial locality when accessing the second matrix.
Step 3: Apply Optimizations
- Transpose the second matrix for better cache locality
- Apply loop tiling to improve cache utilization
- Unroll the innermost loop to reduce branch overhead
void tensor_matmul_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
int m = a->dims[0];
int n = a->dims[1];
int p = b->dims[1];
// Create a transposed copy of B
Tensor *b_trans = create_tensor(p, n);
for (int i = 0; i < n; i++) {
for (int j = 0; j < p; j++) {
b_trans->data[j * n + i] = b->data[i * p + j];
}
}
// Clear result tensor
memset(result->data, 0, m * p * sizeof(float));
// Tiling parameters
const int TILE_M = 32;
const int TILE_N = 32;
const int TILE_P = 32;
// Apply tiling
for (int i0 = 0; i0 < m; i0 += TILE_M) {
int i_end = i0 + TILE_M < m ? i0 + TILE_M : m;
for (int j0 = 0; j0 < p; j0 += TILE_P) {
int j_end = j0 + TILE_P < p ? j0 + TILE_P : p;
for (int k0 = 0; k0 < n; k0 += TILE_N) {
int k_end = k0 + TILE_N < n ? k0 + TILE_N : n;
// Process a tile
for (int i = i0; i < i_end; i++) {
for (int j = j0; j < j_end; j++) {
float sum = result->data[i * p + j];
int k = k0;
// Unroll the innermost loop (4x)
for (; k < k_end - 3; k += 4) {
sum += a->data[i * n + k] * b_trans->data[j * n + k] +
a->data[i * n + k+1] * b_trans->data[j * n + k+1] +
a->data[i * n + k+2] * b_trans->data[j * n + k+2] +
a->data[i * n + k+3] * b_trans->data[j * n + k+3];
}
// Handle remaining elements
for (; k < k_end; k++) {
sum += a->data[i * n + k] * b_trans->data[j * n + k];
}
result->data[i * p + j] = sum;
}
}
}
}
}
free_tensor(b_trans);
}
Step 4: Measure Performance Improvement
perf stat -e cycles,instructions,cache-references,cache-misses ./tensor_program_optimized
Output:
Performance counter stats for './tensor_program_optimized':
2,145,678,901 cycles
5,678,901,234 instructions # 2.65 insn per cycle
345,678,901 cache-references
45,678,901 cache-misses # 13.2% of all cache refs
0.876543210 seconds time elapsed
The optimized version shows:
- 3.7x reduction in execution time
- 3.6x improvement in instructions per cycle
- 3.4x reduction in cache miss rate
Diagram: Performance Optimization Workflow
+----------------+ +-----------------+ +------------------+
| Identify | | Profile code | | Analyze |
| performance |---->| with gprof/perf |---->| bottlenecks |
| requirements | | | | |
+----------------+ +-----------------+ +------------------+
|
+----------------+ +-----------------+ +------------------+
| Iterate and | | Measure | | Apply targeted |
| refine |<----| performance |<----| optimizations |
| optimizations | | improvement | | |
+----------------+ +-----------------+ +------------------+
Diagram: Memory Access Patterns in Matrix Multiplication
Original Matrix Multiplication:
Matrix A (m×n) Matrix B (n×p) Result (m×p)
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B | B | B | B | | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B | B | B | B | | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B | B | B | B | | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Memory Access Pattern:
- Row i of A: Sequential access (good locality)
- Column j of B: Strided access (poor locality)
Transposed Matrix Multiplication:
Matrix A (m×n) Matrix B^T (p×n) Result (m×p)
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B'| B'| B'| B'| | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B'| B'| B'| B'| | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
| A | A | A | A | | B'| B'| B'| B'| | C | C | C | C |
+---+---+---+---+ +---+---+---+---+ +---+---+---+---+
Memory Access Pattern:
- Row i of A: Sequential access (good locality)
- Row j of B^T: Sequential access (good locality)
Diagram: Cache Behavior with Tiling
Without Tiling: With Tiling:
+------------------+ +------------------+
| | | Tile 1 | Tile 2 |
| | |--------|--------|
| Matrix | | Tile 3 | Tile 4 |
| | |--------|--------|
| | | Tile 5 | Tile 6 |
+------------------+ +------------------+
Cache Behavior: Cache Behavior:
- Entire rows/columns - Small tiles fit in cache
exceed cache size - Process one tile completely
- High cache miss rate - Move to next tile
- Lower cache miss rate
Common Pitfalls and Solutions
Pitfall 1: Premature Optimization
// Prematurely optimized code (hard to read, maintain)
void tensor_add_premature(Tensor *r, const Tensor *a, const Tensor *b) {
float *rd = r->data, *ad = a->data, *bd = b->data;
int s = a->dims[0] * a->dims[1], i = 0;
for (; i < s - 7; i += 8) {
rd[i] = ad[i] + bd[i];
rd[i+1] = ad[i+1] + bd[i+1];
rd[i+2] = ad[i+2] + bd[i+2];
rd[i+3] = ad[i+3] + bd[i+3];
rd[i+4] = ad[i+4] + bd[i+4];
rd[i+5] = ad[i+5] + bd[i+5];
rd[i+6] = ad[i+6] + bd[i+6];
rd[i+7] = ad[i+7] + bd[i+7];
}
for (; i < s; i++) rd[i] = ad[i] + bd[i];
}
// Solution: Profile first, then optimize critical sections
void tensor_add(Tensor *result, const Tensor *a, const Tensor *b) {
// Clear, readable implementation
int size = a->dims[0] * a->dims[1];
for (int i = 0; i < size; i++) {
result->data[i] = a->data[i] + b->data[i];
}
}
// Only after profiling, optimize if needed
void tensor_add_optimized(Tensor *result, const Tensor *a, const Tensor *b) {
int size = a->dims[0] * a->dims[1];
float *result_data = result->data;
float *a_data = a->data;
float *b_data = b->data;
#pragma omp parallel for if(size > 10000)
for (int i = 0; i < size; i++) {
result_data[i] = a_data[i] + b_data[i];
}
}
Pitfall 2: Ignoring Memory Alignment
// Unaligned memory access can be slower
Tensor* create_tensor_unaligned(int rows, int cols) {
Tensor *t = malloc(sizeof(Tensor));
t->dims[0] = rows;
t->dims[1] = cols;
t->data = malloc(rows * cols * sizeof(float));
return t;
}
// Solution: Align memory for better performance
Tensor* create_tensor_aligned(int rows, int cols) {
Tensor *t = malloc(sizeof(Tensor));
t->dims[0] = rows;
t->dims[1] = cols;
// Align to 32-byte boundary for AVX operations
#ifdef _WIN32
t->data = _aligned_malloc(rows * cols * sizeof(float), 32);
#else
posix_memalign((void**)&t->data, 32, rows * cols * sizeof(float));
#endif
return t;
}
// Don't forget to use aligned free
void free_tensor_aligned(Tensor *t) {
if (!t) return;
#ifdef _WIN32
_aligned_free(t->data);
#else
free(t->data);
#endif
free(t);
}
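Once the data is 32-byte aligned, GCC and Clang can also be told so explicitly, which lets them emit aligned vector loads and stores. A minimal sketch, assuming tensors created with create_tensor_aligned above:
// Element-wise scale that promises the compiler 32-byte alignment.
// __builtin_assume_aligned is a GCC/Clang extension; only use it on
// pointers that really are aligned, otherwise the behavior is undefined.
void tensor_scale_aligned(Tensor *result, const Tensor *input, float scale) {
    int size = input->dims[0] * input->dims[1];
    float *out = __builtin_assume_aligned(result->data, 32);
    const float *in = __builtin_assume_aligned(input->data, 32);
    for (int i = 0; i < size; i++) {
        out[i] = in[i] * scale;
    }
}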
Pitfall 3: False Sharing in Parallel Code
// False sharing can degrade parallel performance
void parallel_tensor_reduction_naive(Tensor *input, float *result) {
int rows = input->dims[0];
int cols = input->dims[1];
float *sums = calloc(omp_get_max_threads(), sizeof(float));
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
sums[thread_id] = 0.0f; // Adjacent memory locations cause false sharing
#pragma omp for
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
sums[thread_id] += input->data[i * cols + j];
}
}
}
// Combine results
*result = 0.0f;
for (int i = 0; i < omp_get_max_threads(); i++) {
*result += sums[i];
}
free(sums);
}
// Solution: Pad to avoid false sharing
typedef struct {
float value;
char padding[64 - sizeof(float)]; // Pad to cache line size
} PaddedFloat;
void parallel_tensor_reduction_padded(Tensor *input, float *result) {
int rows = input->dims[0];
int cols = input->dims[1];
PaddedFloat *sums = calloc(omp_get_max_threads(), sizeof(PaddedFloat));
#pragma omp parallel
{
int thread_id = omp_get_thread_num();
sums[thread_id].value = 0.0f; // No false sharing due to padding
#pragma omp for
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
sums[thread_id].value += input->data[i * cols + j];
}
}
}
// Combine results
*result = 0.0f;
for (int i = 0; i < omp_get_max_threads(); i++) {
*result += sums[i].value;
}
free(sums);
}
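Padding works, but for a plain sum OpenMP’s reduction clause sidesteps the problem entirely: each thread accumulates into a private copy and OpenMP combines them at the end, so there is no shared cache line to contend over. A minimal sketch (the function name is illustrative):
#include <omp.h>
float tensor_sum_parallel(const Tensor *input) {
    int size = input->dims[0] * input->dims[1];
    float sum = 0.0f;
    // Each thread gets a private sum; OpenMP combines them after the loop.
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < size; i++) {
        sum += input->data[i];
    }
    return sum;
}
Note that parallel summation reorders the floating-point additions, so the result can differ from the sequential sum in the last few bits.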
Practical Exercises
Exercise 1: Profile and Optimize a Tensor Convolution
Implement and optimize a 2D convolution operation for tensors:
- Start with a naive implementation
- Profile it using perf or gprof
- Apply at least two optimization techniques
- Measure and report the performance improvement
Partial Solution:
// Naive 2D convolution implementation
void tensor_conv2d_naive(Tensor *result, const Tensor *input, const Tensor *kernel) {
int in_rows = input->dims[0];
int in_cols = input->dims[1];
int k_rows = kernel->dims[0];
int k_cols = kernel->dims[1];
int out_rows = in_rows - k_rows + 1;
int out_cols = in_cols - k_cols + 1;
// For each output position
for (int i = 0; i < out_rows; i++) {
for (int j = 0; j < out_cols; j++) {
float sum = 0.0f;
// Apply kernel
for (int ki = 0; ki < k_rows; ki++) {
for (int kj = 0; kj < k_cols; kj++) {
sum += input->data[(i + ki) * in_cols + (j + kj)] *
kernel->data[ki * k_cols + kj];
}
}
result->data[i * out_cols + j] = sum;
}
}
}
// TODO: Implement optimized version with loop tiling and unrolling
Exercise 2: Implement Cache-Friendly Tensor Transposition
Implement an efficient tensor transposition algorithm that minimizes cache misses:
- Start with a naive implementation
- Use Cachegrind to analyze cache behavior
- Implement a cache-friendly version using blocking
- Compare performance and cache miss rates
Partial Solution:
// Naive transposition (poor cache behavior)
void tensor_transpose_naive(Tensor *result, const Tensor *input) {
int rows = input->dims[0];
int cols = input->dims[1];
for (int i = 0; i < rows; i++) {
for (int j = 0; j < cols; j++) {
result->data[j * rows + i] = input->data[i * cols + j];
}
}
}
// TODO: Implement blocked transposition for better cache behavior
Exercise 3: Optimize a Tensor Reduction Operation
Implement and optimize a parallel reduction operation that computes the sum of all elements in a tensor:
- Start with a sequential implementation
- Implement a parallel version using OpenMP
- Identify and fix any false sharing issues
- Measure scalability with different numbers of threads
Partial Solution:
// Sequential reduction
float tensor_sum_sequential(const Tensor *input) {
int size = input->dims[0] * input->dims[1];
float sum = 0.0f;
for (int i = 0; i < size; i++) {
sum += input->data[i];
}
return sum;
}
// TODO: Implement parallel reduction with false sharing protection
Summary
In this chapter, we’ve explored the art and science of profiling and optimizing tensor operations in C:
- Profiling Tools: We learned how to use gprof, perf, and Cachegrind to identify performance bottlenecks in tensor code.
- Memory-Bound Optimizations: We applied techniques like loop tiling, data layout transformation, and loop interchange to improve cache utilization for memory-bound operations.
- Compute-Bound Optimizations: We explored fast approximations and strength reduction to accelerate compute-bound operations.
- Compiler Optimizations: We leveraged compiler flags and directives to enable automatic optimizations.
- Measurement: We implemented benchmarking functions to quantify performance improvements and ensure our optimizations are effective.
Remember that optimization is an iterative process. Always start with a clear, correct implementation, profile to identify bottlenecks, apply targeted optimizations to critical sections, and measure the results. Repeat this process until you meet your performance requirements.
The most important lesson is to focus your optimization efforts where they’ll have the most impact. As the old engineering adage goes: “Premature optimization is the root of all evil.” Profile first, then optimize.
Further Reading
- “What Every Programmer Should Know About Memory” by Ulrich Drepper: https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
- “Performance Analysis and Tuning on Modern CPUs” by Denis Bakhvalov: https://easyperf.net/blog/
- “Linux perf Examples” by Brendan Gregg: http://www.brendangregg.com/perf.html