Chapter 3: Measuring Performance (and Why Assembly Isn't Enough)
Performance measurement is both an art and a science. While understanding assembly language is crucial, it’s only one piece of the performance optimization puzzle. Modern systems are complex, with multiple layers of abstraction, caching, and parallel execution that make simple instruction counting insufficient for real-world performance analysis.
The Importance of Benchmarking
Benchmarking is the foundation of performance analysis. It provides objective data about how code performs under specific conditions. However, creating meaningful benchmarks is more challenging than it might appear at first glance.
Common Benchmarking Pitfalls
- Microbenchmarking Fallacies
  - Testing in isolation ignores system interactions
  - Cache effects can dominate small test cases
  - Compiler optimizations may eliminate test code
  - Branch prediction can skew results (see the sketch after this list)
- The “Hot Cache” Problem
// Bad benchmark - only measures hot cache performance
void benchmark() {
    for (int i = 0; i < 1000; i++) {
        measure_function();
    }
}

// Better benchmark - includes cold cache scenarios
void better_benchmark() {
    for (int i = 0; i < 1000; i++) {
        clear_cache();  // Simulate a cold cache, e.g. by walking a buffer
                        // larger than the last-level cache
        measure_function();
    }
}
- Compile-Time Optimization
// Bad benchmark - compiler might optimize away
int sum = 0;
for (int i = 0; i < 1000; i++) {
    sum += i;
}

// Better benchmark - prevent optimization
volatile int sum = 0;
for (int i = 0; i < 1000; i++) {
    sum += i;
}
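The branch-prediction point from the fallacies list above deserves its own illustration. The sketch below is my own minimal example (the function name sum_large and the sizes are invented for illustration): it times the same filtering loop over unsorted and sorted copies of the same data. A benchmark that only ever feeds the loop predictable input will report numbers that realistic input never reproduces. Depending on your compiler and flags, the branch may be converted to a conditional move, which flattens the difference; that, too, is a reason to measure rather than assume.

// Sketch: the same loop, fed unpredictable vs predictable input.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>

// Sum only the "large" elements; the if() is the branch being predicted.
long long sum_large(const std::vector<int>& v) {
    long long sum = 0;
    for (int x : v) {
        if (x > 128) {
            sum += x;
        }
    }
    return sum;
}

int main() {
    std::vector<int> data(1 << 24);
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> dist(0, 255);
    for (int& x : data) x = dist(rng);

    std::vector<int> sorted_data = data;
    std::sort(sorted_data.begin(), sorted_data.end());

    auto time_it = [](const std::vector<int>& v) {
        auto t0 = std::chrono::steady_clock::now();
        long long s = sum_large(v);
        auto t1 = std::chrono::steady_clock::now();
        std::printf("sum=%lld  %.2f ms\n", s,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    std::printf("unsorted (hard to predict): ");
    time_it(data);
    std::printf("sorted (easy to predict):   ");
    time_it(sorted_data);
    return 0;
}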
Why Assembly Line Counting Fails
Counting assembly instructions is a common but flawed approach to performance analysis. Here’s why:
- Modern Processor Architecture
  - Superscalar execution
  - Out-of-order processing
  - Branch prediction
  - Cache hierarchies
  - Memory bandwidth limitations
- Example: The Memory Wall
// Two versions of array summation
int sum_array_v1(int* array, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += array[i];
    }
    return sum;
}

// Unrolled by two (assumes size is even)
int sum_array_v2(int* array, int size) {
    int sum1 = 0, sum2 = 0;
    for (int i = 0; i < size; i += 2) {
        sum1 += array[i];
        sum2 += array[i + 1];
    }
    return sum1 + sum2;
}
While version 2 compiles to more assembly instructions, it is often faster in practice (the timing sketch after this list shows one way to check on your own machine) because of:
- Better cache utilization
- Instruction-level parallelism
- Reduced loop overhead
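A rough way to test this claim is simply to time both versions on the same data. The sketch below repeats the two functions so it compiles on its own; the array size is arbitrary, and the results vary widely by compiler and optimization level (at high optimization levels both loops may be vectorized and the difference disappears). Treat it as a probe, not a verdict; a serious comparison needs repeated runs and statistics, which the harnesses in the next section provide.

// Quick-and-dirty timing of the two summation versions above.
#include <chrono>
#include <cstdio>
#include <vector>

int sum_array_v1(int* array, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++) sum += array[i];
    return sum;
}

int sum_array_v2(int* array, int size) {   // assumes size is even
    int sum1 = 0, sum2 = 0;
    for (int i = 0; i < size; i += 2) {
        sum1 += array[i];
        sum2 += array[i + 1];
    }
    return sum1 + sum2;
}

int main() {
    std::vector<int> data(1 << 24, 1);     // ~16M ints, larger than typical caches

    auto time_it = [&](int (*f)(int*, int), const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        int s = f(data.data(), (int)data.size());
        auto t1 = std::chrono::steady_clock::now();
        std::printf("%s: sum=%d  %.2f ms\n", name, s,
                    std::chrono::duration<double, std::milli>(t1 - t0).count());
    };

    time_it(sum_array_v1, "v1");
    time_it(sum_array_v2, "v2");
    return 0;
}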
Recommended Benchmarking Tools
1. Microbenchmarking Tools
Google Benchmark
#include <benchmark/benchmark.h>
#include <string>

static void BM_StringCreation(benchmark::State& state) {
  for (auto _ : state) {
    std::string empty_string;
    benchmark::DoNotOptimize(empty_string);  // keep the compiler from removing it
  }
}
BENCHMARK(BM_StringCreation);

static void BM_StringCopy(benchmark::State& state) {
  std::string x = "hello";
  for (auto _ : state) {
    std::string copy(x);
    benchmark::DoNotOptimize(copy);
  }
}
BENCHMARK(BM_StringCopy);

BENCHMARK_MAIN();
Criterion (Rust)
use criterion::{black_box, criterion_group, criterion_main, Criterion};

fn fibonacci(n: u64) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        n => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("fib 20", |b| b.iter(|| fibonacci(black_box(20))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
2. Profiling Tools
Linux Perf
# Basic CPU profiling
perf record -g ./your_program
perf report
# Cache profiling
perf stat -e cache-misses,cache-references ./your_program
# Branch prediction profiling
perf stat -e branch-misses,branch-instructions ./your_program
Intel VTune
# Basic hotspot analysis
vtune -collect hotspots ./your_program
# Memory access analysis
vtune -collect memory-access ./your_program
# Threading analysis
vtune -collect threading ./your_program
3. System Monitoring Tools
Prometheus + Grafana
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'application'
    static_configs:
      # Scrape the metrics endpoint exposed by the exporter below (port 8000)
      - targets: ['localhost:8000']
Custom Metrics Collection
from prometheus_client import start_http_server, Counter
import time

REQUEST_COUNT = Counter('request_count', 'Total request count')
# A counter of total seconds spent; dividing its rate by REQUEST_COUNT's rate
# yields average request latency.
REQUEST_LATENCY = Counter('request_latency_seconds', 'Total request latency in seconds')

def process_request():
    start_time = time.time()
    # Process request (application work goes here)
    REQUEST_COUNT.inc()
    REQUEST_LATENCY.inc(time.time() - start_time)

if __name__ == '__main__':
    start_http_server(8000)
    while True:
        process_request()
Performance Analysis Techniques
1. Statistical Analysis
Understanding performance requires statistical rigor:
import numpy as np
from scipy import stats

def detect_outliers(samples):
    # Simple IQR rule: values more than 1.5 * IQR outside the quartiles
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    return [x for x in samples if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

def analyze_performance(samples):
    mean = np.mean(samples)
    std = np.std(samples, ddof=1)  # sample standard deviation
    ci = stats.t.interval(0.95, len(samples) - 1,
                          loc=mean, scale=std / np.sqrt(len(samples)))
    return {
        'mean': mean,
        'std': std,
        'ci_95': ci,
        'outliers': detect_outliers(samples)
    }
2. Performance Counters
Modern processors provide detailed performance counters:
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

long perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                     int cpu, int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, hw_event, pid, cpu,
                   group_fd, flags);
}

void measure_cache_misses() {
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;             /* generic hardware event */
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CACHE_MISSES;   /* the "cache-misses" counter */
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    pe.exclude_hv = 1;

    int fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening performance counter\n");
        return;
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // Run your code here

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long count;
    read(fd, &count, sizeof(long long));
    printf("Cache misses: %lld\n", count);

    close(fd);
}
3. Memory Access Patterns
Understanding memory access patterns is crucial:
// process_element() stands in for whatever per-element work you do
typedef struct Node {
    int data;
    struct Node* next;
} Node;

// Good memory access pattern: sequential, prefetch-friendly
void process_array(int* array, int size) {
    for (int i = 0; i < size; i++) {
        array[i] = process_element(array[i]);
    }
}

// Bad memory access pattern: pointer chasing over scattered nodes
void process_linked_list(Node* head) {
    while (head) {
        head->data = process_element(head->data);
        head = head->next;
    }
}
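To put a rough number on the difference, the sketch below (my own, with arbitrary sizes) sums the same values once through a contiguous array and once through a linked list whose nodes are deliberately linked in shuffled order, so each hop lands somewhere unpredictable in memory. How large the gap is depends on how the working set compares with your cache sizes.

// Rough sketch (not a rigorous benchmark): contiguous vs pointer-chasing traversal.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

struct Node { int data; Node* next; };

int main() {
    const int N = 1 << 22;  // ~4M elements; large enough to exceed typical caches
    std::vector<int> array(N, 1);

    // Build a linked list over the same values, linked in shuffled order.
    std::vector<Node> nodes(N);
    std::vector<int> order(N);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937{42});
    for (int i = 0; i < N; i++) {
        nodes[order[i]].data = 1;
        nodes[order[i]].next = (i + 1 < N) ? &nodes[order[i + 1]] : nullptr;
    }
    Node* head = &nodes[order[0]];

    auto t0 = std::chrono::steady_clock::now();
    long long sum1 = 0;
    for (int i = 0; i < N; i++) sum1 += array[i];
    auto t1 = std::chrono::steady_clock::now();

    long long sum2 = 0;
    for (Node* p = head; p; p = p->next) sum2 += p->data;
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("array:       sum=%lld  %.2f ms\n", sum1, ms(t1 - t0).count());
    std::printf("linked list: sum=%lld  %.2f ms\n", sum2, ms(t2 - t1).count());
    return 0;
}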
Real-World Performance Analysis
Case Study: Database Query Optimization
Consider a simple database query:
SELECT * FROM users WHERE age > 30 AND country = 'USA';
The performance characteristics of even this simple query depend on several factors (the sketch after this list demonstrates the first one):
- Index availability
- Data distribution
- Memory pressure
- Disk I/O patterns
- Cache utilization
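To see how much index availability alone can matter, the sketch below uses SQLite as a stand-in engine; the table layout, row count, and the idx_country_age index name are invented for illustration, and it times a COUNT(*) form of the case-study query before and after adding an index so that result-set transfer does not dominate. Absolute numbers depend on the engine, data distribution, and memory pressure; the point is only that identical SQL can have very different costs. Build with something like g++ demo.cc -lsqlite3.

// Time the case-study query with and without an index (SQLite, in memory).
#include <sqlite3.h>
#include <chrono>
#include <cstdio>

static double time_query(sqlite3* db, const char* sql) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, sql, -1, &stmt, nullptr);
    auto t0 = std::chrono::steady_clock::now();
    while (sqlite3_step(stmt) == SQLITE_ROW) { /* drain result rows */ }
    auto t1 = std::chrono::steady_clock::now();
    sqlite3_finalize(stmt);
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main() {
    sqlite3* db = nullptr;
    sqlite3_open(":memory:", &db);
    sqlite3_exec(db, "CREATE TABLE users(id INTEGER, age INTEGER, country TEXT)",
                 nullptr, nullptr, nullptr);

    // Populate with synthetic rows inside one transaction.
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO users VALUES(?, ?, ?)", -1, &ins, nullptr);
    const char* countries[] = {"USA", "CAN", "MEX", "GBR", "DEU"};
    for (int i = 0; i < 1000000; i++) {
        sqlite3_bind_int(ins, 1, i);
        sqlite3_bind_int(ins, 2, 18 + i % 60);
        sqlite3_bind_text(ins, 3, countries[i % 5], -1, SQLITE_STATIC);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

    const char* q = "SELECT COUNT(*) FROM users WHERE age > 30 AND country = 'USA'";
    std::printf("full scan:  %.2f ms\n", time_query(db, q));

    sqlite3_exec(db, "CREATE INDEX idx_country_age ON users(country, age)",
                 nullptr, nullptr, nullptr);
    std::printf("with index: %.2f ms\n", time_query(db, q));

    sqlite3_close(db);
    return 0;
}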
Case Study: Web Server Performance
A web server’s performance depends on multiple factors:
from flask import Flask, jsonify
import time

app = Flask(__name__)

# db, process_data, and log_performance are placeholders for your own
# database layer, business logic, and metrics pipeline.

@app.route('/api/data')
def get_data():
    start_time = time.time()

    # Database query
    db_time = time.time()
    data = db.query()
    db_duration = time.time() - db_time

    # Processing
    process_time = time.time()
    result = process_data(data)
    process_duration = time.time() - process_time

    # Response serialization
    response_time = time.time()
    response = jsonify(result)
    response_duration = time.time() - response_time

    total_duration = time.time() - start_time

    # Log per-stage performance metrics
    log_performance({
        'db_duration': db_duration,
        'process_duration': process_duration,
        'response_duration': response_duration,
        'total_duration': total_duration
    })

    return response
Best Practices for Performance Measurement
- Establish Baselines
  - Measure before optimization
  - Document system configuration
  - Record environmental factors
- Use Multiple Metrics (a sketch that collects several of these at once follows this list)
  - CPU time
  - Memory usage
  - Cache behavior
  - I/O operations
  - Network latency
- Consider the Full Stack
  - Application code
  - Runtime environment
  - Operating system
  - Hardware
  - Network infrastructure
- Document Everything
  - Test conditions
  - System configuration
  - Compiler flags
  - Runtime parameters
  - Environmental factors
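The "use multiple metrics" item is easy to act on even without external tools. The sketch below is a Linux-oriented example of my own: it wraps a stand-in workload (touching 256 MB of memory, chosen arbitrarily) and reports wall time, user and system CPU time, peak resident memory, and fault counts in one shot. Field semantics of getrusage are platform-specific; on Linux, ru_maxrss is in kilobytes.

// Collect several metrics around one workload using getrusage(2) and std::chrono.
#include <sys/resource.h>
#include <sys/time.h>
#include <chrono>
#include <cstdio>
#include <vector>

static double tv_to_sec(const timeval& tv) {
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main() {
    auto t0 = std::chrono::steady_clock::now();

    // Workload stand-in: touch a few hundred MB of memory.
    std::vector<char> buf(256 * 1024 * 1024, 1);
    long long sum = 0;
    for (char c : buf) sum += c;

    auto t1 = std::chrono::steady_clock::now();

    rusage ru{};
    getrusage(RUSAGE_SELF, &ru);

    std::printf("wall time:       %.3f s\n",
                std::chrono::duration<double>(t1 - t0).count());
    std::printf("user CPU time:   %.3f s\n", tv_to_sec(ru.ru_utime));
    std::printf("system CPU time: %.3f s\n", tv_to_sec(ru.ru_stime));
    std::printf("max RSS:         %ld kB\n", ru.ru_maxrss);
    std::printf("page faults:     %ld minor, %ld major\n",
                ru.ru_minflt, ru.ru_majflt);
    std::printf("(checksum %lld)\n", sum);  // keep the workload from being optimized away
    return 0;
}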
Summary
Performance measurement requires a holistic approach that goes beyond simple instruction counting. Modern systems are complex, with multiple layers of abstraction and optimization. Effective performance analysis requires:
- Understanding the full system stack
- Using appropriate benchmarking tools
- Applying statistical rigor
- Considering real-world usage patterns
- Documenting and analyzing results systematically
Remember that performance optimization is an iterative process. Measure, analyze, optimize, and repeat. Each iteration should be guided by data and a deep understanding of both the code and the system it runs on.