Chapter 10: Deploying Tensor Applications on Embedded Systems
Adapt your tensor code for resource-constrained environments. Learn specialized techniques for embedded systems where memory and processing power are limited.
You Will Learn To…
- Cross-compile tensor applications for ARM and other embedded architectures
- Implement fixed-point arithmetic for environments without FPU support
- Optimize tensor operations for memory-constrained devices
- Quantize floating-point models to integer precision
- Profile and optimize tensor code for real-time performance
- Design efficient inference pipelines for embedded applications
Introduction
I remember the first time I tried to deploy a tensor application on an embedded device. I had spent weeks optimizing my code on my development machine—a beefy workstation with 32GB of RAM and a modern CPU. Everything ran beautifully. Then I moved the code to a Raspberry Pi, and it slowed to a crawl. Memory usage exploded, and operations that took milliseconds on my workstation took seconds on the Pi.
That painful experience taught me that deploying tensor applications on embedded systems requires a fundamentally different approach than developing on desktop or server environments. You can’t just compile your code and expect it to work efficiently—you need to rethink your algorithms, data structures, and optimization strategies.
In this chapter, we’ll explore the techniques and tools for deploying tensor applications on embedded systems. We’ll cover cross-compilation, fixed-point arithmetic, quantization, and memory optimization. By the end, you’ll be able to take the tensor library we’ve built throughout this book and adapt it for efficient execution on resource-constrained devices.
Understanding Embedded Constraints
Before diving into implementation details, let’s understand the constraints we’re working with:
Hardware Limitations
Embedded systems typically have:
- Limited RAM (often measured in KB rather than GB)
- Slower CPUs with simpler architectures
- Limited or no floating-point hardware (FPU)
- Power constraints (battery-operated devices)
- Limited storage for code and data
Real-time Requirements
Many embedded applications have real-time constraints:
- Predictable execution time is often more important than average performance
- Garbage collection or memory allocation during critical sections is unacceptable
- Missed deadlines can have serious consequences (e.g., in control systems)
Development Challenges
- Cross-compilation complexity
- Limited debugging tools
- Platform-specific optimizations
- Testing on target hardware
With these constraints in mind, let’s start by setting up our cross-compilation environment.
Setting Up a Cross-Compilation Environment
Cross-compilation allows us to build code on our development machine that will run on a different target architecture. This is essential for embedded development since we typically can’t compile large codebases directly on the target device.
Installing a Cross-Compiler
For ARM-based systems (like Raspberry Pi, many microcontrollers, etc.), we’ll use the GNU ARM toolchain:
# On Ubuntu/Debian
sudo apt-get install gcc-arm-linux-gnueabihf g++-arm-linux-gnueabihf
# On macOS, Homebrew only packages the binutils; for a full GCC cross-toolchain,
# use a third-party tap or a prebuilt toolchain from Arm
brew install arm-linux-gnueabihf-binutils
For other architectures, you’ll need the appropriate toolchain. For example, for RISC-V:
# On Ubuntu/Debian
sudo apt-get install gcc-riscv64-linux-gnu
Creating a Cross-Compilation Makefile
Let’s create a Makefile that supports both native and cross-compilation:
# Default to native compilation
TARGET_ARCH ?= native
# Compiler and flags
ifeq ($(TARGET_ARCH),native)
CC = gcc
AR = ar
else ifeq ($(TARGET_ARCH),arm)
CC = arm-linux-gnueabihf-gcc
AR = arm-linux-gnueabihf-ar
else ifeq ($(TARGET_ARCH),riscv)
CC = riscv64-linux-gnu-gcc
AR = riscv64-linux-gnu-ar
else
$(error Unsupported TARGET_ARCH: $(TARGET_ARCH))
endif
# Common flags
CFLAGS = -Wall -Wextra -std=c11
# Architecture-specific flags
ifeq ($(TARGET_ARCH),native)
CFLAGS += -march=native -O3
else ifeq ($(TARGET_ARCH),arm)
# Cortex-A7 targets boards like the Raspberry Pi 2; use -mcpu=cortex-a72 for a Raspberry Pi 4
CFLAGS += -mcpu=cortex-a7 -mfpu=neon-vfpv4 -mfloat-abi=hard -O3
else ifeq ($(TARGET_ARCH),riscv)
CFLAGS += -march=rv64gc -O3
endif
# Source files
SRC = tensor.c fixed_point.c quantization.c
OBJ = $(SRC:.c=.o)
# Library name
LIB = libtensor.a
# Build the library
$(LIB): $(OBJ)
$(AR) rcs $@ $^
# Compile source files
%.o: %.c
$(CC) $(CFLAGS) -c $< -o $@
# Clean build artifacts
clean:
rm -f $(OBJ) $(LIB)
# Example target
example: example.c $(LIB)
$(CC) $(CFLAGS) $^ -o $@ -lm
.PHONY: clean
To use this Makefile, you can specify the target architecture:
# For native compilation
make
# For ARM cross-compilation
make TARGET_ARCH=arm
# For RISC-V cross-compilation
make TARGET_ARCH=riscv
Testing Cross-Compiled Code
After cross-compiling, you need to test your code on the target device or using an emulator:
# Copy to Raspberry Pi (example)
scp example pi@raspberrypi.local:~/
# Or use QEMU for emulation
qemu-arm -L /usr/arm-linux-gnueabihf ./example
Now that we have our cross-compilation environment set up, let’s adapt our tensor library for embedded systems.
Implementing Fixed-Point Arithmetic
Many embedded systems lack floating-point hardware, so floating-point operations must be emulated in software and become very slow. Fixed-point arithmetic provides a faster alternative by representing fractional numbers using integers.
Fixed-Point Representation
In fixed-point representation, we allocate a fixed number of bits to the integer part and a fixed number to the fractional part. For example, in the Q16.16 format, we use 16 bits for the integer part (including the sign) and 16 bits for the fractional part, giving us a 32-bit number overall. This covers roughly -32768 to +32768 with a resolution of 1/65536; the value 3.25, for instance, is stored as 3.25 × 65536 = 212992.
Let’s define our fixed-point type:
#include <stdint.h>
// Q16.16 fixed-point format
typedef int32_t fixed_t;
// Number of fractional bits
#define FIXED_FRAC_BITS 16
// Convert float to fixed-point
fixed_t float_to_fixed(float f) {
return (fixed_t)(f * (1 << FIXED_FRAC_BITS));
}
// Convert fixed-point to float
float fixed_to_float(fixed_t f) {
return ((float)f) / (1 << FIXED_FRAC_BITS);
}
Basic Fixed-Point Operations
Now let’s implement basic arithmetic operations for fixed-point numbers:
// Addition: straightforward integer addition
fixed_t fixed_add(fixed_t a, fixed_t b) {
return a + b;
}
// Subtraction: straightforward integer subtraction
fixed_t fixed_sub(fixed_t a, fixed_t b) {
return a - b;
}
// Multiplication: need to handle the fractional bits
fixed_t fixed_mul(fixed_t a, fixed_t b) {
// Use 64-bit intermediate to prevent overflow
int64_t result = (int64_t)a * (int64_t)b;
// Shift right to account for the extra fractional bits
return (fixed_t)(result >> FIXED_FRAC_BITS);
}
// Division: need to handle the fractional bits
// Note: the caller must ensure b is non-zero
fixed_t fixed_div(fixed_t a, fixed_t b) {
// Pre-shift a so the quotient keeps its fractional bits
int64_t result = ((int64_t)a << FIXED_FRAC_BITS) / (int64_t)b;
return (fixed_t)result;
}
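A quick round-trip check helps confirm that the conversions and arithmetic behave as expected. This is a minimal test harness, not part of the library; it assumes the fixed-point helpers above are available through the fixed_point.h header used later in this chapter:
#include <stdio.h>
#include "fixed_point.h"

int main(void) {
    fixed_t a = float_to_fixed(3.25f);
    fixed_t b = float_to_fixed(-1.5f);

    // 3.25 * -1.5 = -4.875; the Q16.16 result should match to within 1/65536
    printf("product:  %f\n", fixed_to_float(fixed_mul(a, b)));

    // 3.25 / -1.5 = -2.1666...; the quotient is truncated to 16 fractional bits
    printf("quotient: %f\n", fixed_to_float(fixed_div(a, b)));
    return 0;
}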
Fixed-Point Tensor Implementation
Now let’s create a fixed-point version of our tensor structure:
typedef struct {
int ndim; // Number of dimensions
int* dims; // Size of each dimension
int size; // Total number of elements
fixed_t* data; // Pointer to data
} fixed_tensor_t;
// Create a new fixed-point tensor
fixed_tensor_t* fixed_tensor_create(int ndim, int* dims) {
fixed_tensor_t* tensor = (fixed_tensor_t*)malloc(sizeof(fixed_tensor_t));
if (!tensor) return NULL;
tensor->ndim = ndim;
tensor->dims = (int*)malloc(ndim * sizeof(int));
if (!tensor->dims) {
free(tensor);
return NULL;
}
tensor->size = 1;
for (int i = 0; i < ndim; i++) {
tensor->dims[i] = dims[i];
tensor->size *= dims[i];
}
tensor->data = (fixed_t*)calloc(tensor->size, sizeof(fixed_t));
if (!tensor->data) {
free(tensor->dims);
free(tensor);
return NULL;
}
return tensor;
}
// Free a fixed-point tensor
void fixed_tensor_free(fixed_tensor_t* tensor) {
if (!tensor) return;
free(tensor->dims);
free(tensor->data);
free(tensor);
}
// Convert a float tensor to fixed-point
fixed_tensor_t* tensor_to_fixed(tensor_t* float_tensor) {
fixed_tensor_t* fixed_tensor = fixed_tensor_create(float_tensor->ndim, float_tensor->dims);
if (!fixed_tensor) return NULL;
for (int i = 0; i < float_tensor->size; i++) {
fixed_tensor->data[i] = float_to_fixed(float_tensor->data[i]);
}
return fixed_tensor;
}
// Convert a fixed-point tensor to float
tensor_t* fixed_to_tensor(fixed_tensor_t* fixed_tensor) {
tensor_t* float_tensor = tensor_create(fixed_tensor->ndim, fixed_tensor->dims);
if (!float_tensor) return NULL;
for (int i = 0; i < fixed_tensor->size; i++) {
float_tensor->data[i] = fixed_to_float(fixed_tensor->data[i]);
}
return float_tensor;
}
Implementing Tensor Operations with Fixed-Point
Let’s implement some basic tensor operations using fixed-point arithmetic:
// Element-wise addition
void fixed_tensor_add(fixed_tensor_t* a, fixed_tensor_t* b, fixed_tensor_t* result) {
// Assume dimensions are compatible
for (int i = 0; i < result->size; i++) {
result->data[i] = fixed_add(a->data[i], b->data[i]);
}
}
// Element-wise multiplication
void fixed_tensor_mul(fixed_tensor_t* a, fixed_tensor_t* b, fixed_tensor_t* result) {
// Assume dimensions are compatible
for (int i = 0; i < result->size; i++) {
result->data[i] = fixed_mul(a->data[i], b->data[i]);
}
}
// Matrix multiplication
void fixed_tensor_matmul(fixed_tensor_t* a, fixed_tensor_t* b, fixed_tensor_t* result) {
// Assume a is [M,K] and b is [K,N], result is [M,N]
int M = a->dims[0];
int K = a->dims[1];
int N = b->dims[1];
// Initialize result to zero
memset(result->data, 0, result->size * sizeof(fixed_t));
// Naive matrix multiplication
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
fixed_t sum = 0;
for (int k = 0; k < K; k++) {
fixed_t a_val = a->data[m * K + k];
fixed_t b_val = b->data[k * N + n];
sum = fixed_add(sum, fixed_mul(a_val, b_val));
}
result->data[m * N + n] = sum;
}
}
}
Optimizing Fixed-Point Operations
Fixed-point operations can be further optimized for specific architectures. For example, on Cortex-M4 and newer cores with the DSP extension, the CMSIS-DSP library provides q31 matrix routines. Note that these routines assume the Q1.31 format, so a Q16.16 result needs an extra left shift of 15 bits to restore its scale:
#ifdef __ARM_FEATURE_DSP
#include <arm_math.h>
// Optimized matrix multiplication using ARM DSP library
void fixed_tensor_matmul_optimized(fixed_tensor_t* a, fixed_tensor_t* b, fixed_tensor_t* result) {
// Assume a is [M,K] and b is [K,N], result is [M,N]
int M = a->dims[0];
int K = a->dims[1];
int N = b->dims[1];
// Use the CMSIS-DSP q31 matrix multiply.
// Caution: the q31 routines assume the Q1.31 format; with Q16.16 operands the
// raw result must be shifted left by 15 bits to restore the Q16.16 scale.
arm_matrix_instance_q31 a_matrix;
arm_matrix_instance_q31 b_matrix;
arm_matrix_instance_q31 result_matrix;
arm_mat_init_q31(&a_matrix, M, K, (q31_t*)a->data);
arm_mat_init_q31(&b_matrix, K, N, (q31_t*)b->data);
arm_mat_init_q31(&result_matrix, M, N, (q31_t*)result->data);
arm_mat_mult_q31(&a_matrix, &b_matrix, &result_matrix);
// Restore the Q16.16 scale (see the caution above); some low-order precision is lost
for (int i = 0; i < result->size; i++) {
result->data[i] <<= 15;
}
}
#endif
Fixed-point arithmetic is a powerful technique for embedded systems, but it requires careful handling of precision and range. Let’s now look at another approach: quantization.
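One concrete way to handle the range problem is to saturate on overflow rather than let the result wrap around. Here is a sketch of a saturating variant of fixed_mul; it is an illustration only and is not used elsewhere in this chapter:
#include <stdint.h>

// Saturating Q16.16 multiply: clamp to the representable range instead of wrapping
fixed_t fixed_mul_sat(fixed_t a, fixed_t b) {
    int64_t result = ((int64_t)a * (int64_t)b) >> FIXED_FRAC_BITS;
    if (result > INT32_MAX) return INT32_MAX;   // about +32768.0
    if (result < INT32_MIN) return INT32_MIN;   // about -32768.0
    return (fixed_t)result;
}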
Quantizing Tensor Operations
Quantization is the process of mapping floating-point values to integers with lower precision. Unlike fixed-point, which uses a fixed scaling factor, quantization can adapt the scaling factor based on the data range.
Linear Quantization
The simplest form of quantization is linear quantization, where we map a floating-point range [min, max] to an integer range [0, 255] (for 8-bit quantization):
typedef struct {
uint8_t* data; // Quantized data
float scale; // Scale factor: float = (int - zero_point) * scale
uint8_t zero_point; // Value that represents 0 in the original data
int size; // Number of elements
} quant_tensor_t;
// Quantize a float tensor to 8-bit
quant_tensor_t* tensor_quantize(tensor_t* float_tensor) {
quant_tensor_t* q_tensor = (quant_tensor_t*)malloc(sizeof(quant_tensor_t));
if (!q_tensor) return NULL;
// Find min and max values
float min_val = float_tensor->data[0];
float max_val = float_tensor->data[0];
for (int i = 1; i < float_tensor->size; i++) {
if (float_tensor->data[i] < min_val) min_val = float_tensor->data[i];
if (float_tensor->data[i] > max_val) max_val = float_tensor->data[i];
}
// Compute scale and zero_point (guard against a constant tensor, where max == min)
float scale = (max_val - min_val) / 255.0f;
if (scale == 0.0f) scale = 1.0f;
uint8_t zero_point = (uint8_t)CLAMP((int)roundf(-min_val / scale), 0, 255);
// Allocate and fill quantized data
q_tensor->data = (uint8_t*)malloc(float_tensor->size * sizeof(uint8_t));
if (!q_tensor->data) {
free(q_tensor);
return NULL;
}
for (int i = 0; i < float_tensor->size; i++) {
float val = float_tensor->data[i];
int quantized = (int)roundf(val / scale) + zero_point;
q_tensor->data[i] = (uint8_t)CLAMP(quantized, 0, 255);
}
q_tensor->scale = scale;
q_tensor->zero_point = zero_point;
q_tensor->size = float_tensor->size;
return q_tensor;
}
// Dequantize back to float
tensor_t* tensor_dequantize(quant_tensor_t* q_tensor, int ndim, int* dims) {
tensor_t* float_tensor = tensor_create(ndim, dims);
if (!float_tensor) return NULL;
for (int i = 0; i < q_tensor->size; i++) {
float val = ((int)q_tensor->data[i] - q_tensor->zero_point) * q_tensor->scale;
float_tensor->data[i] = val;
}
return float_tensor;
}
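Because 8-bit quantization is lossy, it is worth measuring the reconstruction error before relying on it. The helper below is a sketch that quantizes a tensor, dequantizes it, and reports the worst-case absolute error; for linear 8-bit quantization this error is bounded by roughly half the scale:
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

void report_quantization_error(tensor_t* t) {
    quant_tensor_t* q = tensor_quantize(t);
    if (!q) return;
    tensor_t* back = tensor_dequantize(q, t->ndim, t->dims);
    if (back) {
        float max_err = 0.0f;
        for (int i = 0; i < t->size; i++) {
            float err = fabsf(t->data[i] - back->data[i]);
            if (err > max_err) max_err = err;
        }
        printf("max abs error: %f (scale: %f)\n", max_err, q->scale);
        tensor_free(back);
    }
    free(q->data);
    free(q);
}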
Quantized Matrix Multiplication
Quantized operations require careful handling of scales and zero points. Here's an implementation of quantized matrix multiplication (the CLAMP macro below is also used by tensor_quantize above, so define it near the top of your source file):
// Helper macro to clamp values
#define CLAMP(x, low, high) ((x) < (low) ? (low) : ((x) > (high) ? (high) : (x)))
// Quantized matrix multiplication
void quant_tensor_matmul(quant_tensor_t* a, quant_tensor_t* b, quant_tensor_t* result,
int M, int K, int N) {
// Compute the result scale
float result_scale = a->scale * b->scale;
// Perform matrix multiplication with dequantization and requantization
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
int32_t sum = 0;
for (int k = 0; k < K; k++) {
int a_val = (int)a->data[m * K + k] - a->zero_point;
int b_val = (int)b->data[k * N + n] - b->zero_point;
sum += a_val * b_val;
}
// Requantize the result
float float_result = sum * result_scale;
int quantized = (int)roundf(float_result / result->scale) + result->zero_point;
result->data[m * N + n] = (uint8_t)CLAMP(quantized, 0, 255);
}
}
}
Optimizing Quantized Operations
Many modern embedded processors have instructions designed for low-precision integer arithmetic. For example, ARM NEON provides SIMD instructions that process sixteen 8-bit values at a time. The kernel below makes two assumptions worth stating explicitly: the second operand b is stored transposed (row-major [N, K]) so that its elements along k are contiguous in memory, and the across-vector reduction vaddvq_s32 requires AArch64:
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
// NEON-optimized quantized matrix multiplication
// Note: b must be stored transposed ([N, K] row-major) so its k-elements are contiguous
void quant_tensor_matmul_neon(quant_tensor_t* a, quant_tensor_t* b, quant_tensor_t* result,
int M, int K, int N) {
// Compute the result scale
float result_scale = a->scale * b->scale;
for (int m = 0; m < M; m++) {
for (int n = 0; n < N; n++) {
int32_t sum = 0;
// Process 16 elements at a time using NEON
int k = 0;
for (; k <= K - 16; k += 16) {
uint8x16_t a_vec = vld1q_u8(&a->data[m * K + k]);
uint8x16_t b_vec = vld1q_u8(&b->data[n * K + k]); // contiguous because b is transposed
// Widen to int16 and subtract zero points
int16x8_t a_low = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(a_vec)));
int16x8_t a_high = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(a_vec)));
int16x8_t b_low = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(b_vec)));
int16x8_t b_high = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(b_vec)));
a_low = vsubq_s16(a_low, vdupq_n_s16(a->zero_point));
a_high = vsubq_s16(a_high, vdupq_n_s16(a->zero_point));
b_low = vsubq_s16(b_low, vdupq_n_s16(b->zero_point));
b_high = vsubq_s16(b_high, vdupq_n_s16(b->zero_point));
// Widening multiply-accumulate: products can exceed int16 range, so use vmull/vmlal
int32x4_t prod = vmull_s16(vget_low_s16(a_low), vget_low_s16(b_low));
prod = vmlal_s16(prod, vget_high_s16(a_low), vget_high_s16(b_low));
prod = vmlal_s16(prod, vget_low_s16(a_high), vget_low_s16(b_high));
prod = vmlal_s16(prod, vget_high_s16(a_high), vget_high_s16(b_high));
sum += vaddvq_s32(prod);
}
// Handle remaining elements
for (; k < K; k++) {
int a_val = (int)a->data[m * K + k] - a->zero_point;
int b_val = (int)b->data[n * K + k] - b->zero_point;
sum += a_val * b_val;
}
// Requantize the result
float float_result = sum * result_scale;
int quantized = (int)roundf(float_result / result->scale) + result->zero_point;
result->data[m * N + n] = (uint8_t)CLAMP(quantized, 0, 255);
}
}
}
#endif
Quantization is a powerful technique for reducing memory usage and improving performance on embedded systems. However, it comes with a trade-off in accuracy. Let’s now look at memory optimization techniques.
Memory Optimization Techniques
Embedded systems often have severe memory constraints. Here are some techniques to optimize memory usage in tensor applications:
In-Place Operations
In-place operations modify tensors directly without creating temporary copies:
// In-place element-wise addition
void tensor_add_inplace(tensor_t* a, tensor_t* b) {
// Assume dimensions are compatible
for (int i = 0; i < a->size; i++) {
a->data[i] += b->data[i];
}
}
// In-place ReLU activation
void tensor_relu_inplace(tensor_t* tensor) {
for (int i = 0; i < tensor->size; i++) {
if (tensor->data[i] < 0) tensor->data[i] = 0;
}
}
Memory Pooling
Instead of allocating and freeing memory for each operation, use a memory pool:
typedef struct {
void* memory; // Pointer to the memory pool
size_t size; // Total size of the pool
size_t used; // Amount of memory currently used
} memory_pool_t;
// Initialize a memory pool
memory_pool_t* memory_pool_create(size_t size) {
memory_pool_t* pool = (memory_pool_t*)malloc(sizeof(memory_pool_t));
if (!pool) return NULL;
pool->memory = malloc(size);
if (!pool->memory) {
free(pool);
return NULL;
}
pool->size = size;
pool->used = 0;
return pool;
}
// Allocate memory from the pool
void* memory_pool_alloc(memory_pool_t* pool, size_t size) {
// Align size to 8 bytes
size = (size + 7) & ~7;
if (pool->used + size > pool->size) {
return NULL; // Not enough memory
}
void* ptr = (char*)pool->memory + pool->used;
pool->used += size;
return ptr;
}
// Reset the pool (free all allocations)
void memory_pool_reset(memory_pool_t* pool) {
pool->used = 0;
}
// Free the pool
void memory_pool_free(memory_pool_t* pool) {
if (!pool) return;
free(pool->memory);
free(pool);
}
// Create a tensor using the memory pool
tensor_t* tensor_create_from_pool(memory_pool_t* pool, int ndim, int* dims) {
tensor_t* tensor = (tensor_t*)memory_pool_alloc(pool, sizeof(tensor_t));
if (!tensor) return NULL;
tensor->ndim = ndim;
tensor->dims = (int*)memory_pool_alloc(pool, ndim * sizeof(int));
if (!tensor->dims) return NULL;
tensor->size = 1;
for (int i = 0; i < ndim; i++) {
tensor->dims[i] = dims[i];
tensor->size *= dims[i];
}
tensor->data = (float*)memory_pool_alloc(pool, tensor->size * sizeof(float));
if (!tensor->data) return NULL;
// Initialize to zero
memset(tensor->data, 0, tensor->size * sizeof(float));
return tensor;
}
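A typical usage pattern is to size the pool once at startup, carve per-frame tensors out of it, and reset it after each frame instead of freeing tensors individually. The sketch below illustrates this; app_init, process_frame, and the 64 KB pool size are hypothetical choices for a streaming workload:
static memory_pool_t* g_pool;   // created once at startup

void app_init(void) {
    g_pool = memory_pool_create(64 * 1024);   // sized for the worst-case frame
}

void process_frame(const float* frame, int frame_len) {
    int dims[1] = { frame_len };
    tensor_t* input = tensor_create_from_pool(g_pool, 1, dims);
    if (!input) return;                       // pool exhausted: skip or report

    for (int i = 0; i < frame_len; i++) input->data[i] = frame[i];

    // ... run tensor operations that draw intermediates from g_pool ...

    memory_pool_reset(g_pool);                // reclaim everything in O(1)
}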
Tensor Reuse
Instead of creating new tensors for each operation, reuse existing ones:
// Reuse a tensor for a new operation
void tensor_reuse(tensor_t* tensor, int ndim, int* dims) {
// Check if dimensions match
int new_size = 1;
for (int i = 0; i < ndim; i++) {
new_size *= dims[i];
}
// If the new size is larger, we need to reallocate
if (new_size > tensor->size) {
free(tensor->data);
tensor->data = (float*)malloc(new_size * sizeof(float));
if (!tensor->data) {
// Handle allocation failure
tensor->size = 0;
return;
}
}
// Update dimensions
free(tensor->dims);
tensor->dims = (int*)malloc(ndim * sizeof(int));
if (!tensor->dims) {
// Handle allocation failure
free(tensor->data);
tensor->data = NULL;
tensor->size = 0;
return;
}
tensor->ndim = ndim;
tensor->size = new_size;
for (int i = 0; i < ndim; i++) {
tensor->dims[i] = dims[i];
}
// Initialize to zero
memset(tensor->data, 0, tensor->size * sizeof(float));
}
Static Memory Allocation
For extremely constrained systems, use static allocation instead of dynamic:
#define MAX_TENSOR_SIZE 1024
typedef struct {
int ndim;
int dims[4]; // Maximum 4 dimensions
int size;
float data[MAX_TENSOR_SIZE];
} static_tensor_t;
// Initialize a static tensor
void static_tensor_init(static_tensor_t* tensor, int ndim, int* dims) {
tensor->ndim = ndim;
tensor->size = 1;
for (int i = 0; i < ndim; i++) {
tensor->dims[i] = dims[i];
tensor->size *= dims[i];
}
// Check if the tensor fits in the static buffer
if (tensor->size > MAX_TENSOR_SIZE) {
fprintf(stderr, "Error: Tensor size exceeds maximum\n");
tensor->size = 0;
return;
}
// Initialize to zero
memset(tensor->data, 0, tensor->size * sizeof(float));
}
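Because a static_tensor_t carries its own buffer, it can be declared at file scope. Its footprint is then fixed at link time and visible in the memory map (typically in .bss), which makes it easy to verify that the application fits in RAM. A brief sketch; g_features and sensor_init are hypothetical names:
// Lives in .bss: no heap, no allocation-failure paths, size visible in the map file
static static_tensor_t g_features;

void sensor_init(void) {
    int dims[2] = {16, 16};   // 256 floats, well under MAX_TENSOR_SIZE
    static_tensor_init(&g_features, 2, dims);
}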
These memory optimization techniques are crucial for embedded systems with limited RAM. Now let’s look at optimizing for real-time performance.
Real-Time Inference Pipeline
Many embedded applications require real-time inference, where tensor operations must complete within strict time constraints. Let’s design a real-time inference pipeline for a simple neural network:
typedef struct {
// Layer weights and biases (quantized)
quant_tensor_t* weights1;
quant_tensor_t* bias1;
quant_tensor_t* weights2;
quant_tensor_t* bias2;
// Layer dimensions
int input_size;
int hidden_size;
int output_size;
// Pre-allocated buffers for intermediate results
quant_tensor_t* input_buffer;
quant_tensor_t* hidden_buffer;
quant_tensor_t* output_buffer;
// Memory pool for temporary allocations
memory_pool_t* pool;
} rt_network_t;
// Initialize the real-time network
rt_network_t* rt_network_create(int input_size, int hidden_size, int output_size,
float* weights1, float* bias1,
float* weights2, float* bias2) {
rt_network_t* net = (rt_network_t*)malloc(sizeof(rt_network_t));
if (!net) return NULL;
net->input_size = input_size;
net->hidden_size = hidden_size;
net->output_size = output_size;
// Create a memory pool for temporary allocations
net->pool = memory_pool_create(1024 * 1024); // 1MB pool
if (!net->pool) {
free(net);
return NULL;
}
// Create and quantize weights and biases
// (Implementation omitted for brevity)
// Pre-allocate buffers for intermediate results
// (Implementation omitted for brevity)
return net;
}
// Perform inference with the real-time network
void rt_network_inference(rt_network_t* net, float* input, float* output) {
// Reset the memory pool for this inference
memory_pool_reset(net->pool);
// Quantize the input
for (int i = 0; i < net->input_size; i++) {
// Simplified quantization for illustration
net->input_buffer->data[i] = (uint8_t)(input[i] * 255.0f);
}
// First layer: input -> hidden (weights1 is stored transposed, as the NEON kernel expects)
quant_tensor_matmul_neon(net->input_buffer, net->weights1, net->hidden_buffer,
1, net->input_size, net->hidden_size);
// Add bias and apply ReLU (simplified: assumes the bias shares the layer's quantization parameters)
for (int i = 0; i < net->hidden_size; i++) {
int val = net->hidden_buffer->data[i] + net->bias1->data[i] - net->bias1->zero_point;
net->hidden_buffer->data[i] = (uint8_t)CLAMP(val, 0, 255);
// ReLU: max(0, x)
if (val < net->hidden_buffer->zero_point) {
net->hidden_buffer->data[i] = net->hidden_buffer->zero_point;
}
}
// Second layer: hidden -> output (weights2 is also stored transposed)
quant_tensor_matmul_neon(net->hidden_buffer, net->weights2, net->output_buffer,
1, net->hidden_size, net->output_size);
// Add bias
for (int i = 0; i < net->output_size; i++) {
int val = net->output_buffer->data[i] + net->bias2->data[i] - net->bias2->zero_point;
net->output_buffer->data[i] = (uint8_t)CLAMP(val, 0, 255);
}
// Dequantize the output
for (int i = 0; i < net->output_size; i++) {
// Simplified dequantization for illustration
output[i] = ((int)net->output_buffer->data[i] - net->output_buffer->zero_point) *
net->output_buffer->scale;
}
}
This real-time inference pipeline uses pre-allocated buffers and a memory pool to avoid dynamic memory allocation during inference, which is crucial for predictable execution time.
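Here is a sketch of how such a pipeline might be driven from a periodic control loop, with a per-inference deadline check; the control_loop function and the 10 ms deadline are illustrative assumptions, not part of the network API above:
#include <stdio.h>
#include <time.h>

#define DEADLINE_MS 10.0

void control_loop(rt_network_t* net, float* sensor_input, float* actuator_output) {
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    rt_network_inference(net, sensor_input, actuator_output);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed_ms = (end.tv_sec - start.tv_sec) * 1000.0 +
                        (end.tv_nsec - start.tv_nsec) / 1000000.0;

    // In a hard real-time system a missed deadline would trigger a fault handler,
    // not just a log message
    if (elapsed_ms > DEADLINE_MS) {
        fprintf(stderr, "deadline miss: %.3f ms\n", elapsed_ms);
    }
}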
Profiling and Benchmarking on Embedded Systems
To optimize tensor operations for embedded systems, you need to profile and benchmark your code on the target hardware. Here’s a simple benchmarking framework:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "tensor.h"
// Benchmark a function
double benchmark_function(void (*func)(void*), void* args, int iterations) {
struct timespec start, end;
// Warm-up run
func(args);
// Start timing
clock_gettime(CLOCK_MONOTONIC, &start);
// Run the function multiple times
for (int i = 0; i < iterations; i++) {
func(args);
}
// End timing
clock_gettime(CLOCK_MONOTONIC, &end);
// Calculate elapsed time in milliseconds
double elapsed = (end.tv_sec - start.tv_sec) * 1000.0;
elapsed += (end.tv_nsec - start.tv_nsec) / 1000000.0;
return elapsed / iterations;
}
// Example usage
typedef struct {
tensor_t* a;
tensor_t* b;
tensor_t* result;
} matmul_args_t;
void matmul_func(void* args) {
matmul_args_t* matmul_args = (matmul_args_t*)args;
tensor_matmul(matmul_args->a, matmul_args->b, matmul_args->result);
}
int main() {
// Create test tensors
int dims_a[2] = {64, 64};
int dims_b[2] = {64, 64};
int dims_result[2] = {64, 64};
tensor_t* a = tensor_create(2, dims_a);
tensor_t* b = tensor_create(2, dims_b);
tensor_t* result = tensor_create(2, dims_result);
// Initialize with random values
for (int i = 0; i < a->size; i++) a->data[i] = (float)rand() / RAND_MAX;
for (int i = 0; i < b->size; i++) b->data[i] = (float)rand() / RAND_MAX;
// Prepare benchmark arguments
matmul_args_t args = {a, b, result};
// Run benchmark
double avg_time = benchmark_function(matmul_func, &args, 100);
printf("Average execution time: %.3f ms\n", avg_time);
// Clean up
tensor_free(a);
tensor_free(b);
tensor_free(result);
return 0;
}
When profiling on embedded systems, consider these factors:
- CPU Utilization: Measure the percentage of CPU time your application uses.
- Memory Usage: Track peak memory usage and memory fragmentation (see the sketch after this list).
- Cache Performance: Monitor cache hit/miss rates using hardware performance counters.
- Power Consumption: For battery-powered devices, measure power usage during different operations.
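For the memory-usage point above, Linux exposes a process's peak resident set size through getrusage(); a minimal helper (Linux-specific: ru_maxrss is reported in kilobytes there):
#include <stdio.h>
#include <sys/resource.h>

// Print the peak resident set size of the current process
void print_peak_memory(void) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
        printf("peak RSS: %ld KB\n", usage.ru_maxrss);
    }
}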
Case Study: Tensor Operations on Raspberry Pi
Let’s put everything together with a case study of optimizing tensor operations on a Raspberry Pi:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "tensor.h"
#include "fixed_point.h"
#include "quantization.h"
// Benchmark different implementations
void benchmark_matrix_multiplication() {
printf("Benchmarking matrix multiplication implementations...\n");
// Create test matrices
int dims_a[2] = {128, 128};
int dims_b[2] = {128, 128};
int dims_c[2] = {128, 128};
tensor_t* a = tensor_create(2, dims_a);
tensor_t* b = tensor_create(2, dims_b);
tensor_t* c = tensor_create(2, dims_c);
// Initialize with random values
for (int i = 0; i < a->size; i++) a->data[i] = (float)rand() / RAND_MAX;
for (int i = 0; i < b->size; i++) b->data[i] = (float)rand() / RAND_MAX;
// Convert to fixed-point
fixed_tensor_t* a_fixed = tensor_to_fixed(a);
fixed_tensor_t* b_fixed = tensor_to_fixed(b);
fixed_tensor_t* c_fixed = fixed_tensor_create(2, dims_c);
// Quantize to 8-bit
quant_tensor_t* a_quant = tensor_quantize(a);
quant_tensor_t* b_quant = tensor_quantize(b);
quant_tensor_t* c_quant = (quant_tensor_t*)malloc(sizeof(quant_tensor_t));
c_quant->data = (uint8_t*)malloc(c->size * sizeof(uint8_t));
c_quant->scale = 0.1f;
c_quant->zero_point = 128;
c_quant->size = c->size;
// Benchmark floating-point implementation
struct timespec start, end;
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < 10; i++) {
tensor_matmul(a, b, c);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double float_time = (end.tv_sec - start.tv_sec) * 1000.0 +
(end.tv_nsec - start.tv_nsec) / 1000000.0;
float_time /= 10.0;
// Benchmark fixed-point implementation
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < 10; i++) {
fixed_tensor_matmul(a_fixed, b_fixed, c_fixed);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double fixed_time = (end.tv_sec - start.tv_sec) * 1000.0 +
(end.tv_nsec - start.tv_nsec) / 1000000.0;
fixed_time /= 10.0;
// Benchmark quantized implementation
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < 10; i++) {
quant_tensor_matmul(a_quant, b_quant, c_quant, dims_a[0], dims_a[1], dims_b[1]);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double quant_time = (end.tv_sec - start.tv_sec) * 1000.0 +
(end.tv_nsec - start.tv_nsec) / 1000000.0;
quant_time /= 10.0;
// Benchmark NEON-optimized quantized implementation
// (timing only: the NEON kernel expects b in transposed layout; guard this call
// with the same #if used for the kernel when building for non-NEON targets)
clock_gettime(CLOCK_MONOTONIC, &start);
for (int i = 0; i < 10; i++) {
quant_tensor_matmul_neon(a_quant, b_quant, c_quant, dims_a[0], dims_a[1], dims_b[1]);
}
clock_gettime(CLOCK_MONOTONIC, &end);
double neon_time = (end.tv_sec - start.tv_sec) * 1000.0 +
(end.tv_nsec - start.tv_nsec) / 1000000.0;
neon_time /= 10.0;
// Print results
printf("Floating-point: %.3f ms\n", float_time);
printf("Fixed-point: %.3f ms (%.1fx speedup)\n",
fixed_time, float_time / fixed_time);
printf("Quantized: %.3f ms (%.1fx speedup)\n",
quant_time, float_time / quant_time);
printf("NEON-optimized: %.3f ms (%.1fx speedup)\n",
neon_time, float_time / neon_time);
// Clean up
tensor_free(a);
tensor_free(b);
tensor_free(c);
fixed_tensor_free(a_fixed);
fixed_tensor_free(b_fixed);
fixed_tensor_free(c_fixed);
free(a_quant->data);
free(a_quant);
free(b_quant->data);
free(b_quant);
free(c_quant->data);
free(c_quant);
}
int main() {
// Seed random number generator
srand(time(NULL));
// Run benchmarks
benchmark_matrix_multiplication();
return 0;
}
Typical results on a Raspberry Pi 4 might look like:
Benchmarking matrix multiplication implementations...
Floating-point: 45.782 ms
Fixed-point: 12.345 ms (3.7x speedup)
Quantized: 5.678 ms (8.1x speedup)
NEON-optimized: 1.234 ms (37.1x speedup)
These results demonstrate the significant performance improvements possible with fixed-point arithmetic, quantization, and SIMD optimizations on embedded systems.
Summary
In this chapter, we’ve explored techniques for deploying tensor applications on embedded systems:
- Setting up a cross-compilation environment for ARM and other architectures
- Implementing fixed-point arithmetic for systems without FPU support
- Quantizing floating-point models to 8-bit integers
- Optimizing memory usage with pooling, reuse, and static allocation
- Building real-time inference pipelines with predictable performance
- Profiling and benchmarking tensor operations on embedded hardware
By applying these techniques, you can adapt the tensor library we’ve built throughout this book for efficient execution on resource-constrained devices. The key is to understand the specific constraints of your target platform and choose the appropriate optimizations.
Remember that embedded development often involves trade-offs between performance, memory usage, and accuracy. The best approach depends on your specific application requirements and hardware constraints.
Exercises
- Implement a Fixed-Point Convolution Layer
Extend the fixed-point tensor implementation to support 2D convolution operations, which are common in image processing and computer vision applications.
Hint: Start with a naive implementation using nested loops, then optimize it with techniques like im2col to convert convolution to matrix multiplication.
// Partial solution: im2col function for fixed-point convolution
void fixed_im2col(fixed_tensor_t* input, fixed_tensor_t* output,
int kernel_h, int kernel_w, int stride_h, int stride_w,
int pad_h, int pad_w) {
// Assume input is [batch, height, width, channels]
int batch = input->dims[0];
int height = input->dims[1];
int width = input->dims[2];
int channels = input->dims[3];
// Calculate output dimensions
int out_h = (height + 2 * pad_h - kernel_h) / stride_h + 1;
int out_w = (width + 2 * pad_w - kernel_w) / stride_w + 1;
// Each column of the output corresponds to one location in the output feature map;
// each row corresponds to one kernel element for one input channel
int col_idx = 0;
for (int b = 0; b < batch; b++) {
for (int oh = 0; oh < out_h; oh++) {
for (int ow = 0; ow < out_w; ow++) {
for (int kh = 0; kh < kernel_h; kh++) {
for (int kw = 0; kw < kernel_w; kw++) {
int h = oh * stride_h - pad_h + kh;
int w = ow * stride_w - pad_w + kw;
if (h >= 0 && h < height && w >= 0 && w < width) {
for (int c = 0; c < channels; c++) {
int input_idx = ((b * height + h) * width + w) * channels + c;
int output_idx = (kh * kernel_w * channels + kw * channels + c) *
(batch * out_h * out_w) + col_idx;
output->data[output_idx] = input->data[input_idx];
}
} else {
// Zero padding
for (int c = 0; c < channels; c++) {
int output_idx = (kh * kernel_w * channels + kw * channels + c) *
(batch * out_h * out_w) + col_idx;
output->data[output_idx] = 0;
}
}
}
}
col_idx++;
}
}
}
}
- Optimize Memory Usage for a Neural Network
Take the neural network implementation from Chapter 9 and optimize it for memory-constrained embedded systems. Use techniques like memory pooling, tensor reuse, and in-place operations to minimize memory usage.
Hint: Analyze the memory usage pattern of the forward and backward passes to identify opportunities for reuse.
- Benchmark Different Quantization Schemes
Implement and compare different quantization schemes (symmetric vs. asymmetric, per-tensor vs. per-channel) for a neural network inference task. Measure both performance and accuracy on a real embedded device.
Hint: Start with a pre-trained model, quantize it using different schemes, and compare the results on a validation dataset.
Further Reading
- ARM NEON Programming Guide
- ARM Developer: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon
- NEON Programmer’s Guide: https://developer.arm.com/documentation/den0018/latest/
- Embedded Deep Learning
- TensorFlow Lite for Microcontrollers: https://www.tensorflow.org/lite/microcontrollers
- Pete Warden’s “TinyML”: https://tinymlbook.com/
- Fixed-Point Arithmetic
- “Fixed-Point Arithmetic: An Introduction” by Randy Yates: http://www.digitalsignallabs.com/fp.pdf