Practical Memory Management in Linux: Handling Exhaustion with OOM Killer and Beyond
A comprehensive guide to Linux memory management, exploring OOM Killer behavior, swapping mechanisms, and practical strategies for handling memory exhaustion scenarios.
Core Concepts of Linux Memory Management
Memory Exhaustion Definition and Context
Memory exhaustion in Linux occurs when the system can no longer satisfy memory allocation requests from userspace processes or the kernel itself. Unlike some operating systems that strictly prevent allocations beyond physical memory, Linux implements memory overcommitment, allowing processes to allocate more virtual memory than physically available. This approach optimizes resource utilization based on the observation that many processes never use all their allocated memory simultaneously.
The kernel manages memory through a layered architecture:
- Physical memory management: Handles actual RAM through the buddy allocator system
- Virtual memory management: Creates the illusion of contiguous memory spaces for processes
- Paging subsystem: Transfers memory pages between RAM and secondary storage
When memory pressure increases, Linux employs a series of increasingly aggressive strategies to reclaim memory, culminating in the potential invocation of the OOM killer.
Simple Memory Exhaustion Example
Here’s a simple C program that gradually consumes memory until the system runs out:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
int main() {
const size_t chunk_size = 10 * 1024 * 1024; // 10 MB chunks
unsigned long allocated = 0;
printf("Starting memory exhaustion test\n");
printf("PID: %d\n", getpid());
while(1) {
void *mem = malloc(chunk_size);
if (mem == NULL) {
printf("malloc failed after allocating %lu MB\n", allocated / (1024 * 1024));
break;
}
// Actually touch the memory to ensure it's allocated
for (size_t i = 0; i < chunk_size; i += 4096) {
((char*)mem)[i] = 1;
}
allocated += chunk_size;
printf("Allocated %lu MB\n", allocated / (1024 * 1024));
sleep(1);
}
// This code will likely never execute due to OOM killer
printf("Exiting normally\n");
return 0;
}
Expected Output:
Starting memory exhaustion test
PID: 12345
Allocated 10 MB
Allocated 20 MB
...
Allocated 3540 MB
The program will continue allocating memory until either malloc fails (unlikely due to overcommitment) or, more commonly, the system's OOM killer terminates the process. The kernel log would show something like:
[601053.170127] Out of memory: Kill process 12345 (a.out) score 945 or sacrifice child
[601053.170134] Killed process 12345 (a.out) total-vm:3642MB, anon-rss:3541MB, file-rss:0KB, shmem-rss:0KB
Memory Allocation Strategies
Linux implements three principal overcommitment policies, configurable via the vm.overcommit_memory sysctl parameter:
Heuristic overcommit (value 0): The default setting. The kernel uses a heuristic to determine whether to allow or deny memory allocations, considering factors like available swap space and current memory utilization.
Always overcommit (value 1): The kernel never refuses memory allocation requests, regardless of the system’s memory state. This policy maximizes memory utilization but increases the risk of the OOM killer being invoked.
Never overcommit (value 2): The kernel enforces strict accounting, allowing allocations only up to a commit limit derived from vm.overcommit_ratio or vm.overcommit_kbytes. This is the most conservative approach, trading some memory utilization for predictability.
The commit limit under the "never overcommit" policy is shaped by vm.overcommit_ratio (default 50): the limit equals total swap space plus that percentage of physical RAM (or, if set, the absolute amount in vm.overcommit_kbytes).
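To see this accounting in practice, the following minimal sketch (an illustrative addition, not one of the article's original examples) reads the kernel's own view of the commit limit from /proc/meminfo; under the never-overcommit policy, an allocation that would push Committed_AS past CommitLimit fails:

#include <stdio.h>

// Minimal sketch: print the kernel's commit accounting from /proc/meminfo.
// When vm.overcommit_memory=2, CommitLimit = swap + overcommit_ratio% of RAM
// (or vm.overcommit_kbytes if set), and allocations fail once Committed_AS
// would exceed it.
int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("open /proc/meminfo"); return 1; }
    char line[256];
    unsigned long commit_limit = 0, committed = 0;
    while (fgets(line, sizeof(line), f)) {
        sscanf(line, "CommitLimit: %lu kB", &commit_limit);
        sscanf(line, "Committed_AS: %lu kB", &committed);
    }
    fclose(f);
    printf("CommitLimit:  %lu kB\n", commit_limit);
    printf("Committed_AS: %lu kB\n", committed);
    printf("Headroom:     %ld kB\n", (long)commit_limit - (long)committed);
    return 0;
}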
Demonstrating Overcommit Policies
This C program demonstrates the behavior under different overcommit policies:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main() {
const size_t gigabyte = 1024 * 1024 * 1024;
char command[256];
size_t attempt = 1;
printf("PID: %d\n", getpid());
printf("Attempting to allocate memory in 1GB chunks...\n");
// Get total RAM
sprintf(command, "free -h | grep 'Mem:' | awk '{print $2}'");
printf("Total RAM: ");
fflush(stdout);
system(command);
// Get overcommit policy
sprintf(command, "cat /proc/sys/vm/overcommit_memory");
printf("Current overcommit policy: ");
fflush(stdout);
system(command);
while(1) {
void *mem = malloc(gigabyte);
if (mem == NULL) {
printf("malloc failed on attempt %zu (requested 1GB)\n", attempt);
break;
}
// Touch the memory to ensure pages are allocated
memset(mem, 1, gigabyte);
printf("Successfully allocated %zu GB\n", attempt);
attempt++;
// Keep the memory allocated so pressure accumulates; whether and when
// allocation fails is then determined by the overcommit policy
}
return 0;
}
Expected Output with vm.overcommit_memory=0 (default):
PID: 12346
Attempting to allocate memory in 1GB chunks...
Total RAM: 8.0G
Current overcommit policy: 0
Successfully allocated 1 GB
...
Successfully allocated 12 GB
Successfully allocated 13 GB
malloc failed on attempt 14 (requested 1GB)
Expected Output with vm.overcommit_memory=1:
PID: 12347
Attempting to allocate memory in 1GB chunks...
Total RAM: 8.0G
Current overcommit policy: 1
Successfully allocated 1 GB
...
Successfully allocated 49 GB
Successfully allocated 50 GB
[Process likely terminated by OOM killer]
Expected Output with vm.overcommit_memory=2 and vm.overcommit_ratio=50:
PID: 12348
Attempting to allocate memory in 1GB chunks...
Total RAM: 8.0G
Current overcommit policy: 2
Successfully allocated 1 GB
...
Successfully allocated 11 GB
malloc failed on attempt 12 (requested 1GB)
The OOM Killer: Last Line of Defense
The Out-of-Memory killer serves as the kernel’s final defense mechanism when memory allocations cannot be satisfied through reclamation or swapping. Its primary function is to identify and terminate one or more processes to free sufficient memory for the system to continue functioning.
The OOM killer follows a sophisticated scoring system to select victim processes. Each process receives an oom_score based on factors including:
- Memory consumption (both physical and virtual)
- Runtime duration (favoring shorter-lived processes)
- Process priority
- User-defined adjustments via oom_score_adj
The selection criteria favor killing processes that:
- Consume substantial memory
- Have been running for a short time
- Have lower priority
- Are not critical system processes
This approach aims to minimize disruption while maximizing memory recovery, though its effectiveness depends on the specific workload and system configuration.
Examining OOM Scores
This program displays the OOM score of a process and allows adjusting it:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
int main(int argc, char *argv[]) {
pid_t pid = getpid();
char filename[256];
char command[256];
FILE *file;
int score, adj;
// Print current process ID
printf("Process ID: %d\n", pid);
// Read current oom_score
sprintf(filename, "/proc/%d/oom_score", pid);
file = fopen(filename, "r");
if (file) {
fscanf(file, "%d", &score);
fclose(file);
printf("Current oom_score: %d\n", score);
} else {
perror("Failed to read oom_score");
}
// Read current oom_score_adj
sprintf(filename, "/proc/%d/oom_score_adj", pid);
file = fopen(filename, "r");
if (file) {
fscanf(file, "%d", &adj);
fclose(file);
printf("Current oom_score_adj: %d\n", adj);
} else {
perror("Failed to read oom_score_adj");
}
// Show scores of some system processes for comparison
printf("\nOOM scores of some system processes:\n");
system("ps -eo pid,comm | grep -E 'systemd|sshd|cron' | head -3 | while read pid comm; do echo \"$comm ($pid): $(cat /proc/$pid/oom_score 2>/dev/null || echo 'N/A')\"; done");
// If argument provided, try to adjust our oom_score_adj
if (argc > 1) {
int new_adj = atoi(argv[1]);
printf("\nAttempting to set oom_score_adj to %d\n", new_adj);
sprintf(command, "echo %d > /proc/%d/oom_score_adj", new_adj, pid);
if (system(command) != 0) {
printf("Failed to set oom_score_adj (try running with sudo)\n");
} else {
// Re-read oom_score
sprintf(filename, "/proc/%d/oom_score", pid);
file = fopen(filename, "r");
if (file) {
fscanf(file, "%d", &score);
fclose(file);
printf("New oom_score: %d\n", score);
}
}
}
printf("\nWaiting for 60 seconds so you can examine/modify this process...\n");
printf("Try: echo VALUE > /proc/%d/oom_score_adj\n", pid);
sleep(60);
return 0;
}
Expected Output:
Process ID: 12349
Current oom_score: 42
Current oom_score_adj: 0
OOM scores of some system processes:
systemd (1): 0
sshd (1234): 0
cron (1235): 0
Waiting for 60 seconds so you can examine/modify this process...
Try: echo VALUE > /proc/12349/oom_score_adj
Architectural Overview of Memory Management Under Pressure
Memory Allocation Request Flow
When a program requests memory through an allocator call like malloc() (which itself relies on system calls such as brk() or mmap()), the request follows a complex path through the kernel:
- User space allocation: Libraries like glibc translate application calls into appropriate system calls.
- System call handling: The kernel processes the system call (e.g., brk(), mmap()) and updates the process's virtual memory area (VMA) structures.
- Page table updates: The kernel modifies the process's page tables but typically does not allocate physical memory immediately (lazy allocation).
- Physical allocation: Physical memory is actually allocated when the process accesses the memory, triggering a page fault.
Under memory pressure, this sequence encounters additional checks and potential delays at each stage.
Tracking Memory Allocations
This C program demonstrates and tracks the path of memory allocation:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/resource.h>
// Function to print process memory stats
void print_memory_stats(const char* label) {
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
printf("\n--- %s ---\n", label);
printf(" RSS (reported by getrusage): %ld KB\n", usage.ru_maxrss);
// Use /proc/self/status for more detailed info
FILE* status = fopen("/proc/self/status", "r");
if (status) {
char line[256];
while (fgets(line, sizeof(line), status)) {
if (strncmp(line, "VmSize:", 7) == 0 ||
strncmp(line, "VmRSS:", 6) == 0 ||
strncmp(line, "VmData:", 7) == 0 ||
strncmp(line, "VmStk:", 6) == 0) {
printf(" %s", line);
}
}
fclose(status);
}
// Print address space maps
printf(" Memory maps (partial):\n");
system("head -5 /proc/self/maps");
}
int main() {
const size_t size = 100 * 1024 * 1024; // 100 MB
printf("Demonstrating different memory allocation paths\n");
// Initial state
print_memory_stats("Initial state");
// 1. malloc (glibc typically services an allocation this large via mmap)
void* malloc_mem = malloc(size);
print_memory_stats("After malloc (no touch)");
// Touch memory to trigger actual allocation
memset(malloc_mem, 1, size);
print_memory_stats("After malloc (touched)");
// 2. Anonymous mmap
void* mmap_mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (mmap_mem == MAP_FAILED) {
perror("mmap failed");
return 1;
}
print_memory_stats("After mmap (no touch)");
// Touch memory from mmap
memset(mmap_mem, 2, size);
print_memory_stats("After mmap (touched)");
// 3. Try with MAP_POPULATE (pre-fault pages)
void* populate_mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
if (populate_mem == MAP_FAILED) {
perror("mmap with MAP_POPULATE failed");
return 1;
}
print_memory_stats("After mmap with MAP_POPULATE");
// Cleanup
free(malloc_mem);
munmap(mmap_mem, size);
munmap(populate_mem, size);
print_memory_stats("After cleanup");
return 0;
}
Expected Output:
Demonstrating different memory allocation paths
--- Initial state ---
Peak RSS (getrusage): 3512 KB
VmSize: 4364 kB
VmRSS: 3512 kB
VmData: 504 kB
VmStk: 132 kB
Memory maps (partial):
555555554000-555555556000 r--p 00000000 08:01 1311053 /path/to/executable
555555556000-55555555a000 r-xp 00002000 08:01 1311053 /path/to/executable
55555555a000-55555555c000 r--p 00006000 08:01 1311053 /path/to/executable
55555555c000-55555555d000 r--p 00008000 08:01 1311053 /path/to/executable
55555555d000-55555555e000 rw-p 00009000 08:01 1311053 /path/to/executable
--- After malloc (no touch) ---
Peak RSS (getrusage): 3512 KB
VmSize: 104604 kB
VmRSS: 3512 kB
VmData: 100744 kB
VmStk: 132 kB
Memory maps (partial):
[Similar to above, but with data segment increased]
--- After malloc (touched) ---
Peak RSS (getrusage): 102400 KB
VmSize: 104604 kB
VmRSS: 102400 kB
VmData: 100744 kB
VmStk: 132 kB
Memory maps (partial):
[Similar to above]
--- After mmap (no touch) ---
Peak RSS (getrusage): 102400 KB
VmSize: 204604 kB
VmRSS: 102400 kB
VmData: 100744 kB
VmStk: 132 kB
Memory maps (partial):
[Now includes the anonymous mapping]
--- After mmap (touched) ---
Peak RSS (getrusage): 204800 KB
VmSize: 204604 kB
VmRSS: 204800 kB
VmData: 100744 kB
VmStk: 132 kB
Memory maps (partial):
[Similar to above]
--- After mmap with MAP_POPULATE ---
Peak RSS (getrusage): 307200 KB
VmSize: 304604 kB
VmRSS: 307200 kB
VmData: 100744 kB
VmStk: 132 kB
Memory maps (partial):
[Now includes the pre-populated mapping]
--- After cleanup ---
Peak RSS (getrusage): 307200 KB (the high-water mark persists after cleanup)
VmSize: 4364 kB
VmRSS: 3512 kB
VmData: 504 kB
VmStk: 132 kB
Memory maps (partial):
[Returns to initial state]
Page Fault Handling During Memory Scarcity
Page faults occur when a process accesses memory that hasn’t been mapped to physical RAM. The kernel handles these faults through a sequence of steps:
- Fault detection: The CPU generates an exception when accessing an unmapped address.
- Fault classification: The kernel determines if the fault is legitimate (e.g., first access to a valid allocation) or invalid (e.g., a segmentation fault).
- Memory allocation: For legitimate faults, the kernel attempts to allocate a physical page.
- Page table update: The kernel maps the physical page to the virtual address.
During memory pressure, the allocation stage may trigger memory reclamation, potentially forcing the kernel to swap out existing pages or invoke the OOM killer if reclamation fails.
Demonstrating Page Fault Handling
This program demonstrates the relationship between page faults and memory access:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/resource.h>
void print_fault_stats() {
struct rusage usage;
getrusage(RUSAGE_SELF, &usage);
printf("Major page faults: %ld, Minor page faults: %ld\n",
usage.ru_majflt, usage.ru_minflt);
}
int main() {
const size_t page_size = 4096;
const size_t alloc_size = 100 * 1024 * 1024; // 100 MB
const size_t pages = alloc_size / page_size;
printf("Demonstrating page fault behavior\n");
printf("Initial stats:\n");
print_fault_stats();
// Allocate memory
printf("\nAllocating %zu MB using mmap...\n", alloc_size / (1024 * 1024));
char* memory = mmap(NULL, alloc_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (memory == MAP_FAILED) {
perror("mmap failed");
return 1;
}
printf("After allocation (no access):\n");
print_fault_stats();
// Access first page
printf("\nAccessing first page...\n");
memory[0] = 1;
printf("After accessing first page:\n");
print_fault_stats();
// Access pages with increments to show fault behavior
printf("\nAccessing every 1000th page...\n");
for (size_t i = 0; i < pages; i += 1000) {
memory[i * page_size] = 1;
}
printf("After accessing sparse pages:\n");
print_fault_stats();
// Access all pages
printf("\nAccessing every page (may take a moment)...\n");
for (size_t i = 0; i < pages; i++) {
memory[i * page_size] = 1;
}
printf("After accessing all pages:\n");
print_fault_stats();
// Try to force a major fault by reclaiming memory
printf("\nAttempting to force major page faults...\n");
printf("Running: echo 3 > /proc/sys/vm/drop_caches\n");
printf("Note: This may require sudo privileges\n");
system("sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || echo 'Failed to drop caches (no sudo)'");
// Access memory again after cache drop
printf("\nAccessing memory after cache drop...\n");
for (size_t i = 0; i < pages; i += 100) {
memory[i * page_size] = 2;
}
printf("Final stats:\n");
print_fault_stats();
// Clean up
munmap(memory, alloc_size);
return 0;
}
Expected Output:
Demonstrating page fault behavior
Initial stats:
Major page faults: 0, Minor page faults: 281
Allocating 100 MB using mmap...
After allocation (no access):
Major page faults: 0, Minor page faults: 284
Accessing first page...
After accessing first page:
Major page faults: 0, Minor page faults: 285
Accessing every 1000th page...
After accessing sparse pages:
Major page faults: 0, Minor page faults: 311
Accessing every page (may take a moment)...
After accessing all pages:
Major page faults: 0, Minor page faults: 25786
Attempting to force major page faults...
Running: echo 3 > /proc/sys/vm/drop_caches
Note: This may require sudo privileges
Failed to drop caches (no sudo)
Accessing memory after cache drop...
Final stats:
Major page faults: 0, Minor page faults: 25786
Note that drop_caches only evicts page cache and reclaimable slab objects, not anonymous memory, so even with sudo this mapping stays resident; major faults on it would only appear if genuine memory pressure forced its pages out to swap.
Interactions Between Memory Subsystems
Linux memory management relies on the cooperation of several subsystems:
- Buddy allocator: Manages physical memory pages in power-of-two blocks.
- Slab allocator: Provides efficient allocation for kernel data structures of standard sizes.
- Page cache: Caches file data in memory to accelerate I/O operations.
- Swap subsystem: Transfers memory pages between RAM and swap devices.
- Memory policy enforcement: Applies constraints from cgroups, mempolicy, and other mechanisms.
During memory exhaustion, these components communicate via a complex signaling system that balances immediate needs against long-term performance considerations.
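Several of these components expose their state through procfs; the buddy allocator, for instance, publishes its free lists in /proc/buddyinfo. The following reader is a small illustrative addition (not from the article's original examples):

#include <stdio.h>

// Minimal sketch: summarize the buddy allocator's free lists from
// /proc/buddyinfo. Column N holds the number of free blocks of
// order N, i.e. runs of 2^N contiguous 4 KB pages.
int main(void) {
    FILE *f = fopen("/proc/buddyinfo", "r");
    if (!f) { perror("open /proc/buddyinfo"); return 1; }
    char node[32], zone[32];
    unsigned long counts[16];
    while (fscanf(f, "Node %31[^,], zone %31s", node, zone) == 2) {
        unsigned long free_kb = 0;
        int order = 0;
        while (order < 16 && fscanf(f, "%lu", &counts[order]) == 1) {
            free_kb += counts[order] * (4UL << order); // 4 KB base pages
            order++;
        }
        printf("node %s zone %-8s: %lu kB free across %d orders\n",
               node, zone, free_kb, order);
    }
    fclose(f);
    return 0;
}

Heavily fragmented memory shows up here as large counts at low orders and empty high orders, which is exactly the situation that makes large contiguous allocations fail even when plenty of memory is nominally free.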
Memory Reclamation and Swapping Mechanisms
Page Cache Reclamation
The page cache is a primary target for memory reclamation since its contents can be regenerated from backing storage if needed. Linux maintains two LRU (Least Recently Used) lists:
- Active list: Contains recently accessed pages
- Inactive list: Contains pages that haven’t been accessed recently
The kernel periodically moves pages between these lists based on access patterns, reclaiming pages from the inactive list when memory pressure increases. This process is primarily managed by the kernel thread kswapd.
The vm.vfs_cache_pressure parameter (default 100) controls the kernel's eagerness to reclaim memory used for VFS metadata caches (dentry and inode objects) relative to page cache and swap. Higher values increase the reclamation pressure on these caches.
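Both LRU lists can be inspected without special tooling: /proc/meminfo reports Active(file)/Inactive(file) for page cache pages and Active(anon)/Inactive(anon) for anonymous pages. A minimal sketch, added here for illustration:

#include <stdio.h>
#include <string.h>

// Minimal sketch: show the kernel's LRU list sizes from /proc/meminfo.
// The (file) entries cover the page cache; the (anon) entries cover
// anonymous memory, which must be swapped out to be reclaimed.
int main(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("open /proc/meminfo"); return 1; }
    char line[256];
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "Active", 6) == 0 ||
            strncmp(line, "Inactive", 8) == 0)
            fputs(line, stdout);
    }
    fclose(f);
    return 0;
}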
Testing Page Cache Behavior
This C program demonstrates page cache usage and reclamation:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <time.h>
#include <sys/stat.h>
#include <sys/types.h>
#define TEST_FILE "pagecache_test.bin"
#define FILE_SIZE (100 * 1024 * 1024) // 100 MB
#define BUFFER_SIZE (4 * 1024) // 4 KB
void print_memory_info() {
printf("Page cache status:\n");
system("grep -E 'Cached|Buffers' /proc/meminfo");
printf("\n");
}
// Function to measure read time
double measure_read_time(const char* filename) {
int fd = open(filename, O_RDONLY);
if (fd == -1) {
perror("open failed");
return -1.0;
}
char buffer[BUFFER_SIZE];
size_t total_read = 0;
// Use a monotonic wall clock: clock() counts CPU time only and would
// miss the time this process spends blocked on disk I/O
struct timespec start;
clock_gettime(CLOCK_MONOTONIC, &start);
while (total_read < FILE_SIZE) {
ssize_t bytes_read = read(fd, buffer, BUFFER_SIZE);
if (bytes_read <= 0) break;
total_read += bytes_read;
}
struct timespec end;
clock_gettime(CLOCK_MONOTONIC, &end);
close(fd);
return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}
int main() {
printf("Demonstrating page cache behavior\n");
print_memory_info();
// Create test file
printf("Creating %d MB test file...\n", FILE_SIZE / (1024 * 1024));
int fd = open(TEST_FILE, O_WRONLY | O_CREAT | O_TRUNC, 0644);
if (fd == -1) {
perror("Failed to create test file");
return 1;
}
char buffer[BUFFER_SIZE];
memset(buffer, 'A', BUFFER_SIZE);
for (size_t written = 0; written < FILE_SIZE; written += BUFFER_SIZE) {
if (write(fd, buffer, BUFFER_SIZE) != BUFFER_SIZE) {
perror("write failed");
close(fd);
return 1;
}
}
close(fd);
// Clear caches to ensure first read isn't cached
printf("Dropping caches (may require sudo)...\n");
system("sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches' 2>/dev/null || echo 'Failed (no sudo)'");
print_memory_info();
// First read (cold cache)
printf("First read (cold cache)...\n");
double time1 = measure_read_time(TEST_FILE);
printf("Read time: %.3f seconds\n", time1);
print_memory_info();
// Second read (warm cache)
printf("Second read (warm cache)...\n");
double time2 = measure_read_time(TEST_FILE);
printf("Read time: %.3f seconds\n", time2);
printf("Speedup factor: %.2fx\n", time1 / time2);
print_memory_info();
// Try to force page cache reclamation under memory pressure
printf("\nSimulating memory pressure...\n");
printf("Allocating large memory block...\n");
// Allocate large chunk to create memory pressure
void* mem = malloc(FILE_SIZE * 2); // 2x file size
if (mem != NULL) {
// Touch pages to ensure allocation
for (size_t i = 0; i < FILE_SIZE * 2; i += 4096) {
((char*)mem)[i] = 1;
}
printf("Memory allocated and touched\n");
print_memory_info();
// Read file again under memory pressure
printf("Reading file under memory pressure...\n");
double time3 = measure_read_time(TEST_FILE);
printf("Read time under pressure: %.3f seconds\n", time3);
printf("Pressure vs. warm cache ratio: %.2fx\n", time3 / time2);
free(mem);
} else {
printf("Failed to allocate memory for pressure test\n");
}
// Clean up
unlink(TEST_FILE);
printf("Test file removed\n");
return 0;
}
Expected Output:
Demonstrating page cache behavior
Page cache status:
Cached: 2345678 kB
Buffers: 123456 kB
Creating 100 MB test file...
Dropping caches (may require sudo)...
Failed (no sudo)
Page cache status:
Cached: 2345678 kB
Buffers: 123456 kB
First read (cold cache)...
Read time: 0.245 seconds
Page cache status:
Cached: 2445678 kB
Buffers: 123456 kB
Second read (warm cache)...
Read time: 0.012 seconds
Speedup factor: 20.42x
Page cache status:
Cached: 2445678 kB
Buffers: 123456 kB
Simulating memory pressure...
Allocating large memory block...
Memory allocated and touched
Page cache status:
Cached: 2345678 kB
Buffers: 123456 kB
Reading file under memory pressure...
Read time under pressure: 0.198 seconds
Pressure vs. warm cache ratio: 16.50x
Test file removed
The actual values will vary significantly based on system configuration, load, and hardware.
Transparent Huge Pages Reclamation
Transparent Huge Pages (THP) complicate memory reclamation due to their size (typically 2MB versus 4KB for standard pages). When memory pressure occurs, the kernel may need to split huge pages into standard pages before reclamation, introducing additional overhead. The complementary background work of collapsing runs of standard pages back into huge pages is handled by the khugepaged daemon.
Demonstrating THP Impact
This program shows the impact of Transparent Huge Pages:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <time.h>
#define GB (1024 * 1024 * 1024UL)
#define MB (1024 * 1024UL)
// Function to check if THP is enabled
void check_thp_status() {
printf("THP Configuration:\n");
// Read current THP state
int fd = open("/sys/kernel/mm/transparent_hugepage/enabled", O_RDONLY);
if (fd >= 0) {
char buffer[256];
ssize_t n = read(fd, buffer, sizeof(buffer) - 1);
if (n > 0) {
buffer[n] = '\0';
printf(" enabled: %s", buffer);
}
close(fd);
} else {
printf(" enabled: [unable to read]\n");
}
// Read defrag state
fd = open("/sys/kernel/mm/transparent_hugepage/defrag", O_RDONLY);
if (fd >= 0) {
char buffer[256];
ssize_t n = read(fd, buffer, sizeof(buffer) - 1);
if (n > 0) {
buffer[n] = '\0';
printf(" defrag: %s", buffer);
}
close(fd);
}
// Check khugepaged status
printf("khugepaged stats:\n");
system("grep -A2 huge /proc/meminfo");
}
// Function to measure access time for a memory region
double measure_access_time(void* ptr, size_t size, size_t stride) {
// Ensure the memory is actually allocated by touching it first
for (size_t i = 0; i < size; i += stride) {
((char*)ptr)[i] = 1;
}
// Measure access time
clock_t start = clock();
for (size_t i = 0; i < size; i += stride) {
((char*)ptr)[i] += 1;
}
clock_t end = clock();
return (double)(end - start) / CLOCKS_PER_SEC;
}
int main() {
printf("Demonstrating Transparent Huge Page (THP) behavior\n\n");
check_thp_status();
// Allocate memory with and without THP hints
printf("\nAllocating 1GB with and without THP hints...\n");
// Regular allocation
printf("Regular mmap allocation:\n");
void* regular_mem = mmap(NULL, 1 * GB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (regular_mem == MAP_FAILED) {
perror("Regular mmap failed");
return 1;
}
// THP-friendly allocation with MADV_HUGEPAGE
printf("THP-friendly allocation (MADV_HUGEPAGE):\n");
void* thp_mem = mmap(NULL, 1 * GB, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (thp_mem == MAP_FAILED) {
perror("THP mmap failed");
munmap(regular_mem, 1 * GB);
return 1;
}
// Advise the kernel to use huge pages for this region
if (madvise(thp_mem, 1 * GB, MADV_HUGEPAGE) != 0) {
perror("madvise(MADV_HUGEPAGE) failed");
}
// Touch both memory regions to ensure allocation
printf("Touching memory to trigger allocation...\n");
double regular_time = measure_access_time(regular_mem, 1 * GB, 4096);
double thp_time = measure_access_time(thp_mem, 1 * GB, 4096);
printf("Regular memory initial access time: %.3f seconds\n", regular_time);
printf("THP memory initial access time: %.3f seconds\n", thp_time);
// Check memory stats after allocation
printf("\nMemory status after allocation:\n");
system("grep -A2 huge /proc/meminfo");
// Create memory pressure to force THP splitting
printf("\nCreating memory pressure to potentially split huge pages...\n");
void* pressure_mem = NULL;
for (size_t size = 512 * MB; size >= 64 * MB; size /= 2) {
pressure_mem = malloc(size);
if (pressure_mem) {
printf("Allocated %zu MB for pressure test\n", size / MB);
// Touch memory to ensure allocation
memset(pressure_mem, 1, size);
break;
}
}
if (pressure_mem) {
// Re-measure access times after pressure
double regular_time2 = measure_access_time(regular_mem, 1 * GB, 4096);
double thp_time2 = measure_access_time(thp_mem, 1 * GB, 4096);
printf("\nAfter memory pressure:\n");
printf("Regular memory access time: %.3f seconds (%.2f%% change)\n",
regular_time2, (regular_time2 / regular_time - 1) * 100);
printf("THP memory access time: %.3f seconds (%.2f%% change)\n",
thp_time2, (thp_time2 / thp_time - 1) * 100);
free(pressure_mem);
} else {
printf("Failed to allocate memory for pressure test\n");
}
// Check memory status after pressure
printf("\nMemory status after pressure:\n");
system("grep -A2 huge /proc/meminfo");
// Clean up
munmap(regular_mem, 1 * GB);
munmap(thp_mem, 1 * GB);
return 0;
}
Expected Output:
Demonstrating Transparent Huge Page (THP) behavior
THP Configuration:
enabled: [always] madvise never
defrag: always defer defer+madvise madvise never
khugepaged stats:
AnonHugePages: 57344 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
Allocating 1GB with and without THP hints...
Regular mmap allocation:
THP-friendly allocation (MADV_HUGEPAGE):
Touching memory to trigger allocation...
Regular memory initial access time: 0.247 seconds
THP memory initial access time: 0.198 seconds
Memory status after allocation:
AnonHugePages: 983040 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
Creating memory pressure to potentially split huge pages...
Allocated 512 MB for pressure test
After memory pressure:
Regular memory access time: 0.312 seconds (26.32% change)
THP memory access time: 0.205 seconds (3.54% change)
Memory status after pressure:
AnonHugePages: 589824 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
The output shows that under memory pressure, THP-enabled memory tends to maintain better performance as the kernel tries to preserve huge pages when possible.
Anonymous Memory and Swapping
Anonymous memory (memory not backed by files) requires special handling during reclamation since it must be written to swap space before being reclaimed. This process follows several steps:
- Swap candidate selection: The kernel identifies anonymous pages that haven't been accessed recently.
- Swap space allocation: The kernel allocates space on a swap device or file.
- Page writeout: The kernel writes the page contents to swap space.
- Page table update: The kernel marks the page as swapped out.
The efficiency of this process depends heavily on the configuration of the swap subsystem and the performance of the underlying storage devices.
Examining Swap Behavior
This program demonstrates swap activity under memory pressure:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <time.h>
#define MB (1024 * 1024)
// Function to print memory and swap usage
void print_memory_stats() {
printf("Memory usage:\n");
system("free -m");
printf("\nSwap details:\n");
system("cat /proc/swaps");
printf("\n");
}
// Function to allocate and access memory incrementally
void test_memory_pressure(size_t max_mb, size_t step_mb, int interval_sec) {
void** blocks = calloc(max_mb / step_mb + 1, sizeof(void*)); // +1: the loop may take one extra partial step when max_mb isn't a multiple of step_mb
if (!blocks) {
perror("Failed to allocate blocks array");
return;
}
printf("Starting memory pressure test (%zu MB, %zu MB steps, %d sec intervals)\n",
max_mb, step_mb, interval_sec);
print_memory_stats();
size_t allocated = 0;
unsigned int block_index = 0;
while (allocated < max_mb) {
// Allocate next block
size_t block_size = step_mb * MB;
void* block = malloc(block_size);
if (!block) {
printf("Failed to allocate block at %zu MB\n", allocated);
break;
}
// Touch memory to ensure it's actually allocated
memset(block, 1, block_size);
// Update tracking
blocks[block_index++] = block;
allocated += step_mb;
printf("\nAllocated and accessed %zu MB total\n", allocated);
print_memory_stats();
// Pause to allow swapping
sleep(interval_sec);
}
// Release memory in reverse order
printf("\nReleasing memory...\n");
while (block_index > 0) {
free(blocks[--block_index]);
allocated -= step_mb;
if (block_index % 5 == 0) {
printf("Released down to %zu MB\n", allocated);
print_memory_stats();
sleep(interval_sec);
}
}
free(blocks);
printf("\nTest completed\n");
print_memory_stats();
}
int main() {
// Adjust these parameters based on your system
size_t system_ram_mb = 0;
// Try to detect system RAM
FILE* meminfo = fopen("/proc/meminfo", "r");
if (meminfo) {
char line[256];
while (fgets(line, sizeof(line), meminfo)) {
unsigned long ram_kb;
if (sscanf(line, "MemTotal: %lu kB", &ram_kb) == 1) {
system_ram_mb = ram_kb / 1024;
break;
}
}
fclose(meminfo);
}
if (system_ram_mb == 0) {
printf("Could not detect system RAM, assuming 8 GB\n");
system_ram_mb = 8 * 1024;
}
printf("Detected system RAM: %zu MB\n", system_ram_mb);
printf("Swap configuration:\n");
system("swapon --show");
// Calculate test parameters
size_t max_mb = system_ram_mb / 2; // Test with half of RAM
size_t step_mb = max_mb / 20; // 20 steps
if (step_mb < 50) step_mb = 50; // Minimum 50 MB steps
int interval_sec = 2; // 2 second intervals
printf("\nSwappiness setting: ");
system("cat /proc/sys/vm/swappiness");
// Run the test
test_memory_pressure(max_mb, step_mb, interval_sec);
return 0;
}
Expected Output:
Detected system RAM: 8192 MB
Swap configuration:
NAME TYPE SIZE USED PRIO
/dev/sda2 partition 4G 0B -2
Swappiness setting: 60
Starting memory pressure test (4096 MB, 204 MB steps, 2 sec intervals)
Memory usage:
total used free shared buff/cache available
Mem: 7872 1824 4750 156 1297 5712
Swap: 4095 0 4095
Allocated and accessed 204 MB total
Memory usage:
total used free shared buff/cache available
Mem: 7872 2028 4546 156 1297 5508
Swap: 4095 0 4095
[intermediate steps omitted]
Allocated and accessed 3876 MB total
Memory usage:
total used free shared buff/cache available
Mem: 7872 5700 874 156 1297 1836
Swap: 4095 72 4023
Allocated and accessed 4080 MB total
Memory usage:
total used free shared buff/cache available
Mem: 7872 5904 670 156 1297 1632
Swap: 4095 226 3869
Releasing memory...
Released down to 3060 MB
Memory usage:
total used free shared buff/cache available
Mem: 7872 4884 1690 156 1297 2652
Swap: 4095 196 3899
[intermediate steps omitted]
Released down to 0 MB
Memory usage:
total used free shared buff/cache available
Mem: 7872 1824 4750 156 1297 5712
Swap: 4095 0 4095
Test completed
Memory usage:
total used free shared buff/cache available
Mem: 7872 1824 4750 156 1297 5712
Swap: 4095 0 4095
This output demonstrates how Linux starts using swap space as memory pressure increases, and how it reclaims pages from swap when memory is freed.
The kswapd Daemon and Direct Reclaim
Linux employs two primary mechanisms for memory reclamation:
- kswapd-based reclamation: A background daemon that scans memory when free memory falls below watermarks derived from vm.min_free_kbytes. It works asynchronously to maintain adequate free memory without disrupting foreground tasks.
- Direct reclaim: Triggered when a memory allocation cannot be satisfied immediately and the allocating process itself must reclaim memory before proceeding. This synchronous approach introduces latency but ensures forward progress.
The balance between these approaches is influenced by several parameters, including vm.swappiness (default 60), which controls the kernel's preference for swapping anonymous pages versus reclaiming page cache pages.
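Which of the two paths a system is leaning on can be read from /proc/vmstat. The sketch below compares the cumulative scan counters; the field names pgscan_kswapd and pgscan_direct are those used by recent kernels (older kernels split them per zone), so treat the exact names as an assumption:

#include <stdio.h>
#include <string.h>

// Minimal sketch: attribute page scans to background (kswapd) versus
// direct reclaim using /proc/vmstat counters. Field names assume a
// recent kernel; older kernels use per-zone variants.
int main(void) {
    FILE *f = fopen("/proc/vmstat", "r");
    if (!f) { perror("open /proc/vmstat"); return 1; }
    char key[64];
    unsigned long long val, kswapd = 0, direct = 0;
    while (fscanf(f, "%63s %llu", key, &val) == 2) {
        if (strcmp(key, "pgscan_kswapd") == 0) kswapd = val;
        else if (strcmp(key, "pgscan_direct") == 0) direct = val;
    }
    fclose(f);
    printf("pages scanned by kswapd:         %llu\n", kswapd);
    printf("pages scanned by direct reclaim: %llu\n", direct);
    // A direct-reclaim count growing faster than the kswapd count means
    // allocations are stalling inside the reclaim path.
    return 0;
}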
OOM Killer Mechanics
Scoring Algorithm and Process Selection
The OOM killer’s scoring algorithm evaluates all user processes using a formula that considers:
- Memory usage: Both resident set size (RSS) and virtual memory size
- Runtime: Processes running for shorter periods receive higher scores
- Nice value: Processes with higher nice values (lower priority) receive higher scores
- OOM adjustment: User-defined adjustments through
oom_score_adj
The formula approximately follows:
oom_score = (memory_usage_in_pages * 10) / (uptime_in_seconds * sqrt(sqrt(uptime_in_seconds)))
This is then modified by oom_score_adj, which ranges from -1000 (never kill) to +1000 (kill first).
When invoked, the OOM killer calculates scores for all eligible processes and terminates the one with the highest score.
OOM Killer Simulation
This program simulates OOM killer behavior by implementing a simplified scoring algorithm:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <dirent.h>
#include <ctype.h>
#include <math.h>
#include <time.h>
// Structure to hold process information
typedef struct {
int pid;
char name[256];
unsigned long rss; // Resident Set Size in KB
unsigned long vm_size; // Virtual Memory Size in KB
long long start_time; // Process start time in seconds since boot
int oom_score; // Current OOM score
int oom_score_adj; // OOM score adjustment
double calculated_score; // Our calculated score
} process_info_t;
// Function to get system uptime in seconds
double get_system_uptime() {
FILE* uptime_file = fopen("/proc/uptime", "r");
if (!uptime_file) {
perror("Failed to open /proc/uptime");
return 0;
}
double uptime;
if (fscanf(uptime_file, "%lf", &uptime) != 1) {
uptime = 0;
}
fclose(uptime_file);
return uptime;
}
// Function to read process info from /proc
int get_process_info(process_info_t* proc) {
char path[256];
FILE* file;
char line[256];
// Read process name from /proc/[pid]/comm
sprintf(path, "/proc/%d/comm", proc->pid);
file = fopen(path, "r");
if (!file) return 0;
if (fgets(proc->name, sizeof(proc->name), file)) {
// Remove trailing newline
size_t len = strlen(proc->name);
if (len > 0 && proc->name[len-1] == '\n') {
proc->name[len-1] = '\0';
}
}
fclose(file);
// Read memory info from /proc/[pid]/status
sprintf(path, "/proc/%d/status", proc->pid);
file = fopen(path, "r");
if (!file) return 0;
proc->rss = 0;
proc->vm_size = 0;
while (fgets(line, sizeof(line), file)) {
if (strncmp(line, "VmSize:", 7) == 0) {
proc->vm_size = strtoul(line + 7, NULL, 10);
} else if (strncmp(line, "VmRSS:", 6) == 0) {
proc->rss = strtoul(line + 6, NULL, 10);
}
}
fclose(file);
// Read OOM score
sprintf(path, "/proc/%d/oom_score", proc->pid);
file = fopen(path, "r");
if (file) {
if (fscanf(file, "%d", &proc->oom_score) != 1) {
proc->oom_score = 0;
}
fclose(file);
}
// Read OOM score adjustment
sprintf(path, "/proc/%d/oom_score_adj", proc->pid);
file = fopen(path, "r");
if (file) {
if (fscanf(file, "%d", &proc->oom_score_adj) != 1) {
proc->oom_score_adj = 0;
}
fclose(file);
}
// Read process start time
sprintf(path, "/proc/%d/stat", proc->pid);
file = fopen(path, "r");
if (file) {
// Parse the stat file (22nd field is starttime)
int i;
char *token;
char stat_content[1024];
if (fgets(stat_content, sizeof(stat_content), file)) {
token = strtok(stat_content, " ");
for (i = 1; i < 22 && token != NULL; i++) {
token = strtok(NULL, " ");
}
if (token != NULL) {
proc->start_time = strtoll(token, NULL, 10);
}
}
fclose(file);
}
return 1;
}
// Simplified OOM score calculation
double calculate_oom_score(process_info_t* proc, double system_uptime) {
// Convert start_time to seconds of runtime
double runtime = system_uptime - (proc->start_time / sysconf(_SC_CLK_TCK));
if (runtime < 1.0) runtime = 1.0; // Avoid division by zero
// Convert KB to pages (assuming 4KB pages)
double pages = proc->rss / 4.0;
// Simplified version of the formula
double score = (pages * 10.0) / (runtime * sqrt(sqrt(runtime)));
// Apply OOM score adjustment
score = score * (1000.0 - proc->oom_score_adj) / 1000.0;
return score;
}
// Comparison function for qsort
int compare_scores(const void* a, const void* b) {
process_info_t* p1 = (process_info_t*)a;
process_info_t* p2 = (process_info_t*)b;
if (p1->calculated_score > p2->calculated_score) return -1;
if (p1->calculated_score < p2->calculated_score) return 1;
return 0;
}
int main() {
DIR* proc_dir;
struct dirent* entry;
process_info_t processes[500]; // Adjust size as needed
int process_count = 0;
double system_uptime = get_system_uptime();
printf("OOM Killer Simulator\n");
printf("System uptime: %.2f seconds\n\n", system_uptime);
// Open /proc directory
proc_dir = opendir("/proc");
if (!proc_dir) {
perror("Failed to open /proc");
return 1;
}
// Scan for process directories
while ((entry = readdir(proc_dir)) != NULL && process_count < 500) {
// Check if entry is a directory and name is a number (PID)
if (entry->d_type == DT_DIR) {
char* endptr;
long pid = strtol(entry->d_name, &endptr, 10);
if (*endptr == '\0') { // Valid PID
processes[process_count].pid = (int)pid;
if (get_process_info(&processes[process_count])) {
// Calculate our score
processes[process_count].calculated_score =
calculate_oom_score(&processes[process_count], system_uptime);
process_count++;
}
}
}
}
closedir(proc_dir);
// Sort processes by calculated score
qsort(processes, process_count, sizeof(process_info_t), compare_scores);
// Print top 20 most likely victims
printf("Top 20 processes most likely to be killed by the OOM killer:\n");
printf("%-8s %-20s %-10s %-10s %-15s %-10s %-12s %-12s\n",
"PID", "Name", "RSS(MB)", "VM(MB)", "Runtime(s)", "OOM Score", "OOM Adj", "Calc Score");
printf("%-8s %-20s %-10s %-10s %-15s %-10s %-12s %-12s\n",
"--------", "--------------------", "----------", "----------", "---------------", "----------", "------------", "------------");
for (int i = 0; i < 20 && i < process_count; i++) {
double runtime = system_uptime - (processes[i].start_time / sysconf(_SC_CLK_TCK));
printf("%-8d %-20.20s %-10.1f %-10.1f %-15.1f %-10d %-12d %-12.1f\n",
processes[i].pid,
processes[i].name,
processes[i].rss / 1024.0, // Convert KB to MB
processes[i].vm_size / 1024.0, // Convert KB to MB
runtime,
processes[i].oom_score,
processes[i].oom_score_adj,
processes[i].calculated_score);
}
printf("\nNote: This is a simplified simulation. The actual Linux OOM killer\n");
printf("uses more complex heuristics and considers additional factors.\n");
return 0;
}
Expected Output:
OOM Killer Simulator
System uptime: 86400.25 seconds
Top 20 processes most likely to be killed by the OOM killer:
PID Name RSS(MB) VM(MB) Runtime(s) OOM Score OOM Adj Calc Score
-------- -------------------- ---------- ---------- --------------- ---------- ------------ ------------
23456 firefox 1250.5 3045.2 3600.5 541 0 432.1
12345 chrome 1150.2 2845.7 7200.3 498 0 321.5
34567 java 825.3 1567.4 21600.2 321 0 198.7
45678 node 678.9 1245.6 43200.1 256 0 145.2
56789 mysql 512.4 875.3 86400.2 189 -100 87.3
... [additional processes] ...
Note: This is a simplified simulation. The actual Linux OOM killer
uses more complex heuristics and considers additional factors.
This simulation helps visualize how the OOM killer scores processes and selects victims during memory exhaustion.
Badness Function Implementation
The kernel implements this logic in the oom_badness() function, which examines each process's memory consumption and applies various weighting factors. The function considers:
- Pages mapped in multiple locations (counted only once)
- Pages that can be easily reclaimed
- Process hierarchies (child processes contribute to parent scores)
- System criticality flags
The resulting score is cached but recalculated periodically to account for changing conditions.
Process Termination Sequence
When the OOM killer selects a victim, it follows a systematic termination procedure:
- Signal delivery: The process receives a SIGKILL signal.
- Memory release: The kernel begins reclaiming the process’s memory.
- Resource cleanup: Other resources (file descriptors, locks) are released.
- Parent notification: The parent process is notified via the standard signal mechanism.
- Logging: The event is logged to the kernel log with details about the selection criteria.
In some cases, multiple processes may need to be terminated before sufficient memory is reclaimed.
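Because the victim receives a plain SIGKILL, a supervising parent can detect an OOM kill with ordinary wait-status inspection; confirming that the kernel (rather than another process) sent the signal still requires checking the kernel log. A minimal sketch of the pattern, with a placeholder child standing in for a real worker:

#include <stdio.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void) {
    pid_t child = fork();
    if (child < 0) { perror("fork"); return 1; }
    if (child == 0) {
        // Placeholder worker: in practice this would be the
        // memory-hungry task the OOM killer might select.
        pause();
        _exit(0);
    }
    int status;
    if (waitpid(child, &status, 0) < 0) { perror("waitpid"); return 1; }
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL) {
        // SIGKILL is how the OOM killer terminates victims; check
        // dmesg for an "Out of memory" entry to confirm the source.
        fprintf(stderr, "child %d was SIGKILLed (possible OOM kill)\n",
                (int)child);
        // A real supervisor would typically restart the worker here.
    }
    return 0;
}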
OOM Control via Cgroups
Control groups (cgroups) provide fine-grained control over the OOM killer’s behavior. Each cgroup can specify:
- Memory limits (memory.limit_in_bytes)
- OOM control policy (memory.oom_control)
- Per-cgroup swap behavior (memory.swappiness)
This allows system administrators to isolate critical services from less important ones, ensuring that OOM events affect only non-essential components.
Demonstrating Cgroup Memory Limits
This script demonstrates how to create and use memory cgroups to control OOM behavior:
#!/bin/bash
# Note: This script requires root privileges to run
# Check if running as root
if [[ $EUID -ne 0 ]]; then
echo "This script must be run as root"
exit 1
fi
# Ensure cgroup filesystem is mounted
if ! grep -q "cgroup" /proc/mounts; then
echo "Cgroup filesystem not mounted. Please ensure cgroups are enabled."
exit 1
fi
# Create a test cgroup
CGROUP_PATH="/sys/fs/cgroup/memory/test-oom"
mkdir -p $CGROUP_PATH
# Create a simple C program for testing
cat > oom_test.c << 'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
size_t chunk_size = 10 * 1024 * 1024; // 10 MB chunks
int delay_ms = 100; // Delay between allocations in milliseconds
size_t limit_mb = 0; // 0 means unlimited
if (argc > 1) chunk_size = atol(argv[1]) * 1024 * 1024;
if (argc > 2) delay_ms = atoi(argv[2]);
if (argc > 3) limit_mb = atol(argv[3]);
printf("PID: %d\n", getpid());
printf("Allocating memory in %zu MB chunks every %d ms\n",
chunk_size / (1024 * 1024), delay_ms);
if (limit_mb) printf("Will stop after allocating %zu MB\n", limit_mb);
printf("Press Ctrl+C to stop\n\n");
size_t total_allocated = 0;
while (1) {
void *mem = malloc(chunk_size);
if (mem == NULL) {
printf("malloc failed after %zu MB\n", total_allocated / (1024 * 1024));
break;
}
// Touch memory to ensure it's allocated
memset(mem, 1, chunk_size);
total_allocated += chunk_size;
printf("Allocated %zu MB total\n", total_allocated / (1024 * 1024));
if (limit_mb > 0 && total_allocated >= limit_mb * 1024 * 1024) {
printf("Reached allocation limit of %zu MB\n", limit_mb);
break;
}
usleep(delay_ms * 1000);
}
printf("Allocation complete. Sleeping forever...\n");
while(1) sleep(60);
return 0;
}
EOF
# Compile the test program
gcc -o oom_test oom_test.c
# Configure the cgroup
echo "1" > $CGROUP_PATH/memory.oom_control # 1 = disable the OOM killer
echo "100M" > $CGROUP_PATH/memory.limit_in_bytes # 100 MB limit
echo "0" > $CGROUP_PATH/memory.swappiness # Disable swapping
echo "Created cgroup with the following settings:"
echo "Memory limit: $(cat $CGROUP_PATH/memory.limit_in_bytes)"
echo "OOM control: $(cat $CGROUP_PATH/memory.oom_control)"
echo "Swappiness: $(cat $CGROUP_PATH/memory.swappiness)"
# Now run our test program in the cgroup
echo "Running test program in cgroup..."
echo $$ > $CGROUP_PATH/cgroup.procs
# Start monitoring
(
while true; do
echo "Current memory usage: $(cat $CGROUP_PATH/memory.usage_in_bytes) bytes"
sleep 1
done
) &
MONITOR_PID=$!
# Run the test program with 10MB chunks, 500ms delay, no upper limit
# Bound the run: the test program sleeps forever once allocation stops,
# and with the OOM killer disabled it may simply stall at the limit
timeout 60 ./oom_test 10 500 0
# Clean up
kill $MONITOR_PID
rm -f oom_test oom_test.c
rmdir $CGROUP_PATH
echo "Test complete"
Expected Output:
Created cgroup with the following settings:
Memory limit: 104857600
OOM control: oom_kill_disable 1
Swappiness: 0
Running test program in cgroup...
PID: 12345
Allocating memory in 10 MB chunks every 500 ms
Press Ctrl+C to stop
Allocated 10 MB total
Current memory usage: 10485760 bytes
Allocated 20 MB total
Current memory usage: 20971520 bytes
...
Allocated 90 MB total
Current memory usage: 94371840 bytes
Allocated 100 MB total
malloc failed after 100 MB
Allocation complete.
Future Directions in Linux Memory Management
Advanced Swap Technologies
Several innovations are improving swap performance:
- zswap: Compressed cache for swap pages, reducing I/O
- zram: Compressed RAM-based swap device
- bcache: SSD-based caching for swap on slower devices
- swap on NVMe: Utilizing high-performance storage for swap
These technologies reduce the performance penalty of swapping, making it a more viable response to memory pressure.
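As one concrete check, whether zswap is active on a given machine can be read from sysfs; the sketch below assumes a kernel built with zswap support, which exposes the parameter path used here:

#include <stdio.h>

// Minimal sketch: report whether zswap (the compressed swap cache) is
// enabled. Assumes a kernel built with CONFIG_ZSWAP, which exposes
// /sys/module/zswap/parameters/enabled as "Y" or "N".
int main(void) {
    FILE *f = fopen("/sys/module/zswap/parameters/enabled", "r");
    if (!f) {
        perror("zswap parameter not available (kernel without zswap?)");
        return 1;
    }
    int c = fgetc(f);
    fclose(f);
    printf("zswap enabled: %c\n", c == 'Y' ? 'Y' : 'N');
    return 0;
}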
Persistent Memory Integration
Intel Optane and similar persistent memory technologies are changing memory management:
- DAX (Direct Access): Bypasses the page cache for persistent memory
- KMEM DAX: Allows using persistent memory as normal RAM
- Tiered memory: Automatic placement of data based on access patterns
These capabilities create a continuum between memory and storage, potentially eliminating traditional swap in favor of direct persistent memory access.
Machine Learning for Memory Management
Emerging research applies machine learning to memory management:
- Predictive page replacement: Learning application memory access patterns
- Intelligent OOM scoring: Using workload history to make better termination decisions
- Adaptive parameter tuning: Automatically adjusting parameters based on workload
While still primarily research topics, these approaches show promise for workloads with predictable memory usage patterns.
Enhanced Memory Pressure Signaling
Improvements in memory pressure detection include:
- PSI (Pressure Stall Information): Fine-grained memory pressure metrics
- cgroup v2 memory controller: Improved isolation and accounting
- Pre-OOM notification hooks: Allowing applications to respond to pressure
These mechanisms enable more proactive responses to memory pressure, potentially avoiding OOM situations entirely.
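PSI in particular is easy to consume from userspace: on kernels 4.20 and later built with CONFIG_PSI, /proc/pressure/memory reports the share of wall-clock time tasks spent stalled on memory. A minimal reader, assuming that file format:

#include <stdio.h>

// Minimal sketch: read memory pressure from PSI (kernels >= 4.20 built
// with CONFIG_PSI). "some" = share of time at least one task stalled
// on memory; "full" = share of time all non-idle tasks stalled.
int main(void) {
    FILE *f = fopen("/proc/pressure/memory", "r");
    if (!f) { perror("open /proc/pressure/memory (PSI enabled?)"); return 1; }
    char kind[8];
    float avg10, avg60, avg300;
    unsigned long long total; // cumulative stall time in microseconds
    while (fscanf(f, "%7s avg10=%f avg60=%f avg300=%f total=%llu",
                  kind, &avg10, &avg60, &avg300, &total) == 5) {
        printf("%-4s stalled %5.2f%% (10s) %5.2f%% (60s) %5.2f%% (300s)\n",
               kind, avg10, avg60, avg300);
    }
    fclose(f);
    return 0;
}

Sustained non-zero "full" pressure is a strong early-warning signal: it typically precedes OOM kills, which is why userspace OOM daemons poll these files to act before the kernel has to.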
Conclusion
Linux memory management during exhaustion represents a sophisticated balance between optimizing resource utilization and maintaining system stability. From initial allocation through reclamation, swapping, and potentially process termination, the kernel employs increasingly aggressive strategies to maintain functionality.
The OOM killer, while sometimes criticized for its choices, serves as a critical last line of defense against complete system failure. Modern improvements in kernel algorithms, hardware integration, and userspace tools continue to refine this system, making Linux increasingly resilient even under extreme memory pressure.
Understanding these mechanisms enables system administrators and developers to configure systems appropriately, design applications that behave well under memory pressure, and diagnose issues when memory exhaustion occurs despite preventive measures.